We’re pleased to present v3.2 of the spaCy Natural Language Processing library. Since v3.1 we’ve added usability improvements for custom training and scoring, improved performance on Apple M1 and Nvidia GPU hardware, and support for space-efficient vectors using floret, our new hash embedding extension to fastText.
The spaCy team has grown a lot this year, and we have lots of exciting features and examples coming up, including example projects for data augmentation and model distillation, more examples of transformer-based pipelines, and new components for coreference resolution and graph-based parsing.
Improved performance on Apple M1 with AppleOps
spaCy is now up to 8× faster on M1 Macs by calling into Apple’s native Accelerate library for matrix multiplication. For more details, check out thinc-apple-ops.
pip install spacy[apple]
Prediction speed of the `de_core_news_lg` pipeline on the M1, an Intel MacBook and an AMD Ryzen 5900X, with and without thinc-apple-ops.
CPU | BLIS (words/s) | thinc-apple-ops (words/s) | Package power (W) |
---|---|---|---|
Mac Mini (M1) | 6,492 | 27,676 | 5 |
MacBook Air Core i5 2020 | 9,790 | 10,983 | 9 |
AMD Ryzen 5900X | 22,568 | n/a | 52 |
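To get a rough sense of throughput on your own hardware, here is a minimal sketch along the same lines, assuming the `de_core_news_lg` package is installed and substituting your own sample texts:

```python
import time

import spacy

nlp = spacy.load("de_core_news_lg")  # assumes the package is installed
texts = ["Die Katze sitzt auf der Matte."] * 2_000  # placeholder sample

start = time.perf_counter()
docs = list(nlp.pipe(texts))
elapsed = time.perf_counter() - start

# Report throughput in words per second, as in the table above.
n_words = sum(len(doc) for doc in docs)
print(f"{n_words / elapsed:,.0f} words/s")
```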
Doc input for pipelines
`nlp` and `nlp.pipe` accept `Doc` input, skipping the tokenizer if a `Doc` is provided instead of a string. This makes it easier to create a `Doc` with custom tokenization or to set custom extensions before processing:
Process a Doc object

```python
from spacy.tokens import Doc

# Register the custom extension before assigning to it
Doc.set_extension("text_id", default=None)

doc = nlp.make_doc("This is text 500.")
doc._.text_id = 500
doc = nlp(doc)
```
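The same works for batches: pre-constructed `Doc` objects can be passed straight to `nlp.pipe`. Here is a minimal sketch, assuming the `en_core_web_sm` package is installed and using made-up word lists to stand in for custom tokenization:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")  # assumes this package is installed

# Build Docs with our own tokenization; the pipeline skips its tokenizer
# because it receives Docs rather than strings.
words = [["Hello", "world", "!"], ["Custom", "tokens", "here", "."]]
docs = [Doc(nlp.vocab, words=w) for w in words]

for doc in nlp.pipe(docs):
    print([(token.text, token.pos_) for token in doc])
```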
Registered scoring functions
To customize the scoring, you can specify a scoring function for each component in your config from the new `scorers` registry:
config.cfg (excerpt)

```ini
[components.tagger]
factory = "tagger"
scorer = {"@scorers": "spacy.tagger_scorer.v1"}
```
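On the Python side, you can register your own scoring function and reference it by name in the config. Here is a minimal sketch, where `my_tagger_scorer.v1` is a hypothetical name and the scorer simply delegates to spaCy’s built-in token-attribute scoring:

```python
from typing import Any, Dict, Iterable

import spacy
from spacy.scorer import Scorer
from spacy.training import Example

@spacy.registry.scorers("my_tagger_scorer.v1")
def make_my_tagger_scorer():
    # The registered function returns the callable that spaCy invokes
    # with a batch of Examples during evaluation.
    def score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
        return Scorer.score_token_attr(examples, "tag", **kwargs)
    return score
```

The component config would then reference it as `scorer = {"@scorers": "my_tagger_scorer.v1"}`.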
Support for floret vectors
We recently published floret, an extended version of fastText that combines fastText’s subwords with Bloom embeddings for compact, full-coverage vectors. The use of subwords means that there are no OOV words, and thanks to Bloom embeddings the vector table can be kept very small, at fewer than 100K entries. Bloom embeddings are already used by `HashEmbed` in `tok2vec` for compact spaCy models. For easy integration, floret includes a Python wrapper:
pip install floret
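Once installed, training toy vectors with the wrapper looks roughly like the sketch below. It is based on the floret README: the parameter and method names (in particular `save_vectors`) are assumptions to verify against the README for your floret version, and the file names are placeholders:

```python
import floret

# Train toy floret vectors on a plain-text corpus ("corpus.txt" is a
# placeholder). mode="floret" enables the Bloom-embedding hash table and
# bucket caps the vector table at 50K entries.
model = floret.train_unsupervised(
    "corpus.txt",
    mode="floret",
    hash_count=2,
    bucket=50_000,
    minn=4,
    maxn=5,
)
model.save_vectors("vectors.floret")  # method name per the floret README
```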
To get started, check out the pipelines/floret_vectors_demo project, which trains toy English floret vectors and imports them into a spaCy pipeline. For agglutinative languages like Finnish or Korean, there are large accuracy improvements due to the use of subwords (no OOV words!), with a vector table containing just 50K entries.
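To see the full-coverage property for yourself, you can probe a pipeline that uses floret vectors with a made-up word. In this sketch the path is a placeholder for a trained pipeline, such as the output of the demo project:

```python
import spacy

# Placeholder path to a pipeline trained with floret vectors, e.g. the
# output of the floret_vectors_demo project.
nlp = spacy.load("training/pipeline")

doc = nlp("floretization")  # a word the training data never saw
# With floret vectors, subword hashes always yield a vector, so even a
# novel word maps to a nonzero vector rather than an all-zero OOV vector.
print(any(doc[0].vector))
```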
Finnish example project with benchmarks
To try it out, clone the pipelines/floret_fi_core_demo project:
python -m spacy project clone pipelines/floret_fi_core_demo
The project trains Finnish UD+NER vectors and pipelines, comparing standard fastText and floret vectors. The default project settings use 1M tokenized training texts (2.6 GB) and 50K 300-dim vectors, with ~300K keys for the standard vectors:
Vectors | TAG | POS | DEP UAS | DEP LAS | NER F |
---|---|---|---|---|---|
none | 93.5 | 92.4 | 80.1 | 73.0 | 61.6 |
standard (pruned: 50K vectors for 300K keys) | 95.9 | 95.0 | 83.1 | 77.4 | 68.1 |
standard (unpruned: 300K vectors/keys) | 96.4 | 95.0 | 82.8 | 78.4 | 70.4 |
floret (minn 4, maxn 5; 50K vectors, no OOV) | 96.9 | 95.9 | 84.5 | 79.9 | 70.1 |
Results updated on Nov. 22, 2021 for floret v0.10.1.
Korean example project with benchmarks
To try it out, clone the pipelines/floret_ko_ud_demo project:
python -m spacy project clone pipelines/floret_ko_ud_demo
The project trains Korean UD vectors and pipelines, comparing standard fastText and floret vectors. The default project settings use 1M tokenized training texts (3.3 GB) and 50K 300-dim vectors, with ~800K keys for the standard vectors:
Vectors | TAG | POS | DEP UAS | DEP LAS |
---|---|---|---|---|
none | 72.5 | 85.3 | 74.0 | 65.0 |
standard (pruned: 50K vectors for 800K keys) | 77.3 | 89.1 | 78.2 | 72.2 |
standard (unpruned: 800K vectors/keys) | 79.0 | 90.3 | 79.4 | 73.9 |
floret (minn 2, maxn 3; 50K vectors, no OOV) | 82.8 | 94.1 | 83.5 | 80.5 |
Results updated on Nov. 22, 2021 for floret v0.10.1.
New transformer package for Japanese
spaCy v3.2 adds a new transformer pipeline package for Japanese, `ja_core_news_trf`, which uses the `basic` pretokenizer instead of `mecab` to limit the number of dependencies required for the pipeline. Thanks to Hiroshi Matsuda and the spaCy Japanese community for their contributions!
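As a quick smoke test, assuming the package has been downloaded:

```python
import spacy

# Assumes: python -m spacy download ja_core_news_trf
nlp = spacy.load("ja_core_news_trf")
doc = nlp("これはテストの文です。")
print([(token.text, token.pos_) for token in doc])
```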
New in the spaCy universe
The spaCy universe has seen some cool additions since the last release! Here’s a selection of new plugins and extensions you can install to add more power to your spaCy projects:
Package | Description |
---|---|
💬 spacy-clausie | Implementation of the ClausIE information extraction system |
🎨 ipymarkup | Collection of NLP visualizations for NER and syntax tree markup |
🌳 deplacy | Tree visualizer for Universal Dependencies and Immediate Catena Analysis |
The following packages have been updated with support for spaCy v3:
Package | Description |
---|---|
🕵️‍♂️ holmes | Information extraction from English and German based on predicate logic |
🌐 spaCyOpenTapioca | OpenTapioca wrapper for named entity linking on Wikidata |
🇩🇰 DaCy | State-of-the-art Danish NLP pipelines |