Introducing spaCy v3.3

We’re pleased to present v3.3 of the spaCy Natural Language Processing library. spaCy v3.3 improves the speed of nearly all statistical pipeline components, adds a trainable lemmatizer and includes new trained pipelines for Finnish, Korean and Swedish.

Speed improvements

spaCy v3.3 includes a range of optimizations that speed up all core pipeline components in both training and inference. For longer texts, prediction with the trained pipelines is 15% or more faster. Detailed benchmarks for en_core_web_md compare spaCy v3.2 and v3.3:

Speed Benchmarks: en_core_web_md

CPU	Avg. Words/Doc	v3.2 Words/Sec	v3.3 Words/Sec	Diff
Intel Xeon W-2265	100	17292	17441	0.86%
Intel Xeon W-2265	1000	15408	16024	4.00%
Intel Xeon W-2265	10000	12798	15346	19.91%
Apple M1	100	18272	18408	0.74%
Apple M1	1000	18794	19248	2.42%
Apple M1	10000	15144	17513	15.64%
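
The Diff column is the relative change in throughput from v3.2 to v3.3. As a quick check, the short sketch below recomputes it from the words-per-second figures in the table. The names `BENCHMARKS` and `percent_diff` are ours, chosen for illustration only.

```python
# Rows copied from the benchmark table:
# (CPU, avg. words per doc, v3.2 words/sec, v3.3 words/sec)
BENCHMARKS = [
    ("Intel Xeon W-2265", 100, 17292, 17441),
    ("Intel Xeon W-2265", 1000, 15408, 16024),
    ("Intel Xeon W-2265", 10000, 12798, 15346),
    ("Apple M1", 100, 18272, 18408),
    ("Apple M1", 1000, 18794, 19248),
    ("Apple M1", 10000, 15144, 17513),
]


def percent_diff(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


for cpu, words_per_doc, v32, v33 in BENCHMARKS:
    print(f"{cpu:<18} {words_per_doc:>6} {percent_diff(v32, v33):6.2f}%")
```

The gain grows with document length on both machines, which is why the headline figure refers to longer texts.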

Trainable lemmatizer

The new trainable lemmatizer component uses edit trees to transform tokens into lemmas. Try out the trainable lemmatizer with the training quickstart!
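
To give a feel for what an edit tree is, here is a minimal, standard-library sketch of the general idea. It is not spaCy's implementation: the names `Subst`, `Split`, `build_tree` and `apply_tree` are hypothetical, and details such as how the common substring is chosen are our assumptions. A tree keeps the longest substring shared by form and lemma and recursively records how to rewrite what comes before and after it.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Union


@dataclass(frozen=True)
class Subst:
    """Leaf: replace exactly `old` with `new`."""
    old: str
    new: str


@dataclass(frozen=True)
class Split:
    """Interior node: keep the middle, edit the prefix and the suffix."""
    prefix_len: int
    suffix_len: int
    left: "Tree"
    right: "Tree"


Tree = Union[Subst, Split]


def build_tree(form: str, lemma: str) -> Tree:
    """Build an edit tree that rewrites `form` into `lemma`."""
    matcher = SequenceMatcher(None, form, lemma, autojunk=False)
    m = matcher.find_longest_match(0, len(form), 0, len(lemma))
    if m.size == 0:
        # Nothing in common: fall back to a plain substitution.
        return Subst(form, lemma)
    end = m.a + m.size
    return Split(
        prefix_len=m.a,
        suffix_len=len(form) - end,
        left=build_tree(form[: m.a], lemma[: m.b]),
        right=build_tree(form[end:], lemma[m.b + m.size :]),
    )


def apply_tree(tree: Tree, form: str) -> Optional[str]:
    """Apply `tree` to `form`; return None if the tree does not fit."""
    if isinstance(tree, Subst):
        return tree.new if form == tree.old else None
    if len(form) < tree.prefix_len + tree.suffix_len:
        return None
    split = len(form) - tree.suffix_len
    left = apply_tree(tree.left, form[: tree.prefix_len])
    right = apply_tree(tree.right, form[split:])
    if left is None or right is None:
        return None
    return left + form[tree.prefix_len : split] + right


tree = build_tree("walked", "walk")
print(apply_tree(tree, "talked"))   # talk
print(apply_tree(tree, "running"))  # None: the "ed" suffix rule does not fit
print(build_tree("jumped", "jump") == tree)  # True: both pairs share one tree
```

Because many form/lemma pairs collapse to the same tree, a model can treat lemmatization as choosing one tree per token from a finite set seen in training, instead of generating the lemma character by character.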

displaCy for overlapping spans

displaCy can now visualize overlapping span annotations stored in Doc.spans.
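
A minimal sketch of the span visualizer follows, assuming spaCy v3.3 or later is installed. The sentence and labels are illustrative, and the spans are stored under the key "sc", which we assume is the key displaCy reads by default.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# A blank English pipeline is enough here: only the tokenizer is needed.
nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China.")

# Two overlapping spans: "Bank of China" (ORG) contains "China" (GPE).
doc.spans["sc"] = [
    Span(doc, 3, 6, "ORG"),
    Span(doc, 5, 6, "GPE"),
]

# jupyter=False makes render() return the HTML markup as a string.
html = displacy.render(doc, style="span", jupyter=False)
```

The returned markup can be written to a file and opened in a browser. To view it on a local web server instead, `displacy.serve(doc, style="span")` takes the same arguments.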

New trained pipelines

v3.3 introduces new CPU/CNN pipelines for Finnish, Korean and Swedish, which use the new trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.

Package	Language	UPOS	Parser LAS	NER F
`fi_core_news_sm`	Finnish	92.5	71.9	75.9
`fi_core_news_md`	Finnish	95.9	78.6	80.6
`fi_core_news_lg`	Finnish	96.2	79.4	82.4
`ko_core_news_sm`	Korean	86.1	65.6	71.3
`ko_core_news_md`	Korean	94.7	80.9	83.1
`ko_core_news_lg`	Korean	94.7	81.3	85.3
`sv_core_news_sm`	Swedish	95.0	75.9	74.7
`sv_core_news_md`	Swedish	96.3	78.5	79.3
`sv_core_news_lg`	Swedish	96.3	79.1	81.1
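
To illustrate why hashed subwords leave no out-of-vocabulary words, here is a toy sketch of the general idea. It is not floret's actual algorithm: the table size, n-gram range, hash function and averaging are all assumptions made for the example. Every word is broken into character n-grams, each n-gram is hashed into a few rows of a small fixed-size table, and the word vector is built from those rows, so any string maps to a vector.

```python
import hashlib
import random

# Hypothetical settings, picked only to keep the example small.
N_ROWS = 1000      # rows in the shared embedding table
DIM = 8            # vector width
N_HASHES = 2       # rows looked up per subword
MIN_N, MAX_N = 3, 5

_rng = random.Random(0)
TABLE = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(N_ROWS)]


def subwords(word):
    """The word with boundary markers, plus its character n-grams."""
    marked = f"<{word}>"
    grams = [marked]
    for n in range(MIN_N, MAX_N + 1):
        grams.extend(marked[i : i + n] for i in range(len(marked) - n + 1))
    return grams


def rows_for(gram):
    """Hash one subword into N_HASHES row indices of the table."""
    rows = []
    for seed in range(N_HASHES):
        digest = hashlib.md5(f"{seed}:{gram}".encode("utf8")).digest()
        rows.append(int.from_bytes(digest[:8], "little") % N_ROWS)
    return rows


def vector(word):
    """Average the table rows selected by all subwords of `word`."""
    grams = subwords(word)
    vec = [0.0] * DIM
    for gram in grams:
        for row in rows_for(gram):
            for d in range(DIM):
                vec[d] += TABLE[row][d]
    return [v / len(grams) for v in vec]


# A long compound never seen before still gets a vector of the same width.
print(len(vector("epäjärjestelmällisyys")))
```

The table size is fixed up front and does not grow with the vocabulary, which is what keeps vectors of this kind compact.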

Pipeline updates

The trained pipelines for the following languages switch from lookup or rule-based lemmatizers to the new trainable lemmatizer. Finnish, Korean and Swedish have no v3.2 score because their pipelines are new in v3.3:

Lemmatizer Accuracy (md Pipeline)

Language	v3.2	v3.3
Danish	84.9	94.8
Dutch	81.5	94.0
German	73.4	97.7
Greek	56.5	88.9
Finnish	-	86.2
Italian	86.6	97.2
Korean	-	90.0
Lithuanian	71.1	84.8
Norwegian Bokmål	76.7	97.1
Polish	87.1	93.7
Portuguese	76.7	96.9
Romanian	81.8	95.5
Swedish	-	95.5

New in the spaCy universe

Many cool new plugins, extensions, pipelines and tutorials have been added to the spaCy universe since v3.2:

Applied Language Technology course	NLP for newcomers using spaCy and Stanza.
Augmenty	A text augmentation library.
classy-classification	A Python library for classy few-shot and zero-shot classification within spaCy.
Concise Concepts	Concise Concepts uses few-shot NER based on word embedding similarity.
Crosslingual Coreference	Crosslingual coreference with an English coreference model plus crosslingual embeddings.
EDS-NLP	spaCy components to extract information from clinical notes written in French.
eng-spacysentiment	Sentiment analysis for English.
Healthsea	An end-to-end spaCy pipeline for exploring health supplement effects.
HuSpaCy	Industrial-strength Hungarian natural language processing.
Klayers	spaCy as an AWS Lambda Layer.
NER using spaCy	Named Entity Recognition with spaCy (video).
Scrubadub	Remove personally identifiable information from text using spaCy.
spacypdfreader	Easy PDF to text to spaCy text extraction.
spacy-setfit-textcat	Experiments with SetFit & Few-Shot Classification.
spacy-wrap	Wrap fine-tuned transformers in spaCy pipelines.
textnets	Text analysis with networks.
tmtoolkit	Text mining and topic modeling toolkit.