
Introducing spaCy v3.3

We’re pleased to present v3.3 of the spaCy Natural Language Processing library. spaCy v3.3 improves the speed of nearly all statistical pipeline components, adds a trainable lemmatizer and includes new trained pipelines for Finnish, Korean and Swedish.

Speed improvements

spaCy v3.3 includes a range of optimizations that speed up all core pipeline components in both training and inference. For longer texts, prediction with the trained pipelines is 15% or more faster. Detailed benchmarks for en_core_web_md compare spaCy v3.2 and v3.3:

Speed Benchmarks: en_core_web_md

CPU                  Avg. Words/Doc   v3.2 Words/Sec   v3.3 Words/Sec   Diff
Intel Xeon W-2265    100              17292            17441            0.86%
Intel Xeon W-2265    1000             15408            16024            4.00%
Intel Xeon W-2265    10000            12798            15346            19.91%
Apple M1             100              18272            18408            0.74%
Apple M1             1000             18794            19248            2.42%
Apple M1             10000            15144            17513            15.64%
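
The Diff column appears to be the relative change in words per second from v3.2 to v3.3. The following is a minimal sketch that recomputes it from the benchmark figures above; the helper name `percent_diff` is only illustrative and not part of spaCy.

```python
def percent_diff(old_wps, new_wps):
    """Relative speed change in percent, going from old to new words/sec."""
    return (new_wps - old_wps) / old_wps * 100


# (CPU, avg. words per doc, v3.2 words/sec, v3.3 words/sec) from the table above
benchmarks = [
    ("Intel Xeon W-2265", 100, 17292, 17441),
    ("Intel Xeon W-2265", 1000, 15408, 16024),
    ("Intel Xeon W-2265", 10000, 12798, 15346),
    ("Apple M1", 100, 18272, 18408),
    ("Apple M1", 1000, 18794, 19248),
    ("Apple M1", 10000, 15144, 17513),
]

for cpu, words_per_doc, v32, v33 in benchmarks:
    print(f"{cpu:<18} {words_per_doc:>6} {percent_diff(v32, v33):6.2f}%")
```

The gain grows with document length on both CPUs, which matches the statement that longer texts benefit the most.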

Trainable lemmatizer

The new trainable lemmatizer component uses edit trees to transform tokens into lemmas. Try out the trainable lemmatizer with the training quickstart!
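
To give a rough intuition for how an edit tree can turn a token into a lemma, here is a simplified, standard-library-only sketch. It is not spaCy's implementation: the tuple layout and the helpers `build_tree` and `apply_tree` are assumptions made for illustration. The idea is to keep the longest substring shared by form and lemma, then recursively describe how the remaining prefix and suffix are rewritten.

```python
from difflib import SequenceMatcher


def build_tree(form, lemma):
    """Build a toy edit tree that rewrites `form` into `lemma`."""
    match = SequenceMatcher(None, form, lemma, autojunk=False).find_longest_match(
        0, len(form), 0, len(lemma)
    )
    if match.size == 0:
        # Nothing in common: a leaf that substitutes `form` with `lemma`.
        return ("sub", form, lemma)
    prefix_len = match.a
    suffix_len = len(form) - (match.a + match.size)
    return (
        "match",
        prefix_len,
        suffix_len,
        build_tree(form[: match.a], lemma[: match.b]),
        build_tree(form[match.a + match.size :], lemma[match.b + match.size :]),
    )


def apply_tree(tree, form):
    """Apply a toy edit tree to `form`; return None if it does not fit."""
    if tree[0] == "sub":
        _, original, replacement = tree
        return replacement if form == original else None
    _, prefix_len, suffix_len, left, right = tree
    if prefix_len + suffix_len > len(form):
        return None
    middle_end = len(form) - suffix_len
    left_out = apply_tree(left, form[:prefix_len])
    right_out = apply_tree(right, form[middle_end:])
    if left_out is None or right_out is None:
        return None
    # The middle part is copied unchanged, so the tree generalizes to new words.
    return left_out + form[prefix_len:middle_end] + right_out


tree = build_tree("walked", "walk")
print(apply_tree(tree, "talked"))
print(apply_tree(tree, "running"))
```

Because the tree stores only prefix and suffix lengths plus the edits at the edges, a tree learned from "walked" also applies to "talked", while it is rejected for "running". A trainable component can then treat lemmatization as predicting which tree to apply to each token.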

displaCy for overlapping spans

displaCy now supports overlapping span annotation from Doc.spans:

[Figure: displaCy for overlapping spans]
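
A minimal sketch of rendering overlapping spans, assuming spaCy v3.3 or later is installed. The spans key `"custom"`, the example text and the labels are arbitrary choices for illustration; a blank English pipeline is used so that no trained package is needed.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only, no trained components required
doc = nlp("Welcome to the Bank of China.")

# Two overlapping spans stored under an arbitrary key in Doc.spans:
# tokens 3-5 ("Bank of China") and token 5 ("China").
doc.spans["custom"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]

# style="span" selects the span visualizer; spans_key picks the span group.
# jupyter=False makes render() return the markup as a string.
html = displacy.render(
    doc, style="span", options={"spans_key": "custom"}, jupyter=False
)
print(html[:200])
```

In a notebook, leaving out `jupyter=False` should display the visualization inline instead of returning the markup.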

New trained pipelines

v3.3 introduces new CPU/CNN pipelines for Finnish, Korean and Swedish, which use the new trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.
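
The following toy sketch illustrates why hashing subwords into a small table leaves no out-of-vocabulary words. It is not floret's or spaCy's actual code: the table size, vector width, hash function and n-gram range are hypothetical, and the rows are derived from a digest rather than trained.

```python
import hashlib

N_ROWS = 1000  # hypothetical table size, far smaller than a full vocabulary
DIM = 4        # hypothetical vector width
N_HASHES = 2   # each subword is mapped to this many rows


def subwords(word, n_min=3, n_max=4):
    """The padded word itself plus its character n-grams."""
    padded = f"<{word}>"
    grams = [padded]
    for n in range(n_min, n_max + 1):
        grams.extend(padded[i : i + n] for i in range(len(padded) - n + 1))
    return grams


def rows_for(gram):
    """Map one subword to N_HASHES row indices using seeded hashes."""
    rows = []
    for seed in range(N_HASHES):
        digest = hashlib.sha256(f"{seed}:{gram}".encode("utf8")).digest()
        rows.append(int.from_bytes(digest[:8], "big") % N_ROWS)
    return rows


def row_vector(row):
    """Deterministic stand-in for a trained table row."""
    digest = hashlib.sha256(f"row-{row}".encode("utf8")).digest()
    return [digest[i] / 255.0 - 0.5 for i in range(DIM)]


def vector(word):
    """Average the rows selected by all subwords of `word`."""
    total = [0.0] * DIM
    grams = subwords(word)
    for gram in grams:
        for row in rows_for(gram):
            for i, value in enumerate(row_vector(row)):
                total[i] += value
    count = len(grams) * N_HASHES
    return [value / count for value in total]


# Any string gets a vector, including words never seen before.
print(vector("kissa"))
print(vector("kissoillansakaan"))
```

The table has a fixed number of rows regardless of vocabulary size, which is what keeps the vectors compact, and every word decomposes into subwords that always hash to some row.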

Package           Language   UPOS   Parser LAS   NER F
fi_core_news_sm   Finnish    92.5   71.9         75.9
fi_core_news_md   Finnish    95.9   78.6         80.6
fi_core_news_lg   Finnish    96.2   79.4         82.4
ko_core_news_sm   Korean     86.1   65.6         71.3
ko_core_news_md   Korean     94.7   80.9         83.1
ko_core_news_lg   Korean     94.7   81.3         85.3
sv_core_news_sm   Swedish    95.0   75.9         74.7
sv_core_news_md   Swedish    96.3   78.5         79.3
sv_core_news_lg   Swedish    96.3   79.1         81.1

Pipeline updates

The trained pipelines for the following languages switch from lookup or rule-based lemmatizers to the new trainable lemmatizer:

Lemmatizer Accuracy (md Pipeline)

Language           v3.2   v3.3
Danish             84.9   94.8
Dutch              81.5   94.0
German             73.4   97.7
Greek              56.5   88.9
Finnish            -      86.2
Italian            86.6   97.2
Korean             -      90.0
Lithuanian         71.1   84.8
Norwegian Bokmål   76.7   97.1
Polish             87.1   93.7
Portuguese         76.7   96.9
Romanian           81.8   95.5
Swedish            -      95.5

New in the spaCy universe

Many cool new plugins, extensions, pipelines and tutorials have been added to the spaCy universe since v3.2:

Applied Language Technology course: NLP for newcomers using spaCy and Stanza.
Augmenty: A text augmentation library.
classy-classification: A Python library for classy few-shot and zero-shot classification within spaCy.
Concise Concepts: Concise Concepts uses few-shot NER based on word embedding similarity.
Crosslingual Coreference: Crosslingual coreference with an English coreference model plus crosslingual embeddings.
EDS-NLP: spaCy components to extract information from clinical notes written in French.
eng-spacysentiment: Sentiment analysis for English.
Healthsea: An end-to-end spaCy pipeline for exploring health supplement effects.
HuSpaCy: Industrial-strength Hungarian natural language processing.
Klayers: spaCy as an AWS Lambda Layer.
NER using spaCy: Named Entity Recognition with spaCy (video).
Scrubadub: Remove personally identifiable information from text using spaCy.
spacypdfreader: Easy PDF to text to spaCy text extraction.
spacy-setfit-textcat: Experiments with SetFit & Few-Shot Classification.
spacy-wrap: Wrap fine-tuned transformers in spaCy pipelines.
textnets: Text analysis with networks.
tmtoolkit: Text mining and topic modeling toolkit.

Resources