We’re pleased to present v3.3 of the spaCy Natural Language Processing library. spaCy v3.3 improves the speed of nearly all statistical pipeline components, adds a trainable lemmatizer and includes new trained pipelines for Finnish, Korean and Swedish.
Speed improvements
spaCy v3.3 includes a range of optimizations that speed up all core pipeline components in both training and inference. For longer texts, trained pipelines are 15% or more faster in prediction. Detailed benchmarks for en_core_web_md compare spaCy v3.2 and v3.3:
Speed Benchmarks: en_core_web_md
CPU | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec | Diff |
---|---|---|---|---|
Intel Xeon W-2265 | 100 | 17292 | 17441 | 0.86% |
Intel Xeon W-2265 | 1000 | 15408 | 16024 | 4.00% |
Intel Xeon W-2265 | 10000 | 12798 | 15346 | 19.91% |
Apple M1 | 100 | 18272 | 18408 | 0.74% |
Apple M1 | 1000 | 18794 | 19248 | 2.42% |
Apple M1 | 10000 | 15144 | 17513 | 15.64% |
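The Diff column appears to be the relative change in words per second from v3.2 to v3.3. The short sketch below recomputes it from the benchmark figures above under that assumption; the names `BENCHMARKS` and `relative_speedup` are illustrative and not part of spaCy.

```python
# Rows copied from the benchmark table:
# (CPU, avg. words per doc, v3.2 words/sec, v3.3 words/sec)
BENCHMARKS = [
    ("Intel Xeon W-2265", 100, 17292, 17441),
    ("Intel Xeon W-2265", 1000, 15408, 16024),
    ("Intel Xeon W-2265", 10000, 12798, 15346),
    ("Apple M1", 100, 18272, 18408),
    ("Apple M1", 1000, 18794, 19248),
    ("Apple M1", 10000, 15144, 17513),
]


def relative_speedup(old_wps: float, new_wps: float) -> float:
    """Percentage change in throughput going from old_wps to new_wps."""
    return (new_wps - old_wps) / old_wps * 100


# Assumption: Diff = (v3.3 - v3.2) / v3.2, expressed as a percentage.
diffs = [round(relative_speedup(old, new), 2) for _, _, old, new in BENCHMARKS]

for (cpu, words, _, _), diff in zip(BENCHMARKS, diffs):
    print(f"{cpu:<18} {words:>6} words/doc  {diff:6.2f}%")
```

The recomputed values match the table, and they show that the gain grows with document length on both CPUs.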
Trainable lemmatizer
The new trainable lemmatizer component uses edit trees to transform tokens into lemmas. Try out the trainable lemmatizer with the training quickstart!
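To give a rough idea of what an edit tree is, here is a simplified, standard-library-only sketch. It is not spaCy's implementation: the names `Subst`, `Match`, `build_tree` and `apply_tree` are hypothetical, and the sketch assumes the common formulation in which a tree is built around the longest common substring of a form and its lemma. In the actual component, a model predicts which learned tree to apply to each token.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Union


@dataclass(frozen=True)
class Subst:
    """Leaf node: replace exactly `orig` with `repl`."""
    orig: str
    repl: str


@dataclass(frozen=True)
class Match:
    """Interior node: keep the middle of the string unchanged and
    recurse on a prefix and a suffix of fixed lengths."""
    prefix_len: int
    suffix_len: int
    left: Optional["Node"]
    right: Optional["Node"]


Node = Union[Subst, Match]


def build_tree(form: str, lemma: str) -> Optional[Node]:
    """Build an edit tree that rewrites `form` into `lemma`."""
    if not form and not lemma:
        return None  # nothing to edit
    m = SequenceMatcher(None, form, lemma, autojunk=False).find_longest_match(
        0, len(form), 0, len(lemma)
    )
    if m.size == 0:
        return Subst(form, lemma)  # no shared substring: plain substitution
    end_a, end_b = m.a + m.size, m.b + m.size
    return Match(
        prefix_len=m.a,
        suffix_len=len(form) - end_a,
        left=build_tree(form[: m.a], lemma[: m.b]),
        right=build_tree(form[end_a:], lemma[end_b:]),
    )


def apply_tree(node: Optional[Node], form: str) -> Optional[str]:
    """Apply a tree to a new form; return None if the tree does not fit."""
    if node is None:
        return "" if form == "" else None
    if isinstance(node, Subst):
        return node.repl if form == node.orig else None
    if len(form) < node.prefix_len + node.suffix_len:
        return None
    split = len(form) - node.suffix_len
    left = apply_tree(node.left, form[: node.prefix_len])
    right = apply_tree(node.right, form[split:])
    if left is None or right is None:
        return None
    return left + form[node.prefix_len : split] + right


# A tree learned from one German participle generalizes to another.
tree = build_tree("gesagt", "sagen")
print(apply_tree(tree, "gemacht"))  # machen
print(apply_tree(tree, "laufen"))   # None: the tree does not apply
```

Because a tree stores only prefix and suffix edits and not the word stem, a tree learned from one word can lemmatize unseen words that follow the same inflection pattern.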
displaCy for overlapping spans
displaCy now supports overlapping span annotation from Doc.spans.
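A minimal sketch of rendering overlapping spans might look like the following. It assumes spaCy v3.3 or later is installed; the span key `"custom"`, the sample sentence and the labels are arbitrary choices for illustration.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# A blank English pipeline is enough: only the tokenizer is needed here.
nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China.")

# Two overlapping spans stored under an arbitrary key in Doc.spans.
doc.spans["custom"] = [
    Span(doc, 3, 6, "ORG"),  # "Bank of China"
    Span(doc, 5, 6, "GPE"),  # "China"
]

# style="span" selects the span visualizer; spans_key names the group to draw.
# Outside of Jupyter, render() returns the markup as a string.
html = displacy.render(doc, style="span", options={"spans_key": "custom"})
```

To view the result in a browser, `displacy.serve` accepts the same `style` and `options` arguments.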
New trained pipelines
v3.3 introduces new CPU/CNN pipelines for Finnish, Korean and Swedish, which use the new trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.
Package | Language | UPOS | Parser LAS | NER F |
---|---|---|---|---|
fi_core_news_sm | Finnish | 92.5 | 71.9 | 75.9 |
fi_core_news_md | Finnish | 95.9 | 78.6 | 80.6 |
fi_core_news_lg | Finnish | 96.2 | 79.4 | 82.4 |
ko_core_news_sm | Korean | 86.1 | 65.6 | 71.3 |
ko_core_news_md | Korean | 94.7 | 80.9 | 83.1 |
ko_core_news_lg | Korean | 94.7 | 81.3 | 85.3 |
sv_core_news_sm | Swedish | 95.0 | 75.9 | 74.7 |
sv_core_news_md | Swedish | 96.3 | 78.5 | 79.3 |
sv_core_news_lg | Swedish | 96.3 | 79.1 | 81.1 |
Pipeline updates
The trained pipelines for the following languages switch from lookup or rule-based lemmatizers to the new trainable lemmatizer:
Lemmatizer Accuracy (md Pipeline)
Language | v3.2 | v3.3 |
---|---|---|
Danish | 84.9 | 94.8 |
Dutch | 81.5 | 94.0 |
German | 73.4 | 97.7 |
Greek | 56.5 | 88.9 |
Finnish | - | 86.2 |
Italian | 86.6 | 97.2 |
Korean | - | 90.0 |
Lithuanian | 71.1 | 84.8 |
Norwegian Bokmål | 76.7 | 97.1 |
Polish | 87.1 | 93.7 |
Portuguese | 76.7 | 96.9 |
Romanian | 81.8 | 95.5 |
Swedish | - | 95.5 |
New in the spaCy universe
Many cool new plugins, extensions, pipelines and tutorials have been added to the spaCy universe since v3.2:
Applied Language Technology course | NLP for newcomers using spaCy and Stanza. |
Augmenty | A text augmentation library. |
classy-classification | A Python library for classy few-shot and zero-shot classification within spaCy. |
Concise Concepts | Concise Concepts uses few-shot NER based on word embedding similarity. |
Crosslingual Coreference | Crosslingual coreference with an English coreference model plus crosslingual embeddings. |
EDS-NLP | spaCy components to extract information from clinical notes written in French. |
eng-spacysentiment | Sentiment analysis for English. |
Healthsea | An end-to-end spaCy pipeline for exploring health supplement effects. |
HuSpaCy | Industrial-strength Hungarian natural language processing. |
Klayers | spaCy as an AWS Lambda Layer. |
NER using spaCy | Named Entity Recognition with spaCy (video). |
Scrubadub | Remove personally identifiable information from text using spaCy. |
spacypdfreader | Easy PDF to text to spaCy text extraction. |
spacy-setfit-textcat | Experiments with SetFit & Few-Shot Classification. |
spacy-wrap | Wrap fine-tuned transformers in spaCy pipelines. |
textnets | Text analysis with networks. |
tmtoolkit | Text mining and topic modeling toolkit. |