We’re pleased to present v3.2 of the spaCy Natural Language Processing library. Since v3.1 we’ve added usability improvements for custom training and scoring, improved performance on Apple M1 and Nvidia GPU hardware, and support for space-efficient vectors using floret, our new hash embedding extension to fastText.
The spaCy team has grown a lot this year, and we have lots of exciting features and examples coming up, including example projects for data augmentation and model distillation, more examples of transformer-based pipelines, and new components for coreference resolution and graph-based parsing.
Improved performance on Apple M1 with AppleOps
spaCy is now up to 8× faster on M1 Macs by calling into Apple’s native Accelerate library for matrix multiplication. For more details, check out thinc-apple-ops.
pip install spacy[apple]
Prediction speed of the `de_core_news_lg` pipeline on the M1, an Intel MacBook and an AMD Ryzen 5900X, with and without thinc-apple-ops.
CPU | BLIS (words/s) | thinc-apple-ops (words/s) | Package power (W) |
---|---|---|---|
Mac Mini (M1) | 6,492 | 27,676 | 5 |
MacBook Air Core i5 2020 | 9,790 | 10,983 | 9 |
AMD Ryzen 5900X | 22,568 | n/a | 52 |
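To get a rough sense of throughput on your own hardware, here is a minimal sketch along the same lines, assuming the `de_core_news_lg` package is installed and substituting your own sample texts:

```python
import time

import spacy

nlp = spacy.load("de_core_news_lg")  # assumes the package is installed
texts = ["Die Katze sitzt auf der Matte."] * 2_000  # placeholder sample

start = time.perf_counter()
docs = list(nlp.pipe(texts))
elapsed = time.perf_counter() - start

# Report throughput in words per second, as in the table above.
n_words = sum(len(doc) for doc in docs)
print(f"{n_words / elapsed:,.0f} words/s")
```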
Doc input for pipelines
`nlp` and `nlp.pipe` accept `Doc` input, skipping the tokenizer if a `Doc` is provided instead of a string. This makes it easier to create a `Doc` with custom tokenization or to set custom extensions before processing:
Process a Doc object

```python
from spacy.tokens import Doc

# Register the custom extension before assigning to it
Doc.set_extension("text_id", default=None)

doc = nlp.make_doc("This is text 500.")
doc._.text_id = 500
doc = nlp(doc)
```
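The same works for batches: pre-constructed `Doc` objects can be passed straight to `nlp.pipe`. Here is a minimal sketch, assuming the `en_core_web_sm` package is installed and using made-up word lists to stand in for custom tokenization:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")  # assumes this package is installed

# Build Docs with our own tokenization; the pipeline skips its tokenizer
# because it receives Docs rather than strings.
words = [["Hello", "world", "!"], ["Custom", "tokens", "here", "."]]
docs = [Doc(nlp.vocab, words=w) for w in words]

for doc in nlp.pipe(docs):
    print([(token.text, token.pos_) for token in doc])
```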
Registered scoring functions
To customize the scoring, you can specify a scoring function for each component in your config from the new `scorers` registry:
config.cfg (excerpt)

```ini
[components.tagger]
factory = "tagger"
scorer = {"@scorers": "spacy.tagger_scorer.v1"}
```
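On the Python side, you can register your own scoring function and reference it by name in the config. Here is a minimal sketch, where `my_tagger_scorer.v1` is a hypothetical name and the scorer simply delegates to spaCy’s built-in token-attribute scoring:

```python
from typing import Any, Dict, Iterable

import spacy
from spacy.scorer import Scorer
from spacy.training import Example

@spacy.registry.scorers("my_tagger_scorer.v1")
def make_my_tagger_scorer():
    # The registered function returns the callable that spaCy invokes
    # with a batch of Examples during evaluation.
    def score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
        return Scorer.score_token_attr(examples, "tag", **kwargs)
    return score
```

The component config would then reference it as `scorer = {"@scorers": "my_tagger_scorer.v1"}`.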
Support for floret vectors
We recently published floret, an extended version of fastText that combines fastText’s subwords with Bloom embeddings for compact, full-coverage vectors. The use of subwords means that there are no OOV words, and thanks to Bloom embeddings the vector table can be kept very small, at fewer than 100K entries. Bloom embeddings are already used by `HashEmbed` in `tok2vec` for compact spaCy models. For easy integration, floret includes a Python wrapper:
pip install floret
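Once installed, training toy vectors with the wrapper looks roughly like the sketch below. It is based on the floret README: the parameter and method names (in particular `save_vectors`) are assumptions to verify against the README for your floret version, and the file names are placeholders:

```python
import floret

# Train toy floret vectors on a plain-text corpus ("corpus.txt" is a
# placeholder). mode="floret" enables the Bloom-embedding hash table and
# bucket caps the vector table at 50K entries.
model = floret.train_unsupervised(
    "corpus.txt",
    mode="floret",
    hash_count=2,
    bucket=50_000,
    minn=4,
    maxn=5,
)
model.save_vectors("vectors.floret")  # method name per the floret README
```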
To get started, check out the pipelines/floret_vectors_demo project, which trains toy English floret vectors and imports them into a spaCy pipeline. For agglutinative languages like Finnish or Korean, there are large accuracy improvements due to the use of subwords (no OOV words!), with a vector table containing just 50K entries.
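To see the full-coverage property for yourself, you can probe a pipeline that uses floret vectors with a made-up word. In this sketch the path is a placeholder for a trained pipeline, such as the output of the demo project:

```python
import spacy

# Placeholder path to a pipeline trained with floret vectors, e.g. the
# output of the floret_vectors_demo project.
nlp = spacy.load("training/pipeline")

doc = nlp("floretization")  # a word the training data never saw
# With floret vectors, subword hashes always yield a vector, so even a
# novel word maps to a nonzero vector rather than an all-zero OOV vector.
print(any(doc[0].vector))
```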
Finnish example project with benchmarks
To try it out, clone the pipelines/floret_fi_core_demo project:
python -m spacy project clone pipelines/floret_fi_core_demo
The project trains Finnish UD+NER vectors and pipelines, comparing standard fastText and floret vectors. The default project settings use 1M tokenized training texts (2.6 GB) and 50K 300-dim vectors, with ~300K keys for the standard vectors:
Vectors | TAG | POS | DEP UAS | DEP LAS | NER F |
---|---|---|---|---|---|
none | 93.5 | 92.4 | 80.1 | 73.0 | 61.6 |
standard (pruned: 50K vectors for 300K keys) | 95.9 | 95.0 | 83.1 | 77.4 | 68.1 |
standard (unpruned: 300K vectors/keys) | 96.4 | 95.0 | 82.8 | 78.4 | 70.4 |
floret (minn 4, maxn 5; 50K vectors, no OOV) | 96.9 | 95.9 | 84.5 | 79.9 | 70.1 |
Results updated on Nov. 22, 2021 for floret v0.10.1.
Korean example project with benchmarks
To try it out, clone the pipelines/floret_ko_ud_demo project:
python -m spacy project clone pipelines/floret_ko_ud_demo
The project trains Korean UD vectors and pipelines, comparing standard fastText and floret vectors. The default project settings use 1M tokenized training texts (3.3 GB) and 50K 300-dim vectors, with ~800K keys for the standard vectors:
Vectors | TAG | POS | DEP UAS | DEP LAS |
---|---|---|---|---|
none | 72.5 | 85.3 | 74.0 | 65.0 |
standard (pruned: 50K vectors for 800K keys) | 77.3 | 89.1 | 78.2 | 72.2 |
standard (unpruned: 800K vectors/keys) | 79.0 | 90.3 | 79.4 | 73.9 |
floret (minn 2, maxn 3; 50K vectors, no OOV) | 82.8 | 94.1 | 83.5 | 80.5 |
Results updated on Nov. 22, 2021 for floret v0.10.1.
New transformer package for Japanese
spaCy v3.2 adds a new transformer pipeline package for Japanese, `ja_core_news_trf`, which uses the `basic` pretokenizer instead of `mecab` to limit the number of dependencies required for the pipeline. Thanks to Hiroshi Matsuda and the spaCy Japanese community for their contributions!
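As a quick smoke test, assuming the package has been downloaded:

```python
import spacy

# Assumes: python -m spacy download ja_core_news_trf
nlp = spacy.load("ja_core_news_trf")
doc = nlp("これはテストの文です。")
print([(token.text, token.pos_) for token in doc])
```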
New in the spaCy universe
The spaCy universe has seen some cool additions since the last release! Here’s a selection of new plugins and extensions you can install to add more power to your spaCy projects:
Package | Description |
---|---|
💬 spacy-clausie | Implementation of the ClausIE information extraction system |
🎨 ipymarkup | Collection of NLP visualizations for NER and syntax tree markup |
🌳 deplacy | Tree visualizer for Universal Dependencies and Immediate Catena Analysis |
The following packages have been updated with support for spaCy v3:
Package | Description |
---|---|
🕵️‍♂️ holmes | Information extraction from English and German based on predicate logic |
🌐 spaCyOpenTapioca | OpenTapioca wrapper for named entity linking on Wikidata |
🇩🇰 DaCy | State-of-the-art Danish NLP pipelines |