Introducing spaCy v3.2

by the spaCy team · ~6 min. read

We’re pleased to present v3.2 of the spaCy Natural Language Processing library. Since v3.1 we’ve added usability improvements for custom training and scoring, improved performance on Apple M1 and Nvidia GPU hardware, and support for space-efficient vectors using floret, our new hash embedding extension to fastText.

The spaCy team has gotten a lot bigger this year, and we’ve got lots of exciting features and examples coming up, including example projects for data augmentation and model distillation, more examples of transformer-based pipelines, and new components for coreference resolution and graph-based parsing.

Improve performance for spaCy on Apple M1 with AppleOps

spaCy is now up to 8× faster on M1 Macs by calling into Apple’s native Accelerate library for matrix multiplication. For more details, check out thinc-apple-ops.

pip install spacy[apple]

Benchmarks

Prediction speed of the de_core_news_lg pipeline on a Mac Mini (M1), an Intel MacBook Air and an AMD Ryzen 5900X, with and without thinc-apple-ops. Results are in words per second.

CPU                      | BLIS   | thinc-apple-ops | Package power (Watt)
Mac Mini (M1)            | 6,492  | 27,676          | 5
MacBook Air Core i5 2020 | 9,790  | 10,983          | 9
AMD Ryzen 5900X          | 22,568 | n/a             | 52
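
To read the table as ratios, here's a small sketch that derives the speedup from thinc-apple-ops and a rough throughput-per-watt figure. The numbers are copied from the table; the helper names are ours, and pairing the package power with the faster of the two configurations is an assumption, since the table doesn't say which run the power figure belongs to.

```python
# Words per second and package power, copied from the benchmark table.
benchmarks = {
    "Mac Mini (M1)": {"blis": 6492, "apple_ops": 27676, "watts": 5},
    "MacBook Air Core i5 2020": {"blis": 9790, "apple_ops": 10983, "watts": 9},
    "AMD Ryzen 5900X": {"blis": 22568, "apple_ops": None, "watts": 52},
}


def speedup(row):
    """Ratio of thinc-apple-ops to BLIS throughput, or None if not measured."""
    if row["apple_ops"] is None:
        return None
    return row["apple_ops"] / row["blis"]


def words_per_watt_second(row):
    """Best measured throughput divided by package power.

    Assumption: the power figure applies to the faster configuration.
    """
    best = max(v for v in (row["blis"], row["apple_ops"]) if v is not None)
    return best / row["watts"]


for name, row in benchmarks.items():
    ratio = speedup(row)
    ratio_text = "n/a" if ratio is None else f"{ratio:.2f}x"
    print(f"{name}: speedup {ratio_text}, "
          f"{words_per_watt_second(row):.0f} words per watt-second")
```

For this pipeline, the M1 ratio comes out at about 4.3×, and the Intel MacBook Air at about 1.1×.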

Doc input for pipelines

nlp and nlp.pipe accept Doc input, skipping the tokenizer if a Doc is provided instead of a string. This makes it easier to create a Doc with custom tokenization or to set custom extensions before processing:

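Process a Doc object

# Assumes the custom attribute was registered beforehand,
# e.g. with Doc.set_extension("text_id", default=None).
doc = nlp.make_doc("This is text 500.")
doc._.text_id = 500
doc = nlp(doc)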
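
For a version that runs end to end, here's a minimal sketch using a blank English pipeline. The choice of `spacy.blank("en")`, the example texts and the ID numbering are assumptions for illustration; any loaded pipeline should behave the same way.

```python
import spacy
from spacy.tokens import Doc

# Register the custom attribute once; force=True keeps the snippet re-runnable.
Doc.set_extension("text_id", default=None, force=True)

nlp = spacy.blank("en")  # assumption: a blank pipeline stands in for a trained one

# Set custom extensions before processing, then pass the Docs to nlp.pipe.
texts = ["This is text 500.", "This is text 501."]
docs = []
for text_id, text in enumerate(texts, start=500):
    doc = nlp.make_doc(text)
    doc._.text_id = text_id
    docs.append(doc)

processed = list(nlp.pipe(docs))
ids = [doc._.text_id for doc in processed]

# Custom tokenization: build the Doc from words, and the tokenizer is skipped.
custom = Doc(nlp.vocab, words=["This", "is", "pre-tokenized", "."])
custom = nlp(custom)
tokens = [token.text for token in custom]
```

The hyphenated word stays a single token in `tokens` because the pipeline keeps the tokenization of the Doc it was given.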

Registered scoring functions

To customize the scoring, you can specify a scoring function for each component in your config from the new scorers registry:

config.cfg (excerpt)

[components.tagger]
factory = "tagger"
scorer = {"@scorers":"spacy.tagger_scorer.v1"}
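
If you want to plug in your own scorer, a registered function might look like the sketch below. The name `custom_tagger_scorer.v1` and both function names are hypothetical; the scorer simply delegates to `Scorer.score_token_attr` as a stand-in for whatever custom logic you need.

```python
from spacy import registry
from spacy.scorer import Scorer


def custom_tagger_score(examples, **kwargs):
    """Score the fine-grained tag attribute (placeholder for custom logic)."""
    return Scorer.score_token_attr(examples, "tag", **kwargs)


# Hypothetical registry name. The registered function is a factory
# that returns the actual scoring callable.
@registry.scorers("custom_tagger_scorer.v1")
def make_custom_tagger_scorer():
    return custom_tagger_score


# Look the scorer up the same way the config system would.
scorer = registry.scorers.get("custom_tagger_scorer.v1")()
```

In the config, you'd then reference it as `scorer = {"@scorers":"custom_tagger_scorer.v1"}` and make the file with the registered function available at training time, for example via the `--code` argument of `spacy train`.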

Support for floret vectors

We recently published floret, an extended version of fastText that combines fastText’s subwords with Bloom embeddings for compact, full-coverage vectors. Because it uses subwords, there are no out-of-vocabulary (OOV) words, and thanks to Bloom embeddings, the vector table can be kept very small, at fewer than 100K entries. Bloom embeddings are already used by HashEmbed in tok2vec for compact spaCy models. For easy integration, floret includes a Python wrapper:

pip install floret
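
To make the idea of combining subwords with Bloom embeddings more concrete, here's a toy sketch in plain Python. It is not floret's implementation: the hash function (SHA-256 with a seed prefix), the averaging scheme, the random table and all function names are assumptions chosen to keep the example self-contained.

```python
import hashlib
import random


def char_ngrams(word, minn=4, maxn=5):
    """Return the bracketed word plus all of its character n-grams."""
    marked = f"<{word}>"
    grams = [marked]
    for n in range(minn, maxn + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams


def bloom_rows(key, n_rows, n_hashes=2):
    """Map a string to n_hashes row indices (stand-in hash, not floret's)."""
    rows = []
    for seed in range(n_hashes):
        digest = hashlib.sha256(f"{seed}:{key}".encode("utf8")).digest()
        rows.append(int.from_bytes(digest[:8], "little") % n_rows)
    return rows


def word_vector(word, table, minn=4, maxn=5, n_hashes=2):
    """Sum the rows picked for the word and its subwords, averaged per n-gram."""
    dim = len(table[0])
    total = [0.0] * dim
    grams = char_ngrams(word, minn, maxn)
    for gram in grams:
        for row in bloom_rows(gram, len(table), n_hashes):
            for j in range(dim):
                total[j] += table[row][j]
    return [value / len(grams) for value in total]


# Toy table: 1,000 rows of 4 dimensions, random instead of trained values.
rng = random.Random(0)
table = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(1000)]

# Any string gets a vector, because every n-gram hashes to some rows.
vec = word_vector("talossanikin", table)
```

Since every word and subword is hashed into the same fixed-size table, the table size is independent of the vocabulary size, which is what keeps the vectors compact.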

To get started, check out the pipelines/floret_vectors_demo project, which trains toy English floret vectors and imports them into a spaCy pipeline. For agglutinative languages like Finnish or Korean, accuracy improves substantially thanks to the use of subwords (no OOV words!), even with a vector table containing merely 50K entries.

Finnish example project with benchmarks

To try it out, clone the pipelines/floret_fi_core_demo project:

python -m spacy project clone pipelines/floret_fi_core_demo

Finnish UD+NER vector and pipeline training, comparing standard fastText vs. floret vectors. The results below use the default project settings: 1M (2.6G) tokenized training texts and 50K 300-dim vectors, with ~300K keys for the standard vectors.

Vectors                                      | TAG  | POS  | DEP UAS | DEP LAS | NER F
none                                         | 93.5 | 92.4 | 80.1    | 73.0    | 61.6
standard (pruned: 50K vectors for 300K keys) | 95.9 | 95.0 | 83.1    | 77.4    | 68.1
standard (unpruned: 300K vectors/keys)       | 96.4 | 95.0 | 82.8    | 78.4    | 70.4
floret (minn 4, maxn 5; 50K vectors, no OOV) | 96.9 | 95.9 | 84.5    | 79.9    | 70.1
Results updated on Nov. 22, 2021 for floret v0.10.1.

Korean example project with benchmarks

To try it out, clone the pipelines/floret_ko_ud_demo project:

python -m spacy project clone pipelines/floret_ko_ud_demo

Korean UD vector and pipeline training, comparing standard fastText vs. floret vectors. The results below use the default project settings: 1M (3.3G) tokenized training texts and 50K 300-dim vectors, with ~800K keys for the standard vectors.

Vectors                                      | TAG  | POS  | DEP UAS | DEP LAS
none                                         | 72.5 | 85.3 | 74.0    | 65.0
standard (pruned: 50K vectors for 800K keys) | 77.3 | 89.1 | 78.2    | 72.2
standard (unpruned: 800K vectors/keys)       | 79.0 | 90.3 | 79.4    | 73.9
floret (minn 2, maxn 3; 50K vectors, no OOV) | 82.8 | 94.1 | 83.5    | 80.5
Results updated on Nov. 22, 2021 for floret v0.10.1.
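
To get a feel for what the vector counts in the two benchmark tables mean in terms of size, here's a back-of-the-envelope sketch. It assumes the vectors are stored as 32-bit floats and ignores the key table and any file overhead, so treat the numbers as rough estimates rather than measured package sizes.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as 32-bit floats


def vector_table_megabytes(n_vectors, dim=300):
    """Approximate size of an n_vectors x dim float32 table in MB (10**6 bytes)."""
    return n_vectors * dim * BYTES_PER_FLOAT32 / 1_000_000


sizes = {
    "floret or pruned (50K vectors)": vector_table_megabytes(50_000),
    "Finnish unpruned (~300K vectors)": vector_table_megabytes(300_000),
    "Korean unpruned (~800K vectors)": vector_table_megabytes(800_000),
}

for name, megabytes in sizes.items():
    print(f"{name}: ~{megabytes:.0f} MB")
```

Under these assumptions, a 50K table is about 60 MB, compared to roughly 360 MB and 960 MB for the unpruned Finnish and Korean tables.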

New transformer package for Japanese

spaCy v3.2 adds a new transformer pipeline package for Japanese, ja_core_news_trf, which uses the basic pretokenizer instead of mecab to limit the number of dependencies required for the pipeline. Thanks to Hiroshi Matsuda and the spaCy Japanese community for their contributions!

New in the spaCy universe

The spaCy universe has seen some cool additions since the last release! Here’s a selection of new plugins and extensions you can install to add more power to your spaCy projects:

💬 spacy-clausie: Implementation of the ClausIE information extraction system
🎨 ipymarkup: Collection of NLP visualizations for NER and syntax tree markup
🌳 deplacy: Tree visualizer for Universal Dependencies and Immediate Catena Analysis

The following packages have been updated with support for spaCy v3:

🕵️‍♂️ holmes: Information extraction from English and German based on predicate logic
🌐 spaCyOpenTapioca: OpenTapioca wrapper for named entity linking on Wikidata
🇩🇰 DaCy: State of the Art Danish NLP pipelines


About the authors

  • Matthew Honnibal: CTO, Founder (Berlin, Germany)
  • Ines Montani: CEO, Founder (Berlin, Germany)
  • Sofie Van Landeghem: Lead Machine Learning Engineer, spaCy Core Developer (Ghent, Belgium)
  • Adriane Boyd: Machine Learning Engineer, spaCy Core Developer (Tübingen, Germany)
  • Paul O’Leary McCann: Machine Learning Engineer, spaCy Core Developer (Tokyo, Japan)
  • Daniël de Kok: Machine Learning Engineer, spaCy Core Developer (Groningen, Netherlands)
  • Duygu Altinok: Machine Learning Engineer, spaCy Core Developer (Berlin, Germany)
  • Edward Schmuhl: Machine Learning Engineer (Berlin, Germany)
  • Lj Miranda: Machine Learning Engineer (Manila, Philippines)
  • Philip Vollet: Community Success (Berlin, Germany)