Introducing spaCy v3.1

It’s been great to see the adoption of spaCy v3, which introduced transformer-based pipelines, a new config and training system, and many other features. Version 3.1 builds on this with the ability to use predicted annotations during training, a component for predicting arbitrary and overlapping spans, and new trained pipelines for Catalan and Danish.

For a full overview of what’s new in spaCy v3.1 and notes on upgrading, check out the release notes and usage guide. Here are some of the most relevant additions:

Using predicted annotations during training

By default, components are updated in isolation during training, which means they don’t see the predictions of any earlier components in the pipeline. The new [training.annotating_components] config setting lets you specify pipeline components that should set annotations on the predicted docs during training. This makes it easy to use the predictions of a previous component in the pipeline as features for a subsequent component, e.g. using the dependency labels predicted by the parser as features in the tagger:

config.cfg (excerpt)

[nlp]
pipeline = ["parser", "tagger"]
[components.tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tagger.model.tok2vec.encode.width}
attrs = ["NORM","DEP"]
rows = [5000,2500]
include_static_vectors = false
[training]
annotating_components = ["parser"]

For an end-to-end example of how to use the token.dep attribute predicted by the parser as a feature for a subsequent tagger component in the pipeline, check out this project template.
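The effect of annotating components can be pictured with a small plain-Python simulation (no spaCy required; the function and component names here are illustrative, not spaCy’s actual training internals): a downstream component only sees an upstream component’s predictions if that component is listed as annotating.

```python
def run_training_step(pipeline, annotating, doc):
    """One toy training step: every component predicts on the doc,
    but only components listed in `annotating` write their
    predictions back for later components to see."""
    seen = {}
    for name, predict in pipeline:
        seen[name] = set(doc)  # annotation keys visible to this component
        predictions = predict(doc)
        if name in annotating:
            doc.update(predictions)
    return seen

# A fake "parser" that predicts dependency labels, and a fake "tagger".
parser = ("parser", lambda doc: {"dep": ["nsubj", "ROOT"]})
tagger = ("tagger", lambda doc: {"tag": ["PRP", "VBD"]})

seen_default = run_training_step([parser, tagger], set(), {"words": ["She", "left"]})
seen_annotating = run_training_step([parser, tagger], {"parser"}, {"words": ["She", "left"]})

print("dep" in seen_default["tagger"])     # False: components updated in isolation
print("dep" in seen_annotating["tagger"])  # True: tagger can use predicted deps
```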

SpanCategorizer for predicting arbitrary and overlapping spans

A common task in applied NLP is extracting spans of text from documents, including longer phrases or nested expressions. Named entity recognition isn’t the right tool for this problem, since an entity recognizer typically predicts single token-based tags that are very sensitive to boundaries. This is effective for proper nouns and self-contained expressions, but less useful for other types of phrases or overlapping spans. The new experimental SpanCategorizer component and architecture let you label arbitrary and potentially overlapping spans of text.

The upcoming version of our annotation tool Prodigy (currently available as a pre-release for all users) will also feature a new workflow and UI for annotating overlapping and nested spans, which you can use to create training data for spaCy’s SpanCategorizer component.
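To see why one-tag-per-token entity recognition struggles with this, here is a minimal plain-Python sketch (no spaCy required): a single tag sequence can hold only one label per token, so overlapping spans collide, while span-level predictions coexist naturally.

```python
tokens = ["New", "York", "City", "Police", "Department"]
spans = [
    (0, 3, "GPE"),  # "New York City"
    (0, 5, "ORG"),  # "New York City Police Department" (nested/overlapping)
]

# An entity recognizer assigns exactly one tag per token, so it cannot
# represent both spans at once:
tags = [None] * len(tokens)
conflicts = []
for start, end, label in spans:
    for i in range(start, end):
        if tags[i] is not None:
            conflicts.append((i, tags[i], label))
        tags[i] = label

print(conflicts)  # tokens 0-2 already carry GPE when ORG is assigned

# A span categorizer instead scores candidate (start, end) spans directly,
# so overlapping spans are simply separate predictions:
span_predictions = {(0, 3): "GPE", (0, 5): "ORG"}
print(len(span_predictions))  # 2: both spans coexist
```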

Update the entity recognizer with partial incorrect annotations

The EntityRecognizer can now be updated with known incorrect annotations, which lets you take advantage of partial and sparse data. For example, you can use the information that certain spans of text are definitely not PERSON entities, without having to provide complete gold-standard annotations for the given example. The incorrect span annotations are added to Doc.spans in the training data, under the key defined as incorrect_spans_key in the component config.

Annotate incorrect spans

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
train_doc = nlp.make_doc("Barack Obama was born in Hawaii.")
# The doc.spans key can be defined in the config
train_doc.spans["incorrect_spans"] = [
    Span(train_doc, 0, 2, label="ORG"),
    Span(train_doc, 5, 6, label="PRODUCT"),
]

config.cfg (excerpt)

[components.ner]
factory = "ner"
incorrect_spans_key = "incorrect_spans"
moves = null
update_with_oracle_cut_size = 100
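The intuition behind this: a known-incorrect span rules out certain analyses without requiring you to supply the correct one. As a toy sketch in plain Python (illustrative only, not spaCy’s training internals), filtering candidate analyses against known-incorrect spans:

```python
# Candidate entity analyses for "Barack Obama was born in Hawaii.",
# keyed by (start, end) token offsets.
candidates = [
    {(0, 2): "PERSON", (5, 6): "GPE"},
    {(0, 2): "ORG",    (5, 6): "GPE"},
    {(0, 2): "PERSON", (5, 6): "PRODUCT"},
]

# Known-incorrect annotations: we may not know the right labels,
# but we know these label assignments are wrong.
incorrect = {(0, 2): "ORG", (5, 6): "PRODUCT"}

consistent = [
    c for c in candidates
    if all(c.get(span) != label for span, label in incorrect.items())
]
print(len(consistent))  # 1: only the fully consistent analysis remains
```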

New pipeline packages for Catalan and Danish

spaCy v3.1 adds 5 new pipeline packages, including a new core family for Catalan and a new transformer-based pipeline for Danish using the danish-bert-botxo weights. See the models directory for an overview of all available trained pipelines and the training guide for details on how to train your own.

Package          | Language | UPOS | Parser LAS | NER F
ca_core_news_sm  | Catalan  | 98.2 | 87.4       | 79.8
ca_core_news_md  | Catalan  | 98.3 | 88.2       | 84.0
ca_core_news_lg  | Catalan  | 98.5 | 88.4       | 84.2
ca_core_news_trf | Catalan  | 98.9 | 93.0       | 91.2
da_core_news_trf | Danish   | 98.0 | 85.0       | 82.9

Upload your pipelines to the Hugging Face Hub

The Hugging Face Hub lets you upload models and share them with others, and it now supports spaCy pipelines out-of-the-box. The spacy-huggingface-hub extension package automatically adds a command to your spacy CLI, lets you upload pipelines packaged with spacy package and takes care of auto-generating all required meta information.

Upload a trained pipeline to the hub

pip install spacy-huggingface-hub
huggingface-cli login
# Package your pipeline
python -m spacy package ./en_ner_fashion ./output --build wheel
cd ./output/en_ner_fashion-0.0.0/dist
# Upload it to the hub
python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl

After uploading, you’ll get a live URL for your model page, as well as a direct URL to the wheel file that you can install via pip install. You can also integrate the upload command into your project template to automatically upload your packaged pipelines after training.
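As a sketch of what that integration could look like, a custom command in your project.yml might wrap the push step (the command name, variables and paths here are illustrative and depend on your project layout):

```yaml
# project.yml (sketch)
commands:
  - name: "push_to_hub"
    help: "Upload the packaged pipeline to the Hugging Face Hub"
    script:
      - "python -m spacy huggingface-hub push ${vars.package_dir}/dist/${vars.wheel_name}"
    deps:
      - "${vars.package_dir}/dist/${vars.wheel_name}"
```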

View spaCy pipelines on the Hub

New in the spaCy universe

The spaCy universe has seen a lot of awesome additions since the last release! Here’s a selection of new plugins and extensions you can use to add more power to your spaCy projects:

🐭 skweak: Toolkit for weak supervision applied to NLP tasks
👯 coreferee: Coreference resolution for English, German and Polish
🐇 tokenwiser: Connect vowpal-wabbit & scikit-learn models to spaCy
🏺 hmrb: Python rule processing engine with readable syntax
🧮 numerizer: Convert natural language numerics into ints and floats
🌕 spikex: Pipeline components for knowledge extraction
📘 trunajod: Text complexity library for text analysis
🧠 emfdscore: Extended Moral Foundation Dictionary Scoring
📇 denomme: Extension for extracting multilingual names
💎 ruby-spacy: Wrapper to use spaCy in Ruby

The following packages have been updated with support for spaCy v3:

🌳 ruTS: Library for statistics extraction in Russian
🔍 spacy-dbpedia-spotlight: Use DBpedia Spotlight to link entities
✍️ contextualSpellCheck: Contextual spell correction using BERT
📚 spacy-wordnet: Pipeline component for WordNet and WordNet Domains
⛔️ negspacy: Negating concepts in text based on the NegEx algorithm
View the spaCy universe

We have a lot more planned for upcoming releases, so stay tuned! Some of our current work in progress includes a native component for coreference resolution, new ecosystem integrations, and end-to-end project templates for using PyTorch models to power spaCy components and for training pipelines with the new span categorizer.

Resources