Introducing spaCy v3.1

It’s been great to see the adoption of spaCy v3, which introduced transformer-based pipelines, a new config and training system and many other features. Version 3.1 adds more on top of it, including the ability to use predicted annotations during training, a component for predicting arbitrary and overlapping spans and new trained pipelines for Catalan and Danish.

For a full overview of what’s new in spaCy v3.1 and notes on upgrading, check out the release notes and usage guide. Here are some of the most relevant additions:

Using predicted annotations during training

By default, components are updated in isolation during training, which means that they don’t see the predictions of any earlier components in the pipeline. The new [training.annotating_components] config setting lets you specify pipeline components that should set annotations on the predicted docs during training. This makes it easy to use the predictions of a previous component in the pipeline as features for a subsequent component, e.g. the dependency labels in the tagger:

config.cfg (excerpt)
[nlp]
pipeline = ["parser", "tagger"]

[components.tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tagger.model.tok2vec.encode.width}
attrs = ["NORM","DEP"]
rows = [5000,2500]
include_static_vectors = false

[training]
annotating_components = ["parser"]

For an end-to-end example of how to use the token.dep attribute predicted by the parser as a feature for a subsequent tagger component in the pipeline, check out this project template.
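For the predicted annotations to be usable as features, the annotating component has to be part of the pipeline and run before the component that consumes its output. The following is a minimal standalone sketch of that sanity check, not how spaCy itself loads or validates configs: it assumes the list values in the excerpt can be read with Python's configparser and json modules, and the helper name check_annotating_components is hypothetical.

```python
import configparser
import json

# Trimmed copy of the excerpt above, keeping only the two relevant sections.
CONFIG_EXCERPT = """
[nlp]
pipeline = ["parser", "tagger"]

[training]
annotating_components = ["parser"]
"""


def check_annotating_components(config_text, consumer):
    """Return the annotating components that run before `consumer`.

    Hypothetical helper for illustration; spaCy uses its own config system.
    """
    # Interpolation is disabled because spaCy's ${...} variables are not
    # valid configparser interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    pipeline = json.loads(parser["nlp"]["pipeline"])
    annotating = json.loads(parser["training"]["annotating_components"])

    missing = [name for name in annotating if name not in pipeline]
    if missing:
        raise ValueError(f"annotating components not in pipeline: {missing}")

    consumer_index = pipeline.index(consumer)
    return [name for name in annotating if pipeline.index(name) < consumer_index]


print(check_annotating_components(CONFIG_EXCERPT, "tagger"))
```

If the pipeline order were reversed to ["tagger", "parser"], the function would return an empty list, signalling that the tagger would never see the parser's predictions.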

SpanCategorizer for predicting arbitrary and overlapping spans

A common task in applied NLP is extracting spans of text from documents, including longer phrases or nested expressions. Named entity recognition isn’t the right tool for this problem, since an entity recognizer typically predicts single token-based tags that are very sensitive to boundaries. This is effective for proper nouns and self-contained expressions, but less useful for other types of phrases or overlapping spans. The new experimental SpanCategorizer component and architecture let you label arbitrary and potentially overlapping spans of text.
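To see why one tag per token cannot express overlapping spans, consider the following pure-Python sketch. It uses a simple BIO tagging scheme as a stand-in for token-based tags in general and does not reproduce spaCy's entity recognizer or the SpanCategorizer; the function name to_bio and the example sentence are assumptions made for illustration.

```python
def to_bio(n_tokens, spans):
    """Encode (start, end, label) spans as one BIO tag per token.

    Raises ValueError if two spans share a token, because each token
    can carry only a single tag.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"span ({start}, {end}, {label!r}) overlaps another span")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = ["The", "University", "of", "California", "campus"]

# A single flat span fits into token-level tags.
flat = [(1, 4, "ORG")]
print(to_bio(len(tokens), flat))

# A nested span ("California" inside "University of California") does not.
nested = [(1, 4, "ORG"), (3, 4, "GPE")]
try:
    to_bio(len(tokens), nested)
except ValueError as err:
    print("cannot encode:", err)
```

A plain list of (start, end, label) spans, as in the nested variable, has no such restriction, which is the kind of representation a span-level component can work with.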

The upcoming version of our annotation tool Prodigy (currently available as a pre-release for all users) will also feature a new workflow and UI for annotating overlapping and nested spans, which you can use to create training data for spaCy’s SpanCategorizer component.

Update the entity recognizer with partial incorrect annotations

The EntityRecognizer can now be updated with known incorrect annotations, which lets you take advantage of partial and sparse data. For example, you’ll be able to use the information that certain spans of text are definitely not PERSON entities, without having to provide the complete gold-standard annotations for the given example. The incorrect span annotations can be added via Doc.spans in the training data, under the key defined as incorrect_spans_key in the component config.

Annotate incorrect spans
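import spacy
from spacy.tokens import Span

# Assumption: a blank English pipeline; any pipeline with a tokenizer works
nlp = spacy.blank("en")
train_doc = nlp.make_doc("Barack Obama was born in Hawaii.")
# The doc.spans key can be defined in the config
# Each span marks a label that is known to be wrong for those tokens
train_doc.spans["incorrect_spans"] = [
  Span(train_doc, 0, 2, label="ORG"),
  Span(train_doc, 5, 6, label="PRODUCT")
]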

config.cfg (excerpt)
[components.ner]
factory = "ner"
incorrect_spans_key = "incorrect_spans"
moves = null
update_with_oracle_cut_size = 100

New pipeline packages for Catalan and Danish

spaCy v3.1 adds 5 new pipeline packages, including a new core family for Catalan and a new transformer-based pipeline for Danish using the danish-bert-botxo weights. See the models directory for an overview of all available trained pipelines and the training guide for details on how to train your own.

Package	Language	UPOS	Parser LAS	NER F
`ca_core_news_sm`	Catalan	98.2	87.4	79.8
`ca_core_news_md`	Catalan	98.3	88.2	84.0
`ca_core_news_lg`	Catalan	98.5	88.4	84.2
`ca_core_news_trf`	Catalan	98.9	93.0	91.2
`da_core_news_trf`	Danish	98.0	85.0	82.9

Upload your pipelines to the Hugging Face Hub

The Hugging Face Hub lets you upload models and share them with others, and it now supports spaCy pipelines out of the box. The spacy-huggingface-hub extension package automatically adds the huggingface-hub command to your spacy CLI, lets you upload pipelines packaged with spacy package and takes care of auto-generating all required meta information.

Upload a trained pipeline to the hub
pip install spacy-huggingface-hub
huggingface-cli login
# Package your pipeline
python -m spacy package ./en_ner_fashion ./output --build wheel
cd ./output/en_ner_fashion-0.0.0/dist
# Upload it to the hub
python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl

After uploading, you’ll get a live URL for your model page, as well as a direct URL to the wheel file that you can install via pip install. You can also integrate the upload command into your project template to automatically upload your packaged pipelines after training.

New in the spaCy universe

The spaCy universe has seen a lot of awesome additions since the last release! Here’s a selection of new plugins and extensions you can use to add more power to your spaCy projects:


🐭 `skweak`	Toolkit for weak supervision applied to NLP tasks
👯 `coreferee`	Coreference resolution for English, German and Polish
🐇 `tokenwiser`	Connect vowpal-wabbit & scikit-learn models to spaCy
🏺 `hmrb`	Python rule processing engine with readable syntax
🧮 `numerizer`	Convert natural language numerics into ints and floats
🌕 `spikex`	Pipeline components for knowledge extraction
📘 `trunajod`	Text complexity library for text analysis
🧠 `emfdscore`	Extended Moral Foundation Dictionary Scoring
📇 `denomme`	Extension for extracting multilingual names
💎 `ruby-spacy`	Wrapper to use spaCy in Ruby

The following packages have been updated with support for spaCy v3:


🌳 `ruTS`	Library for statistics extraction in Russian
🔍 `spacy-dbpedia-spotlight`	Use DBpedia Spotlight to link entities
✍️ `contextualSpellCheck`	Contextual spell correction using BERT
📚 `spacy-wordnet`	Pipeline component for WordNet and WordNet Domains
⛔️ `negspacy`	Negating concepts in text based on the NegEx algorithm

We have a lot more planned for upcoming releases, so stay tuned! Our current work in progress includes a native component for coreference resolution, new ecosystem integrations, and end-to-end project templates for using PyTorch models to power spaCy components and for training pipelines with the new span categorizer.

Resources

spaCy v3.1: What’s new in v3.1
Release notes: Detailed overview
spaCy models directory: Download trained pipelines
spaCy universe: Projects, plugins and extensions
spaCy project templates: End-to-end NLP workflows
Video tutorials: More in-depth spaCy content on YouTube
