Introducing spaCy v3.1

By Matthew Honnibal, Ines Montani, Sofie Van Landeghem, Adriane Boyd & Paul O’Leary McCann

It’s been great to see the adoption of spaCy v3, which introduced transformer-based pipelines, a new config and training system and many other features. Version 3.1 adds more on top of it, including the ability to use predicted annotations during training, a component for predicting arbitrary and overlapping spans and new trained pipelines for Catalan and Danish.

For a full overview of what’s new in spaCy v3.1 and notes on upgrading, check out the release notes and usage guide. Here are some of the most relevant additions:

Using predicted annotations during training

By default, components are updated in isolation during training, which means that they don’t see the predictions of any earlier components in the pipeline. The new [training.annotating_components] config setting lets you specify pipeline components that should set annotations on the predicted docs during training. This makes it easy to use the predictions of a previous component in the pipeline as features for a subsequent component, e.g. the dependency labels in the tagger:

config.cfg (excerpt)
[nlp]
pipeline = ["parser", "tagger"]

[components.tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tagger.model.tok2vec.encode.width}
attrs = ["NORM","DEP"]
rows = [5000,2500]
include_static_vectors = false

[training]
annotating_components = ["parser"]

For an end-to-end example of how to use the token.dep attribute predicted by the parser as a feature for a subsequent tagger component in the pipeline, check out this project template.
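If you're writing your own training loop instead of using the config system, the same idea can be sketched in Python. The snippet below is a minimal illustration under a few assumptions: it passes the `annotates` argument to `nlp.update`, which to our understanding is what the training loop does with `annotating_components`. It also swaps in a rule-based sentencizer and a toy tagger with made-up training data, so it runs without a trained parser. The default tagger doesn't use sentence boundaries as features, so this only shows the mechanism, not a useful model.

```python
import spacy
from spacy.training import Example

# Illustrative pipeline (an assumption, not the parser -> tagger setup above):
# a rule-based sentencizer followed by a trainable tagger.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("tagger")

# Tiny made-up training set with one tag per token.
train_data = [
    ("I like cats.", ["PRON", "VERB", "NOUN", "PUNCT"]),
    ("She eats fish.", ["PRON", "VERB", "NOUN", "PUNCT"]),
]
examples = [
    Example.from_dict(nlp.make_doc(text), {"tags": tags})
    for text, tags in train_data
]

# Collect the tagger labels from the examples and set up the model.
nlp.initialize(lambda: examples)

# Components listed in `annotates` set their annotations on the predicted
# docs, so components later in the pipeline see them during the update.
losses = nlp.update(examples, annotates=["sentencizer"])
print(losses)
```

Only trainable components report a loss, so the printed dictionary should contain an entry for the tagger but none for the sentencizer.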

SpanCategorizer for predicting arbitrary and overlapping spans

A common task in applied NLP is extracting spans of text from documents, including longer phrases or nested expressions. Named entity recognition isn’t the right tool for this problem, since an entity recognizer typically predicts single token-based tags that are very sensitive to boundaries. This is effective for proper nouns and self-contained expressions, but less useful for other types of phrases or overlapping spans. The new experimental SpanCategorizer component and architecture let you label arbitrary and potentially overlapping spans of text.
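To make the difference from entity annotations concrete, here's a small sketch of what overlapping span annotations look like on a Doc. It assumes the component is added under its "spancat" factory name with the spans key "sc", and it sets the spans by hand on an unprocessed doc instead of predicting them, since the component here is untrained. The example sentence and labels are made up for illustration.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
# The component stores its spans in doc.spans under the configured key.
spancat = nlp.add_pipe("spancat", config={"spans_key": "sc"})

# Only tokenize here: the span categorizer is untrained and can't predict yet.
doc = nlp.make_doc("Welcome to the Bank of China.")

# Unlike doc.ents, a span group may contain nested and overlapping spans.
doc.spans[spancat.key] = [
    Span(doc, 3, 6, label="ORG"),  # "Bank of China"
    Span(doc, 5, 6, label="GPE"),  # "China", nested inside the ORG span
]

overlapping = [(span.text, span.label_) for span in doc.spans[spancat.key]]
print(overlapping)
```

Docs annotated like this can serve as the reference side of training examples for the span categorizer.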

The upcoming version of our annotation tool Prodigy (currently available as a pre-release for all users) will also feature a new workflow and UI for annotating overlapping and nested spans, which you can use to create training data for spaCy’s SpanCategorizer component.

Update the entity recognizer with partial incorrect annotations

The EntityRecognizer can now be updated with known incorrect annotations, which lets you take advantage of partial and sparse data. For example, you’ll be able to use the information that certain spans of text are definitely not PERSON entities, without having to provide the complete gold-standard annotations for the given example. The incorrect span annotations can be added via the Doc.spans in the training data under the key defined as incorrect_spans_key in the component config.

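Annotate incorrect spans
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # assumption: any pipeline works for creating the Doc
train_doc = nlp.make_doc("Barack Obama was born in Hawaii.")
# The doc.spans key can be defined in the config
train_doc.spans["incorrect_spans"] = [
    Span(train_doc, 0, 2, label="ORG"),
    Span(train_doc, 5, 6, label="PRODUCT"),
]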

config.cfg (excerpt)
[components.ner]
factory = "ner"
incorrect_spans_key = "incorrect_spans"
moves = null
update_with_oracle_cut_size = 100

New pipeline packages for Catalan and Danish

spaCy v3.1 adds 5 new pipeline packages, including a new core family for Catalan and a new transformer-based pipeline for Danish using the danish-bert-botxo weights. See the models directory for an overview of all available trained pipelines and the training guide for details on how to train your own.

Package             Language   UPOS   Parser LAS   NER F
ca_core_news_sm     Catalan    98.2   87.4         79.8
ca_core_news_md     Catalan    98.3   88.2         84.0
ca_core_news_lg     Catalan    98.5   88.4         84.2
ca_core_news_trf    Catalan    98.9   93.0         91.2
da_core_news_trf    Danish     98.0   85.0         82.9

Thanks to Carlos Rodríguez Penagos and the Barcelona Supercomputing Center for their contributions for Catalan and to Kenneth Enevoldsen for Danish. For additional Danish pipelines, check out DaCy.

Upload your pipelines to the Hugging Face Hub

The Hugging Face Hub lets you upload models and share them with others, and it now supports spaCy pipelines out of the box. The spacy-huggingface-hub extension package automatically adds the huggingface-hub command to your spacy CLI, lets you upload pipelines packaged with spacy package and takes care of auto-generating all required meta information.

Upload a trained pipeline to the hub
pip install spacy-huggingface-hub
huggingface-cli login
# Package your pipeline
python -m spacy package ./en_ner_fashion ./output --build wheel
cd ./output/en_ner_fashion-0.0.0/dist
# Upload it to the hub
python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl

After uploading, you’ll get a live URL for your model page, as well as a direct URL to the wheel file that you can install via pip install. You can also integrate the upload command into your project template to automatically upload your packaged pipelines after training.

New in the spaCy universe

The spaCy universe has seen a lot of awesome additions since the last release! Here’s a selection of new plugins and extensions you can use to add more power to your spaCy projects:

🐭 skweak: Toolkit for weak supervision applied to NLP tasks
👯 coreferee: Coreference resolution for English, German and Polish
🐇 tokenwiser: Connect vowpal-wabbit & scikit-learn models to spaCy
🏺 hmrb: Python rule processing engine with readable syntax
🧮 numerizer: Convert natural language numerics into ints and floats
🌕 spikex: Pipeline components for knowledge extraction
📘 trunajod: Text complexity library for text analysis
🧠 emfdscore: Extended Moral Foundation Dictionary Scoring
📇 denomme: Extension for extracting multilingual names
💎 ruby-spacy: Wrapper to use spaCy in Ruby

The following packages have been updated with support for spaCy v3:

🌳 ruTS: Library for statistics extraction in Russian
🔍 spacy-dbpedia-spotlight: Use DBpedia Spotlight to link entities
✍️ contextualSpellCheck: Contextual spell correction using BERT
📚 spacy-wordnet: Pipeline component for WordNet and WordNet Domains
⛔️ negspacy: Negating concepts in text based on the NegEx algorithm

We have a lot more planned for upcoming releases so stay tuned! Some of our current work in progress includes a native component for coreference resolution, new ecosystem integrations and end-to-end project templates for using PyTorch models to power spaCy components and training pipelines using the new span categorizer.

  • About the author

    Matthew Honnibal

    Matthew is a leading expert in AI technology. He completed his PhD in 2009, and spent a further 5 years publishing research on state-of-the-art NLP systems. He left academia in 2014 to write spaCy and found Explosion.

  • About the author

    Ines Montani

    Ines is a co-founder of Explosion and a core developer of the spaCy NLP library and the Prodigy annotation tool. She has helped set a new standard for user experience in developer tools for AI engineers and researchers.

  • About the author

    Sofie Van Landeghem

    Sofie is a machine learning and NLP engineer with over 13 years of experience. Her doctoral research focused on text mining for life-sciences, followed by further work in the pharmaceuticals and food industries after her PhD.

  • About the author

    Adriane Boyd

    Adriane is a computational linguist who has been engaged in research since 2005, completing her PhD in 2012. She has extensive experience in quality control for linguistic annotation, parsing, and NLP for non-standard language.

  • About the author

    Paul O’Leary McCann

    Paul has built a broad knowledge of data engineering, systems programming, linguistics and NLP since completing his master's in 2011. He has a particularly deep knowledge of Japanese NLP, and has led spaCy’s Japanese support since its release.