It’s been great to see the adoption of spaCy v3, which introduced transformer-based pipelines, a new config and training system and many other features. Version 3.1 adds more on top of it, including the ability to use predicted annotations during training, a component for predicting arbitrary and overlapping spans and new trained pipelines for Catalan and Danish.
For a full overview of what’s new in spaCy v3.1 and notes on upgrading, check out the release notes and usage guide. Here are some of the most relevant additions:
Using predicted annotations during training
By default, components are updated in isolation during training, which means
that they don’t see the predictions of any earlier components in the pipeline.
The new
[training.annotating_components]
config setting lets you specify pipeline components that should set annotations
on the predicted docs during training. This makes it easy to use the predictions
of a previous component in the pipeline as features for a subsequent component,
e.g. the dependency labels in the tagger:
config.cfg (excerpt)
```ini
[nlp]
pipeline = ["parser", "tagger"]

[components.tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tagger.model.tok2vec.encode.width}
attrs = ["NORM", "DEP"]
rows = [5000, 2500]
include_static_vectors = false

[training]
annotating_components = ["parser"]
```
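The same behavior is available in custom training loops via the annotates argument of Language.update. Here's a minimal sketch using en_core_web_sm as a stand-in pipeline; the example sentence and tags are made up for illustration, and note that the config above deliberately places the parser before the tagger:

```python
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")  # stand-in pipeline for illustration
doc = nlp.make_doc("This is a sentence.")
# Gold tags for this toy example; other annotations can stay missing
example = Example.from_dict(doc, {"tags": ["DT", "VBZ", "DT", "NN", "."]})

losses = {}
# annotates=["parser"] makes the parser write its predictions to the docs
# during the update, so components after it in the pipeline can use them
nlp.update([example], losses=losses, annotates=["parser"])
print(losses)
```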
For an end-to-end example of how to use the token.dep
attribute predicted by
the parser as a feature for a subsequent tagger component in the pipeline, check
out
this project template.
SpanCategorizer for predicting arbitrary and overlapping spans
A common task in applied NLP is extracting spans of text from documents, including longer phrases or nested expressions. Named
entity recognition isn’t the right tool for this problem, since an entity
recognizer typically predicts single token-based tags that are very sensitive to
boundaries. This is effective for proper nouns and self-contained expressions,
but less useful for other types of phrases or overlapping spans. The new
experimental SpanCategorizer
component
and architecture let you
label arbitrary and potentially overlapping spans of text.
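As a quick illustration of the API, here's a minimal sketch; the spans_key value shown is the component's default and the label is hypothetical, and the component needs training before it predicts anything useful:

```python
import spacy

nlp = spacy.blank("en")
# "spancat" is the experimental span categorizer; spans_key controls
# which doc.spans group the predictions are written to
spancat = nlp.add_pipe("spancat", config={"spans_key": "sc"})
spancat.add_label("TOPIC")  # hypothetical label for illustration
nlp.initialize()            # random weights; real use requires training

doc = nlp("Natural language processing with spaCy")
# Unlike doc.ents, the spans in a group may overlap or be nested
for span in doc.spans["sc"]:
    print(span.text, span.label_)
```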
The upcoming version of our annotation tool Prodigy
(currently available as a pre-release for all
users) will also feature a
new workflow and UI for annotating
overlapping and nested spans, which you can use to create training data for
spaCy’s SpanCategorizer
component.
Update the entity recognizer with partial incorrect annotations
The EntityRecognizer
can now be
updated with known incorrect annotations, which lets you take advantage of
partial and sparse data. For example, you can use the information that certain spans of text are definitely not PERSON entities, without having to provide the complete gold-standard annotations for the given example. The
incorrect span annotations can be added to Doc.spans in the training data, under the key defined as incorrect_spans_key in the component config.
Annotate incorrect spans
```python
from spacy.tokens import Span

train_doc = nlp.make_doc("Barack Obama was born in Hawaii.")
# The doc.spans key can be defined in the config
train_doc.spans["incorrect_spans"] = [
    Span(train_doc, 0, 2, label="ORG"),
    Span(train_doc, 5, 6, label="PRODUCT"),
]
```
config.cfg (excerpt)
```ini
[components.ner]
factory = "ner"
incorrect_spans_key = "incorrect_spans"
moves = null
update_with_oracle_cut_size = 100
```
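Putting the two together, here's a hedged sketch of an update step, assuming an nlp object whose ner component uses the config above. The doc carrying the incorrect spans serves as the reference of a regular Example:

```python
from spacy.training import Example

# Assumes `nlp` has an "ner" component configured with
# incorrect_spans_key = "incorrect_spans", as in the config excerpt above
predicted = nlp.make_doc("Barack Obama was born in Hawaii.")
# train_doc (from the snippet above) only records what the spans are *not*;
# no complete gold-standard entity annotation is needed
example = Example(predicted, train_doc)
losses = nlp.update([example], losses={})
```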
New pipeline packages for Catalan and Danish
spaCy v3.1 adds 5 new pipeline packages, including a new core family for Catalan
and a new transformer-based pipeline for Danish using the
danish-bert-botxo
weights.
See the models directory for an overview of all
available trained pipelines and the
training guide for details on how to train
your own.
| Package | Language | UPOS | Parser LAS | NER F |
| --- | --- | --- | --- | --- |
| ca_core_news_sm | Catalan | 98.2 | 87.4 | 79.8 |
| ca_core_news_md | Catalan | 98.3 | 88.2 | 84.0 |
| ca_core_news_lg | Catalan | 98.5 | 88.4 | 84.2 |
| ca_core_news_trf | Catalan | 98.9 | 93.0 | 91.2 |
| da_core_news_trf | Danish | 98.0 | 85.0 | 82.9 |
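The new packages are installed and loaded like any other trained pipeline. For example, for the Catalan transformer pipeline (the sample sentence is made up):

```python
import spacy

# First: python -m spacy download ca_core_news_trf
nlp = spacy.load("ca_core_news_trf")
doc = nlp("El Barça va guanyar el partit a Barcelona.")
print([(ent.text, ent.label_) for ent in doc.ents])
```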
Upload your pipelines to the Hugging Face Hub
The Hugging Face Hub lets you upload models and share
them with others, and it now supports spaCy pipelines out-of-the-box. The
extension package
automatically adds a huggingface-hub command to your spacy
CLI, lets you upload pipelines
packaged with spacy package
and takes care of
auto-generating all required meta information.
Upload a trained pipeline to the hub
```bash
pip install spacy-huggingface-hub
huggingface-cli login

# Package your pipeline
python -m spacy package ./en_ner_fashion ./output --build wheel
cd ./output/en_ner_fashion-0.0.0/dist

# Upload it to the hub
python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl
```
After uploading, you’ll get a live URL for your model page, as well as a direct
URL to the wheel file that you can install via pip install. You can also
integrate the upload command into your
project template to
automatically upload your packaged pipelines after training.
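If you'd rather call the upload from Python than the CLI, the package also exposes a push helper; a minimal sketch, assuming the wheel built in the packaging step above:

```python
from spacy_huggingface_hub import push

# Push a wheel built with `spacy package --build wheel`
result = push("./output/en_ner_fashion-0.0.0/dist/en_ner_fashion-0.0.0-py3-none-any.whl")
print(result["url"])  # live URL of the model page on the Hub
```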
New in the spaCy universe
The spaCy universe has seen a lot of awesome additions since the last release! Here’s a selection of new plugins and extensions you can use to add more power to your spaCy projects:
| Package | Description |
| --- | --- |
| 🐭 skweak | Toolkit for weak supervision applied to NLP tasks |
| 👯 coreferee | Coreference resolution for English, German and Polish |
| 🐇 tokenwiser | Connect vowpal-wabbit & scikit-learn models to spaCy |
| 🏺 hmrb | Python rule processing engine with readable syntax |
| 🧮 numerizer | Convert natural language numerics into ints and floats |
| 🌕 spikex | Pipeline components for knowledge extraction |
| 📘 trunajod | Text complexity library for text analysis |
| 🧠 emfdscore | Extended Moral Foundation Dictionary Scoring |
| 📇 denomme | Extension for extracting multilingual names |
| 💎 ruby-spacy | Wrapper to use spaCy in Ruby |
The following packages have been updated with support for spaCy v3:
| Package | Description |
| --- | --- |
| 🌳 ruTS | Library for statistics extraction in Russian |
| 🔍 spacy-dbpedia-spotlight | Use DBpedia Spotlight to link entities |
| ✍️ contextualSpellCheck | Contextual spell correction using BERT |
| 📚 spacy-wordnet | Pipeline component for WordNet and WordNet Domains |
| ⛔️ negspacy | Negating concepts in text based on the NegEx algorithm |
We have a lot more planned for upcoming releases, so stay tuned! Some of our current work in progress includes a native component for coreference resolution, new ecosystem integrations and end-to-end project templates for using PyTorch models to power spaCy components and training pipelines using the new span categorizer.