spaCy v3.0 is a huge release! It features new transformer-based pipelines that get spaCy’s accuracy right up to the current state-of-the-art, and a new workflow system to help you take projects from prototype to production. It’s much easier to configure and train your pipeline, and there are lots of new and improved integrations with the rest of the NLP ecosystem.
We’ve been working on spaCy v3.0 for over a year now, and almost two years if you count all the work that’s gone into Thinc. Our main aim with the release is to make it easier to bring your own models into spaCy, especially state-of-the-art models like transformers. You can write models powering spaCy components in frameworks like PyTorch or TensorFlow, using our awesome new configuration system to describe all of your settings. And since modern NLP workflows often consist of multiple steps, there’s a new workflow system to help you keep your work organized.
For detailed installation instructions for your platform and setup, check out the installation quickstart widget.
spaCy v3.0 features all new transformer-based pipelines that bring spaCy’s
accuracy right up to the current state-of-the-art. You can use any
pretrained transformer to train your own pipelines, and even share one
transformer between multiple components with multi-task learning. spaCy’s
transformer support interoperates with PyTorch and the HuggingFace transformers library,
giving you access to thousands of pretrained models for your pipelines. See
below for an overview of the new pipelines.
Accuracy on the OntoNotes 5.0 corpus (reported on the development set).
| Named Entity Recognition System | OntoNotes | CoNLL ’03 |
| --- | --- | --- |
| spaCy RoBERTa (2020) | 89.7 | 91.6 |
Named entity recognition accuracy on the
OntoNotes 5.0 and
CoNLL-2003 corpora.
1. Qi et al. (2020). 2.
Akbik et al. (2018).
spaCy lets you share a single transformer or other token-to-vector (“tok2vec”) embedding layer between multiple components. You can even update the shared layer, performing multi-task learning. Reusing the embedding layer between components can make your pipeline run a lot faster and result in much smaller models.
You can share a single transformer or other token-to-vector model between
multiple components by adding a
Tok2Vec component near the
start of your pipeline. Components later in the pipeline can “connect” to it by
including a listener layer within their model.
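As a rough sketch of what this looks like in a config, the snippet below parses a trimmed-down fragment in which an NER model listens to a shared tok2vec component. It only parses the text with Thinc's Config class and does not build a pipeline. The architecture names and argument values are illustrative assumptions and may differ between spaCy versions, and a real training config needs many more sections.

```python
from thinc.api import Config

# Trimmed config sketch: one shared tok2vec component and an NER model that
# listens to it. Values are illustrative, not recommended settings.
CONFIG_TEXT = """
[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.width}
upstream = "tok2vec"
"""

# Parse the text only; nothing is looked up in the function registry here.
config = Config().from_str(CONFIG_TEXT)
listener = config["components"]["ner"]["model"]["tok2vec"]
print(listener["@architectures"], listener["width"], listener["upstream"])
```

The listener's width is interpolated from the shared component, so the two stay in sync if the embedding width changes.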
spaCy v3.0 provides retrained model families for 18 languages and 59 trained pipelines in total, including 5 new transformer-based pipelines. You can also train your own transformer-based pipelines using your own data and transformer weights of your choice.
The models are each trained with a single transformer shared across the pipeline, which requires it to be trained on a single corpus. For English and Chinese, we used the OntoNotes 5 corpus, which has annotations across several tasks. For French, Spanish and German, we didn’t have a suitable corpus that had both syntactic and entity annotations, so the transformer models for those languages do not include NER.
spaCy v3.0 introduces a comprehensive and extensible system for configuring your training runs. A single configuration file describes every detail of your training run, with no hidden defaults, making it easy to rerun your experiments and track changes.
Training config files include all settings and hyperparameters for training your pipeline. Some settings can also be registered functions that you can swap out and customize, making it easy to implement your own custom models and architectures.
Some of the main advantages and features of spaCy’s training config are:
- Structured sections. The config is grouped into sections, and nested
sections are defined using the
. notation. For example,
[components.ner] defines the settings for the pipeline’s named entity recognizer. The config can be loaded as a Python dict.
- References to registered functions. Sections can refer to registered functions like model architectures, optimizers or schedules and define arguments that are passed into them. You can also register your own functions to define custom architectures or methods, reference them in your config and tweak their parameters.
- Interpolation. If you have hyperparameters or other settings used by multiple components, define them once and reference them as variables.
- Reproducibility with no hidden defaults. The config file is the “single source of truth” and includes all settings.
- Automated checks and validation. When you load a config, spaCy checks if the settings are complete and if all values have the correct types. This lets you catch potential mistakes early. In your custom architectures, you can use Python type hints to tell the config which types of data to expect.
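The sections, dot notation and interpolation described above can be tried out on a small config string. This is a minimal sketch, assuming Thinc's Config class is used to parse the text; the section names hyper_params and training and all values are made up for illustration and are not a complete spaCy training config.

```python
from thinc.api import Config

# Hypothetical mini-config: [hyper_params] holds shared values that other
# sections reference as variables.
CONFIG_TEXT = """
[hyper_params]
hidden_width = 64
dropout = 0.1

[components]

[components.ner]
factory = "ner"

[components.ner.model]
hidden_width = ${hyper_params.hidden_width}

[training]
dropout = ${hyper_params.dropout}
"""

config = Config().from_str(CONFIG_TEXT)

# The parsed config behaves like a nested Python dict.
ner_settings = config["components"]["ner"]
print(isinstance(config, dict))
print(ner_settings["factory"], ner_settings["model"]["hidden_width"])
print(config["training"]["dropout"])
```

Defining shared values once and referencing them keeps settings that must agree, such as layer widths, from drifting apart.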
spaCy’s new configuration system makes
it easy to customize the neural network models used by the different pipeline
components. You can also implement your own architectures via spaCy’s machine
learning library Thinc, which provides various layers and
utilities, as well as thin wrappers around frameworks like PyTorch,
TensorFlow and MXNet. Component models all follow the same unified
Model API and each
Model can also be used
as a sublayer of a larger network, allowing you to freely combine
implementations from different frameworks into a single model.
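To illustrate the idea that every model can serve as a sublayer of a larger network, here is a small sketch using only Thinc's built-in layers. The layer sizes and the toy data are assumptions chosen for illustration. Wrappers for other frameworks are left out to keep the example free of extra dependencies.

```python
import numpy
from thinc.api import Model, Relu, Softmax, chain

# Two hidden layers composed into one block. The block is itself a Model.
hidden = chain(Relu(nO=8), Relu(nO=8))

# Because the block is a Model, it can be used as a sublayer of a larger network.
network = chain(hidden, Softmax(nO=3))

# Toy data, made up for illustration: 4 examples, 5 features, 3 classes.
X = numpy.zeros((4, 5), dtype="float32")
Y = numpy.zeros((4, 3), dtype="float32")

# Missing input dimensions are inferred from the sample data.
network.initialize(X=X, Y=Y)
predictions = network.predict(X)
print(isinstance(hidden, Model), predictions.shape)
```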
spaCy projects let you manage and share end-to-end spaCy workflows for different use cases and domains, and orchestrate training, packaging and serving your custom pipelines. You can start off by cloning a pre-defined project template, adjust it to fit your needs, load in your data, train a pipeline, export it as a Python package, upload your outputs to a remote storage and share your results with your team.
spaCy projects also make it easy to integrate with other tools in the data science and machine learning ecosystem, including DVC for data version control, Prodigy for creating labelled data, Streamlit for building interactive apps, FastAPI for serving models in production, Ray for parallel training, Weights & Biases for experiment tracking, and more!
Selected example templates
To clone a template, you can run the
spacy project clone command with its
relative path, e.g.
python -m spacy project clone pipelines/ner_wikiner.
|Training a tagger and parser on a Universal Dependencies Treebank|
|Training a named entity recognition model on the WikiNER corpus|
|Text classification of emotions in Reddit posts|
|Serve trained pipelines with FastAPI|
|Visualize and explore trained pipelines with Streamlit|
Track your results with Weights & Biases
The results of each step are then logged in your project, together with the full training config. This means that every hyperparameter, registered function name and argument will be tracked and you’ll be able to see the impact it has on your results.
Ray is a fast and simple framework for building and running distributed applications. You can use Ray to train spaCy on one or more remote machines, potentially speeding up your training process.
The Ray integration is powered by a lightweight extension package,
spacy-ray, that automatically adds
the ray command to your spaCy CLI if it’s
installed in the same environment. You can then run
spacy ray train for parallel training.
spaCy v3.0 includes several new trainable and rule-based components that you can add to your pipeline and customize for your use case:
| Name | Description |
| --- | --- |
| SentenceRecognizer | Trainable component for sentence segmentation. |
| Morphologizer | Trainable component to predict morphological features. |
| Lemmatizer | Standalone component for rule-based and lookup lemmatization. |
| AttributeRuler | Component for setting token attributes using match patterns. |
| Transformer | Component for using transformer models in your pipeline, accessing outputs and aligning tokens. Provided via spacy-transformers. |
Defining, configuring, reusing, training and analyzing pipeline components
is now easier and more convenient. The
@Language.component and @Language.factory decorators let you
register your component and define its default configuration and meta data, like
the attribute values it assigns and requires. Any custom component can be
included during training, and sourcing components from existing trained
pipelines lets you mix and match custom pipelines. The
nlp.analyze_pipes method
outputs structured information about the current pipeline and its components,
including the attributes they assign, the scores they compute during training
and whether any required attributes aren’t set.
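A minimal sketch of a custom component registered as a factory with a default config and declared attributes is shown below. The component name long_token_counter, its min_length setting and the long_token_count extension attribute are hypothetical names invented for this example.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical custom attribute that the component writes to.
Doc.set_extension("long_token_count", default=0, force=True)


class LongTokenCounter:
    """Counts tokens with at least `min_length` characters."""

    def __init__(self, min_length: int):
        self.min_length = min_length

    def __call__(self, doc: Doc) -> Doc:
        doc._.long_token_count = sum(
            1 for token in doc if len(token.text) >= self.min_length
        )
        return doc


# Register the factory with a default config and the attribute it assigns.
@Language.factory(
    "long_token_counter",
    default_config={"min_length": 5},
    assigns=["doc._.long_token_count"],
)
def create_long_token_counter(nlp: Language, name: str, min_length: int):
    return LongTokenCounter(min_length)


nlp = spacy.blank("en")
# Override the default setting when adding the component to the pipeline.
nlp.add_pipe("long_token_counter", config={"min_length": 6})

doc = nlp("spaCy makes custom components simple")
print(doc._.long_token_count)

# Structured information about the pipeline and its components.
analysis = nlp.analyze_pipes()
print(analysis["summary"]["long_token_counter"])
```

Because the setting lives in the component's config, it can be overridden per pipeline without touching the component code.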
DependencyMatcher lets you
match patterns within the dependency parse using
Semgrex operators. It follows the same API as the token-based
Matcher. A pattern added to the dependency
matcher consists of a list of dictionaries, with each dictionary describing
a token to match and its relation to an existing token in the pattern.
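The pattern format can be sketched on a hand-built parse, so no trained pipeline is needed. The sentence, the heads and the dependency labels below are constructed by hand for illustration, and the node names verb, subject and object are arbitrary.

```python
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()

# Hand-annotated parse: each head is the absolute index of the token's head.
words = ["Smith", "founded", "a", "company"]
heads = [1, 1, 3, 1]
deps = ["nsubj", "ROOT", "det", "dobj"]
doc = Doc(vocab, words=words, heads=heads, deps=deps)

# The first dictionary anchors the pattern. Each later dictionary describes a
# new token (RIGHT_ID) and its relation (REL_OP) to an existing one (LEFT_ID).
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"ORTH": "founded"}},
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",  # "verb" is the immediate head of "subject"
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",  # "verb" is the immediate head of "object"
        "RIGHT_ID": "object",
        "RIGHT_ATTRS": {"DEP": "dobj"},
    },
]

matcher = DependencyMatcher(vocab)
matcher.add("FOUNDED", [pattern])

matches = matcher(doc)
match_id, token_ids = matches[0]
matched_words = [doc[i].text for i in token_ids]
print(vocab.strings[match_id], matched_words)
```

Each match pairs the pattern's ID with one token index per dictionary in the pattern.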
spaCy v3.0 officially drops support for Python 2 and now requires Python
3.6+. This also means that the code base can take full advantage of
type hints. spaCy’s user-facing
API that’s implemented in pure Python (as opposed to Cython) now comes with type
hints. The new version of spaCy’s machine learning library
Thinc also features extensive
type support, including custom
types for models and arrays, and a custom
mypy plugin that can be used to
type-check model definitions.
For data validation, spaCy v3.0 adopts
pydantic. It also powers the data
validation of Thinc’s config system, which
lets you register custom functions with typed arguments, reference them in
your config and see validation errors if the argument values don’t match.
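As a small sketch of this validation in action, the example below registers a function with a typed argument, resolves a config block that references it, and then resolves a block with a mismatched value. The registry name demo_length_filter.v1 and the function itself are hypothetical and exist only for this example.

```python
import spacy
from thinc.api import ConfigValidationError


# Hypothetical registered function with a typed argument.
@spacy.registry.misc("demo_length_filter.v1")
def create_length_filter(min_length: int):
    def keep(text: str) -> bool:
        return len(text) >= min_length

    return keep


# A config block that references the function and passes a valid argument.
good_config = {"filter": {"@misc": "demo_length_filter.v1", "min_length": 4}}
keep = spacy.registry.resolve(good_config)["filter"]
print(keep("spaCy"), keep("NLP"))

# The same block with a value that cannot be read as an integer.
bad_config = {"filter": {"@misc": "demo_length_filter.v1", "min_length": "four"}}
validation_failed = False
try:
    spacy.registry.resolve(bad_config)
except ConfigValidationError:
    validation_failed = True
print("Validation failed:", validation_failed)
```

The type hint on min_length is all the validation needs, so no separate schema has to be written for the function.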