Introducing spaCy v3.0

spaCy v3.0 is a huge release! It features new transformer-based pipelines that get spaCy’s accuracy right up to the current state-of-the-art, and a new workflow system to help you take projects from prototype to production. It’s much easier to configure and train your pipeline, and there are lots of new and improved integrations with the rest of the NLP ecosystem.

We’ve been working on spaCy v3.0 for over a year now, and almost two years if you count all the work that’s gone into Thinc. Our main aim with the release is to make it easier to bring your own models into spaCy, especially state-of-the-art models like transformers. You can write models powering spaCy components in frameworks like PyTorch or TensorFlow, using our awesome new configuration system to describe all of your settings. And since modern NLP workflows often consist of multiple steps, there’s a new workflow system to help you keep your work organized.

For detailed installation instructions for your platform and setup, check out the installation quickstart widget.

pip install -U spacy

Transformer-based pipelines

spaCy v3.0 features all-new transformer-based pipelines that bring spaCy’s accuracy right up to the current state-of-the-art. You can use any pretrained transformer to train your own pipelines, and even share one transformer between multiple components with multi-task learning. spaCy’s transformer support interoperates with PyTorch and the HuggingFace transformers library, giving you access to thousands of pretrained models for your pipelines. See below for an overview of the new pipelines.

Pipeline	Parser	Tagger	NER
`en_core_web_trf` (spaCy v3)	95.5	98.3	89.4
`en_core_web_lg` (spaCy v3)	92.2	97.4	85.4
`en_core_web_lg` (spaCy v2)	91.9	97.2	85.5

Accuracy on the OntoNotes 5.0 corpus (reported on the development set).

Named Entity Recognition System	OntoNotes	CoNLL ’03
spaCy RoBERTa (2020)	89.7	91.6
Stanza (StanfordNLP)¹	88.8	92.1
Flair²	89.7	93.1

Named entity recognition accuracy on the OntoNotes 5.0 and CoNLL-2003 corpora. See NLP-progress for more results. Project template: benchmarks/ner_conll03.
¹ Qi et al. (2020). ² Akbik et al. (2018).

spaCy lets you share a single transformer or other token-to-vector (“tok2vec”) embedding layer between multiple components. You can even update the shared layer, performing multi-task learning. Reusing the embedding layer between components can make your pipeline run a lot faster and result in much smaller models.

To set this up, add a Transformer or Tok2Vec component near the start of your pipeline. Components later in the pipeline can “connect” to it by including a listener layer within their model.
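As a rough sketch of what the listener setup can look like in a config, the fragment below declares a tok2vec component and a tagger whose model embeds a listener layer. It is parsed with Thinc's Config class only to show the structure. This is not a complete training config, and the width of 96 is an illustrative assumption that would need to match the output width of the shared layer.

```python
from thinc.api import Config

# Partial, illustrative config: one shared tok2vec component plus a tagger
# whose model contains a listener layer instead of its own embedding layer.
CONFIG_FRAGMENT = """
[components]

[components.tok2vec]
factory = "tok2vec"

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = 96
upstream = "*"
"""

# Parsing only reads the structure; no registered functions are called here.
config = Config().from_str(CONFIG_FRAGMENT)
listener = config["components"]["tagger"]["model"]["tok2vec"]
print(listener["@architectures"], "listens to", listener["upstream"])
```

Any further component that should reuse the shared layer would get the same listener block inside its own model section.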

New trained pipelines

spaCy v3.0 provides retrained model families for 18 languages and 59 trained pipelines in total, including 5 new transformer-based pipelines. You can also train your own transformer-based pipelines using your own data and transformer weights of your choice.

Package	Language	Transformer	Tagger	Parser	NER
`en_core_web_trf`	English	`roberta-base`	97.8	95.2	89.9
`de_dep_news_trf`	German	`bert-base-german-cased`	99.0	95.8	-
`es_dep_news_trf`	Spanish	`bert-base-spanish-wwm-cased`	98.2	94.6	-
`fr_dep_news_trf`	French	`camembert-base`	95.7	94.4	-
`zh_core_web_trf`	Chinese	`bert-base-chinese`	92.5	76.6	75.4

Each of these pipelines is trained with a single transformer shared across all components, which means the whole pipeline has to be trained on a single corpus. For English and Chinese, we used the OntoNotes 5 corpus, which has annotations across several tasks. For French, Spanish and German, we didn’t have a suitable corpus that had both syntactic and entity annotations, so the transformer pipelines for those languages do not include NER.


New training workflow and config system

spaCy v3.0 introduces a comprehensive and extensible system for configuring your training runs. A single configuration file describes every detail of your training run, with no hidden defaults, making it easy to rerun your experiments and track changes.

You can use the quickstart widget or the init config command to get started. Instead of providing lots of arguments on the command line, you only need to pass your config.cfg file to spacy train.

Training config files include all settings and hyperparameters for training your pipeline. Some settings can also be registered functions that you can swap out and customize, making it easy to implement your own custom models and architectures.

config.cfg
[training]
accumulate_gradient = 3

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.01

Some of the main advantages and features of spaCy’s training config are:

Structured sections. The config is grouped into sections, and nested sections are defined using the . notation. For example, [components.ner] defines the settings for the pipeline’s named entity recognizer. The config can be loaded as a Python dict.
References to registered functions. Sections can refer to registered functions like model architectures, optimizers or schedules and define arguments that are passed into them. You can also register your own functions to define custom architectures or methods, reference them in your config and tweak their parameters.
Interpolation. If you have hyperparameters or other settings used by multiple components, define them once and reference them as variables.
Reproducibility with no hidden defaults. The config file is the “single source of truth” and includes all settings.
Automated checks and validation. When you load a config, spaCy checks if the settings are complete and if all values have the correct types. This lets you catch potential mistakes early. In your custom architectures, you can use Python type hints to tell the config which types of data to expect.
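The structured sections and variable interpolation described above can be tried out without a full training setup. The following is a minimal sketch that parses a small config string with Thinc's Config class. The section and setting names (hyperparams, hidden_width) are made up for illustration and are not settings that spaCy's built-in components expect.

```python
from thinc.api import Config

# Hypothetical settings: one shared value, referenced by two sections.
CONFIG_STR = """
[hyperparams]
hidden_width = 128

[components]

[components.ner]
hidden_width = ${hyperparams.hidden_width}

[components.textcat]
hidden_width = ${hyperparams.hidden_width}
"""

# Variables are interpolated while the string is parsed.
config = Config().from_str(CONFIG_STR)

# The parsed config behaves like a nested Python dict.
for name, settings in config["components"].items():
    print(name, settings["hidden_width"])
```

Changing the value in the [hyperparams] section is then enough to update every section that references it.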

Custom models using any framework

spaCy’s new configuration system makes it easy to customize the neural network models used by the different pipeline components. You can also implement your own architectures via spaCy’s machine learning library Thinc that provides various layers and utilities, as well as thin wrappers around frameworks like PyTorch, TensorFlow and MXNet. Component models all follow the same unified Model API and each Model can also be used as a sublayer of a larger network, allowing you to freely combine implementations from different frameworks into a single model.

Wrapping a PyTorch model
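from torch import nn
from thinc.api import PyTorchWrapper

# A regular PyTorch module
torch_model = nn.Sequential(
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Softmax(dim=1)
)
# Wrap it so it exposes Thinc's Model API and can be combined with other layers
model = PyTorchWrapper(torch_model)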
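For comparison, here is a minimal sketch of a model built only from Thinc's own layers, to illustrate the shared Model API (initialize, predict) that wrapped models also expose. The layer sizes and the random input are arbitrary assumptions chosen for the example.

```python
import numpy
from thinc.api import chain, Relu, Softmax

# Compose two Thinc layers into one model; sizes are arbitrary.
model = chain(Relu(nO=32), Softmax(nO=4))

# A batch of 8 made-up input vectors with 16 features each.
rng = numpy.random.default_rng(0)
X = rng.normal(size=(8, 16)).astype("float32")

# Initialization infers the missing input dimension from the sample data.
model.initialize(X=X)

# predict() runs the forward pass without tracking gradients.
probs = model.predict(X)
print(probs.shape)
```

Because every layer is a Model, a wrapped PyTorch or TensorFlow model could be passed to chain in the same position as one of the Thinc layers.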

Manage end-to-end workflows with projects

spaCy projects let you manage and share end-to-end spaCy workflows for different use cases and domains, and orchestrate training, packaging and serving your custom pipelines. You can start off by cloning a pre-defined project template, adjust it to fit your needs, load in your data, train a pipeline, export it as a Python package, upload your outputs to a remote storage and share your results with your team.

spaCy projects also make it easy to integrate with other tools in the data science and machine learning ecosystem, including DVC for data version control, Prodigy for creating labelled data, Streamlit for building interactive apps, FastAPI for serving models in production, Ray for parallel training, Weights & Biases for experiment tracking, and more!

Using spaCy projects
# Clone a project template
python -m spacy project clone pipelines/tagger_parser_ud
cd tagger_parser_ud
# Download data assets
python -m spacy project assets
# Run a workflow
python -m spacy project run all

Selected example templates

To clone a template, you can run the spacy project clone command with its relative path, e.g. python -m spacy project clone pipelines/ner_wikiner.


Template	Description
`pipelines/tagger_parser_ud`	Training a tagger and parser on a Universal Dependencies Treebank
`pipelines/ner_wikiner`	Training a named entity recognition model on the WikiNER corpus
`tutorials/textcat_goemotions`	Text classification of emotions in Reddit posts
`integrations/fastapi`	Serve trained pipelines with FastAPI
`integrations/streamlit`	Visualize and explore trained pipelines with Streamlit

Track your results with Weights & Biases

Weights & Biases is a popular platform for experiment tracking. spaCy integrates with it out of the box via the WandbLogger, which you can add as the [training.logger] block of your training config.

The results of each step are then logged in your project, together with the full training config. This means that every hyperparameter, registered function name and argument will be tracked and you’ll be able to see the impact it has on your results.

config.cfg
[training.logger]
@loggers = "spacy.WandbLogger.v1"
project_name = "monitor_spacy_training"
remove_config_values = ["paths.train", "paths.dev", "training.dev_corpus.path", "training.train_corpus.path"]

Parallel and distributed training with Ray

Ray is a fast and simple framework for building and running distributed applications. You can use Ray to train spaCy on one or more remote machines, potentially speeding up your training process.

The Ray integration is powered by a lightweight extension package, spacy-ray, that automatically adds the ray command to your spaCy CLI if it’s installed in the same environment. You can then run spacy ray train for parallel training.

Parallel training with Ray
pip install spacy-ray --pre
# Check that the CLI is registered
python -m spacy ray --help
# Train a pipeline
python -m spacy ray train config.cfg --n-workers 2

New built-in pipeline components

spaCy v3.0 includes several new trainable and rule-based components that you can add to your pipeline and customize for your use case:


Component	Description
`SentenceRecognizer`	Trainable component for sentence segmentation.
`Morphologizer`	Trainable component to predict morphological features.
`Lemmatizer`	Standalone component for rule-based and lookup lemmatization.
`AttributeRuler`	Component for setting token attributes using match patterns.
`Transformer`	Component for using transformer models in your pipeline, accessing outputs and aligning tokens. Provided via `spacy-transformers`.
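To give a feel for the rule-based components, here is a small sketch that uses the AttributeRuler in a blank English pipeline. The pattern and the attribute values are invented for the example and do not come from a trained pipeline.

```python
import spacy

# A blank pipeline only has a tokenizer, so no trained model is needed.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# Illustrative rule: tag any token spelled "spacy" (case-insensitive)
# as a proper noun and normalize its lemma.
ruler.add(
    patterns=[[{"LOWER": "spacy"}]],
    attrs={"POS": "PROPN", "LEMMA": "spaCy"},
)

doc = nlp("I use spacy daily.")
for token in doc:
    print(token.text, token.pos_, token.lemma_)
```

Tokens that don't match a pattern are left untouched, so in this blank pipeline their part-of-speech and lemma stay empty.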

New and improved pipeline component APIs

Defining, configuring, reusing, training and analyzing pipeline components is now easier and more convenient. The @Language.component and @Language.factory decorators let you register your component and define its default configuration and metadata, like the attribute values it assigns and requires. Any custom component can be included during training, and sourcing components from existing trained pipelines lets you mix and match custom pipelines. The nlp.analyze_pipes method outputs structured information about the current pipeline and its components, including the attributes they assign, the scores they compute during training and whether any required attributes aren’t set.

import spacy
from spacy.language import Language

@Language.component("my_component")
def my_component(doc):
    return doc

nlp = spacy.blank("en")
# Add components using their string names
nlp.add_pipe("my_component")
# Source components from other pipelines
other_nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("ner", source=other_nlp)
# Analyze components and their attributes
nlp.analyze_pipes(pretty=True)
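A component that needs settings can be registered with @Language.factory instead. The sketch below is one possible way to do this; the factory name demo_length_flagger, its min_length setting and the use of doc.user_data are all hypothetical choices made for the example.

```python
import spacy
from spacy.language import Language


# Hypothetical factory: the default config supplies min_length unless
# it is overridden when the component is added to a pipeline.
@Language.factory("demo_length_flagger", default_config={"min_length": 5})
def create_length_flagger(nlp: Language, name: str, min_length: int):
    def flag_long_tokens(doc):
        # Store the texts of all tokens with at least min_length characters.
        doc.user_data["long_tokens"] = [
            token.text for token in doc if len(token.text) >= min_length
        ]
        return doc

    return flag_long_tokens


nlp = spacy.blank("en")
# Override the default setting for this pipeline only.
nlp.add_pipe("demo_length_flagger", config={"min_length": 7})

doc = nlp("spaCy makes pipelines configurable")
print(doc.user_data["long_tokens"])
```

The factory function receives the shared nlp object and the component's name, followed by the settings from the config, and returns the callable that processes each Doc.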

Dependency matching

The new DependencyMatcher lets you match patterns within the dependency parse using Semgrex operators. It follows the same API as the token-based Matcher. A pattern added to the dependency matcher consists of a list of dictionaries, with each dictionary describing a token to match and its relation to an existing token in the pattern.

Figure: part of the match pattern.

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)
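pattern = [
    # Anchor token: the verb "founded"
    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
    # founded -> subject (direct dependent with the nsubj relation)
    {"LEFT_ID": "anchor_founded", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}},
    # founded -> object (direct dependent with the dobj relation)
    {"LEFT_ID": "anchor_founded", "REL_OP": ">", "RIGHT_ID": "founded_object", "RIGHT_ATTRS": {"DEP": "dobj"}},
    # object -> modifier (direct dependent that is an amod or compound)
    {"LEFT_ID": "founded_object", "REL_OP": ">", "RIGHT_ID": "founded_object_modifier", "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}}}
]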
matcher.add("FOUNDED", [pattern])
doc = nlp("Lee, an experienced CEO, has founded two AI startups.")
matches = matcher(doc)
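Each match is a tuple of the match ID and a list of token indices, one per node in the pattern. To show how the result can be read without downloading a trained pipeline, the following sketch builds a short Doc with a hand-written dependency parse. The sentence, heads and labels are assumptions made up for the example, not parser output.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated parse (an assumption for this example). Heads are
# absolute token indices; the root points to itself.
words = ["Lee", "founded", "two", "AI", "startups"]
heads = [1, 1, 4, 4, 1]
deps = ["nsubj", "ROOT", "nummod", "compound", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

pattern = [
    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
    {"LEFT_ID": "anchor_founded", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}},
    {"LEFT_ID": "anchor_founded", "REL_OP": ">", "RIGHT_ID": "founded_object", "RIGHT_ATTRS": {"DEP": "dobj"}},
    {"LEFT_ID": "founded_object", "REL_OP": ">", "RIGHT_ID": "founded_object_modifier", "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}}},
]
node_names = [node["RIGHT_ID"] for node in pattern]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("FOUNDED", [pattern])

results = []
for match_id, token_ids in matcher(doc):
    rule = nlp.vocab.strings[match_id]
    # token_ids follow the order of the nodes in the pattern.
    roles = {name: doc[i].text for name, i in zip(node_names, token_ids)}
    results.append((rule, roles))
    print(rule, roles)
```

Pairing the token indices with the RIGHT_ID names makes it easy to see which token filled which role in the pattern.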

Type hints and type-based data validation

spaCy v3.0 officially drops support for Python 2 and now requires Python 3.6+. This also means that the code base can take full advantage of type hints. spaCy’s user-facing API that’s implemented in pure Python (as opposed to Cython) now comes with type hints. The new version of spaCy’s machine learning library Thinc also features extensive type support, including custom types for models and arrays, and a custom mypy plugin that can be used to type-check model definitions.

For data validation, spaCy v3.0 adopts pydantic. It also powers the data validation of Thinc’s config system, which lets you register custom functions with typed arguments, reference them in your config and see validation errors if the argument values don’t match.

Argument validation with type hints
from spacy.language import Language
from pydantic import StrictBool

@Language.factory("my_component")
def create_component(nlp: Language, name: str, custom: StrictBool):
    ...
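The same kind of validation applies to registered functions that are referenced from a config. Here is a minimal sketch, assuming a made-up function registered as demo_window.v1 in spaCy's misc registry; the config sections are also invented for the example.

```python
import spacy
from thinc.api import Config, ConfigValidationError


# Hypothetical registered function with typed arguments.
@spacy.registry.misc("demo_window.v1")
def make_window(size: int, strict: bool) -> dict:
    return {"size": size, "strict": strict}


GOOD_CONFIG = """
[window]
@misc = "demo_window.v1"
size = 3
strict = true
"""

# Same config, but size is now a string that cannot be read as an integer.
BAD_CONFIG = GOOD_CONFIG.replace("size = 3", 'size = "three"')

# Resolving calls the registered function with the validated arguments.
resolved = spacy.registry.resolve(Config().from_str(GOOD_CONFIG))
print(resolved["window"])

validation_failed = False
try:
    spacy.registry.resolve(Config().from_str(BAD_CONFIG))
except ConfigValidationError as err:
    validation_failed = True
    print("Config rejected:", type(err).__name__)
```

The error is raised before the function is called, so a mistyped value surfaces when the config is resolved rather than somewhere in the middle of a training run.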