To demonstrate the performance of spaCy v3.2, we present a series of UD benchmarks comparable to the Stanza and Trankit evaluations on Universal Dependencies v2.5, using the evaluation from the CoNLL 2018 Shared Task.
The benchmarks show the competitive performance of spaCy’s core components for tagging, parsing and sentence segmentation and also let us highlight and evaluate the new edit tree lemmatizer. The trained pipelines in the benchmarks are made available for download on Explosion’s Hugging Face Hub repo and a UD benchmark project lets you run the full training and evaluation for any Universal Dependencies corpus.
The core syntactic annotation is performed by built-in spaCy components:
- XPOS: Tagger
- UPOS+UFeats: Morphologizer
- Head+Deprel+Sentences: Parser
- Sentences (optional, alternative to parser): Senter
Experimental components are used for tokenization and lemmatization:
Aside from the tokenizer, the pipeline components are trained with a single
transformer component using
xlm-roberta-base, similar to Trankit Base.
While many spaCy pipelines are trained on Universal Dependencies corpora, we haven’t published full Universal Dependencies benchmarks in the past. The reason is that spaCy v2 and v3 pipelines have primarily relied on rule-based components for tokenization and lemmatization. These components are good for speed in production, but not for training from scratch for a language that only has partial support in spaCy or where spaCy’s defaults don’t align well with the corpus annotation scheme.
Tokenization presents a particular problem, since every single error lowers the ceiling for the performance of all the following components. In order to give spaCy’s core components a fair shake in comparison with other libraries, we switch from a fast rule-based tokenizer to a slower trainable tokenizer that doesn’t require any manual customization. This new experimental tokenizer uses spaCy’s built-in NER component under the hood to segment on the character level, following the idea behind the Elephant tokenizer (Evang et al. 2013).
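To make the character-level idea concrete, here is a minimal sketch of how a tokenization can be encoded as one tag per character and decoded again. The `B`/`I`/`O` tag names and both helper functions are assumptions made for illustration; they are not the label scheme or API of the experimental tokenizer.

```python
def tokens_to_char_tags(text, tokens):
    """Encode a tokenization as one tag per character.

    Assumed scheme: "B" = first character of a token,
    "I" = inside a token, "O" = outside any token (e.g. whitespace).
    """
    tags = ["O"] * len(text)
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)  # locate the token in the raw text
        tags[start] = "B"
        for i in range(start + 1, start + len(tok)):
            tags[i] = "I"
        pos = start + len(tok)
    return tags


def char_tags_to_tokens(text, tags):
    """Decode per-character tags back into a list of token strings."""
    tokens = []
    current = ""
    for ch, tag in zip(text, tags):
        if tag == "B":
            if current:
                tokens.append(current)
            current = ch
        elif tag == "I":
            current += ch
        else:  # "O" closes any open token
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


text = "Don't stop."
gold_tokens = ["Do", "n't", "stop", "."]
tags = tokens_to_char_tags(text, gold_tokens)
print("".join(tags))
print(char_tags_to_tokens(text, tags))
```

A sequence labeller trained to predict such tags can split inside a whitespace-delimited string (as with "Do" and "n't" here) without any hand-written rules, which is the property the benchmark relies on.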
For lemmatization, we use the new experimental
edit tree lemmatizer, which we recently added
along with the experimental tokenizer to our new
spacy-experimental package, where we plan to provide in-progress features and
components while we refine and
evaluate them for inclusion in the core spaCy library.
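As a rough illustration of what an edit tree is, the sketch below derives a tree from one (form, lemma) pair by recursively splitting around the longest common substring, then applies that tree to a different form. It is a simplified stand-alone example, not the implementation in the experimental package; the tuple layout and the function names `build_tree` and `apply_tree` are assumptions.

```python
from difflib import SequenceMatcher


def build_tree(form, lemma):
    """Build a small edit tree that rewrites `form` into `lemma`."""
    matcher = SequenceMatcher(None, form, lemma, autojunk=False)
    m = matcher.find_longest_match(0, len(form), 0, len(lemma))
    if m.size == 0:
        # Leaf: nothing in common, so substitute one string for the other.
        return ("sub", form, lemma)
    left = build_tree(form[: m.a], lemma[: m.b])
    right = build_tree(form[m.a + m.size :], lemma[m.b + m.size :])
    # Interior node: remember how many characters precede and follow the
    # shared substring, not the substring itself, so the tree generalizes.
    return ("match", m.a, len(form) - m.a - m.size, left, right)


def apply_tree(tree, form):
    """Apply an edit tree to a form; return None if it does not fit."""
    if tree[0] == "sub":
        _, source, target = tree
        return target if form == source else None
    _, prefix_len, suffix_len, left, right = tree
    if prefix_len + suffix_len > len(form):
        return None
    middle_end = len(form) - suffix_len
    new_prefix = apply_tree(left, form[:prefix_len])
    new_suffix = apply_tree(right, form[middle_end:])
    if new_prefix is None or new_suffix is None:
        return None
    return new_prefix + form[prefix_len:middle_end] + new_suffix


tree = build_tree("walked", "walk")
print(apply_tree(tree, "talked"))   # the same tree strips "ed" here too
print(apply_tree(tree, "running"))  # the tree does not apply to this form
```

Because the tree stores lengths rather than the shared substring, a tree learned from one word can be reused for other words with the same pattern, which turns lemmatization into choosing among a finite set of trees.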
The other remaining issue for spaCy and Universal Dependencies is multi-word
tokens (MWTs), which don’t fit well into spaCy’s
Doc objects. A spaCy
Doc aligns each token directly with a series of
characters in the input text and it doesn’t support multiple token texts or
multiple tokenizations within the same document. As a result, especially for
cases where the UD word forms don’t correspond to the token form in the text,
it’s difficult to implement an MWT expander for spaCy because the annotation
couldn’t be stored easily on the text-based tokens in the Doc.
For now, we side-step this mismatch and focus on UD corpora with no or few MWTs, since this gives a more accurate impression of the performance of spaCy’s pipeline components. For the corpora with a small number of MWTs, we use spaCy’s CoNLL-U converter to merge MWTs into single tokens that have the text of the original token with linguistic features merged from the word annotations. The lower “Words” scores do cascade into the remaining evaluation metrics, but you can get a better impression of the performance of spaCy itself from the aligned accuracy scores.
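The merging step can be pictured with a small sketch. It collapses each CoNLL-U multi-word token range (an ID such as `1-2`) and its word rows into a single token that keeps the surface text. The way annotations are combined here (lemmas joined with a space, UPOS tags joined with `_`) is an assumption chosen for illustration and may differ from what spaCy’s converter does; the `merge_mwts` helper and the four-column row layout are hypothetical.

```python
def merge_mwts(rows):
    """Merge multi-word tokens in one sentence.

    `rows` is a list of (id, form, lemma, upos) tuples in CoNLL-U order,
    with ids given as strings. Empty nodes (ids like "1.1") are not handled.
    """
    merged = []
    i = 0
    while i < len(rows):
        tok_id, form, lemma, upos = rows[i]
        if "-" in tok_id:
            # Range row: the surface token spanning words start..end.
            start, end = (int(part) for part in tok_id.split("-"))
            n_words = end - start + 1
            words = rows[i + 1 : i + 1 + n_words]
            merged.append(
                {
                    "text": form,  # keep the text as it appears in the input
                    "lemma": " ".join(w[2] for w in words),  # assumed merge rule
                    "upos": "_".join(w[3] for w in words),   # assumed merge rule
                }
            )
            i += 1 + n_words
        else:
            merged.append({"text": form, "lemma": lemma, "upos": upos})
            i += 1
    return merged


sentence = [
    ("1-2", "du", "_", "_"),
    ("1", "de", "de", "ADP"),
    ("2", "le", "le", "DET"),
    ("3", "pain", "pain", "NOUN"),
]
for token in merge_mwts(sentence):
    print(token)
```

The merged token aligns with the characters in the text, which is what a Doc requires, but it no longer matches the two gold words, which is why the “Words” score drops for corpora that contain MWTs.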
The pipeline has the following configuration, with the relevant UD evaluation metrics noted for each:
The tokenizer is trained separately and the remaining components are trained
sharing the same transformer component using multi-task learning. The final
pipeline is assembled with the
senter disabled by default so that sentence
boundaries are set by the parser, which is the same design used in
spaCy’s trained pipelines. We’ll see in the
evaluation where it makes sense to use the senter vs. the parser for sentence
segmentation.
We selected 28 UD v2.5 corpora to benchmark using this configuration. The corpora share the following characteristics:
- 20K+ training tokens
- whitespace is used to separate tokens
- no or few multi-word tokens
- license permitting commercial use
The CoNLL 2018 evaluation metrics for UD v2.5 are shown for Stanza, Trankit and spaCy in the following table.
- The Stanza and Trankit numbers are copied from Trankit’s model performance overview.
- spaCy’s CoNLL-U converter copies UPOS to XPOS if XPOS is missing, so XPOS and AllFeats are omitted in the averages and in the evaluations for several corpora: Danish-DDT, French-Sequoia, Norwegian-Bokmaal, Norwegian-Nynorsk, Portuguese-Bosque.
In general, spaCy’s performance is very close to Trankit Base for larger corpora and solidly in between Stanza and Trankit Base for smaller corpora. The part-of-speech tags and morphological features are on par with Trankit Base while UAS/LAS are slightly lower.
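To see how tokenization errors feed into scores such as LAS, and why aligned accuracy is reported separately, the following simplified sketch computes both for a single sentence. It follows the general shape of the CoNLL 2018 metrics (F1 over all words, accuracy over aligned words only) but is not the official evaluation script: words are aligned only when their character spans match exactly, and the `las_scores` function and its input format are assumptions for illustration.

```python
def las_scores(gold, system):
    """Compute simplified LAS precision, recall, F1 and aligned accuracy.

    `gold` and `system` map a word's (start, end) character span to a
    (head_span, deprel) pair, where head_span is None for the root.
    """
    # A system word is aligned if a gold word covers exactly the same span.
    aligned = [span for span in system if span in gold]
    # An aligned word is correct if both its head and its label match.
    correct = sum(1 for span in aligned if system[span] == gold[span])

    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    aligned_accuracy = correct / len(aligned) if aligned else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "aligned_accuracy": aligned_accuracy,
    }


# Text: "I can't go". Gold splits "can't" into "ca" + "n't"; the system does not.
GO = (8, 10)
gold = {
    (0, 1): (GO, "nsubj"),   # I
    (2, 4): (GO, "aux"),     # ca
    (4, 7): (GO, "advmod"),  # n't
    GO: (None, "root"),      # go
}
system = {
    (0, 1): (GO, "nsubj"),   # I
    (2, 7): (GO, "aux"),     # can't (one token, so it cannot be aligned)
    GO: (None, "root"),      # go
}
print(las_scores(gold, system))
```

In this example, every word the system tokenized correctly also has the correct head and label, so the aligned accuracy is perfect, while the F1 score is pulled down by the single tokenization mismatch.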
For smaller corpora and languages with rich morphology, spaCy’s edit tree lemmatizer is slightly worse than Stanza’s seq2seq lemmatizer. For corpora where lemmatization is primarily a segmentation task rather than a generation task (Korean-GSD, Korean-Kaist), the edit tree lemmatizer outperforms Stanza.
For most UD corpora and especially for smaller corpora, spaCy’s separate
senter sentence segmenter performs better than the default parser-based
segmentation. In general, if sentence boundaries are marked by punctuation, the
senter component performs well, requiring much less training data than the
parser.
To try out a pipeline yourself, download a udv25 pipeline from
Explosion’s Hugging Face Hub repo. Find the
link to install the pipeline at the top right under “Use in spaCy”.
Install the model from the Hugging Face Hub link
python -m pip install https://huggingface.co/explosion/en_udv25_englishewt_trf/resolve/main/en_udv25_englishewt_trf-any-py3-none-any.whl
Be aware that this will additionally install
spacy-experimental to provide the
experimental tokenizer and lemmatizer. If you haven’t already installed
transformers, you might want to have a look at our recommended
Once the pipeline is installed, load it like any other spaCy pipeline:
import spacy
spacy.prefer_gpu()
nlp = spacy.load("en_udv25_englishewt_trf")
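If you would rather have sentence boundaries set by the senter than by the parser, one way to switch is sketched below. The `prefer_senter` helper is hypothetical; it relies only on spaCy v3’s `nlp.pipe_names`, `nlp.disabled`, `nlp.disable_pipe` and `nlp.enable_pipe`, and assumes the components are named `parser` and `senter` as in the pipeline description above.

```python
def prefer_senter(nlp):
    """Switch sentence segmentation from the parser to the senter, in place.

    `nlp` is expected to be a loaded spaCy v3 pipeline that contains a
    "parser" component and a "senter" component that is disabled by default.
    """
    if "parser" in nlp.pipe_names:
        # Disabling the parser also removes heads and dependency labels.
        nlp.disable_pipe("parser")
    if "senter" in nlp.disabled:
        nlp.enable_pipe("senter")
    return nlp


# Example usage (requires the pipeline to be installed):
#
#   import spacy
#   nlp = prefer_senter(spacy.load("en_udv25_englishewt_trf"))
#   doc = nlp("This is a sentence. This is another one.")
#   print([sent.text for sent in doc.sents])
```

Note that with the parser disabled the Doc no longer carries dependency annotation, so this setup only makes sense when sentence segmentation, tags, morphology and lemmas are all you need.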
If you would like to train and evaluate the same pipelines yourself, start with the UD benchmark project:
Clone the project
python -m spacy project clone benchmarks/ud_benchmark
cd ud_benchmark
By default, this project trains a pipeline on UD_English-EWT:
Download data, train, assemble and evaluate
python -m spacy project assets
python -m spacy project run all
You can edit
project.yml to switch to a different UD corpus or edit the
configs to try out different pipeline and training settings. See the full
spaCy project docs for more information on
working with the project assets, templates and remote storage.
These pipelines are published for benchmarking purposes and are not intended for production use. In production, a rule-based tokenizer for languages with whitespace or a language-specific word segmenter such as SudachiPy for Japanese is a better choice than the experimental tokenizer, which is not optimized for speed or memory use.
If you’re working with a specific language, you may be able to train a better,
smaller model with a language-specific transformer model in place of
xlm-roberta-base. spaCy can provide language-specific transformer
recommendations with
spacy init config --lang lang --gpu config.cfg or in the
training quickstart with the GPU option.