
Universal Dependencies v2.5 Benchmarks for spaCy

To demonstrate the performance of spaCy v3.2, we present a series of UD benchmarks comparable to the Stanza and Trankit evaluations on Universal Dependencies v2.5, using the evaluation from the CoNLL 2018 Shared Task.

The benchmarks show the competitive performance of spaCy’s core components for tagging, parsing and sentence segmentation and also let us highlight and evaluate the new edit tree lemmatizer. The trained pipelines in the benchmarks are made available for download on Explosion’s Hugging Face Hub repo and a UD benchmark project lets you run the full training and evaluation for any Universal Dependencies corpus.

The core syntactic annotation is performed by built-in spaCy components:

  • tagger
  • morphologizer
  • parser
  • senter

Experimental components are used for tokenization and lemmatization:

  • experimental character-based tokenizer
  • experimental edit tree lemmatizer

Aside from the tokenizer, the pipeline components are trained with a single transformer component using xlm-roberta-base, similar to Trankit Base.

spaCy and Universal Dependencies

While many spaCy pipelines are trained on Universal Dependencies corpora, we haven’t published full Universal Dependencies benchmarks in the past because spaCy v2 and v3 pipelines have primarily relied on rule-based components for tokenization and lemmatization. Rule-based components are good for speed in production, but not for training from scratch for a language that only has partial support in spaCy, or where spaCy’s defaults don’t align well with the corpus annotation scheme.

Tokenization presents a particular problem, since every single error lowers the ceiling for the performance of all the following components. In order to give spaCy’s core components a fair shake in comparison with other libraries, we switch from a fast rule-based tokenizer to a slower trainable tokenizer that doesn’t require any manual customization. This new experimental tokenizer uses spaCy’s built-in NER component under the hood to segment on the character level, following the idea behind the Elephant tokenizer (Evang et al. 2013).
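
As a minimal sketch of the underlying idea (this is illustrative decoding, not the spacy-experimental implementation): each character is tagged as beginning a token, inside a token, or outside all tokens, and the tags are decoded into token spans.

Character-level tokenization as tagging

text = "Don't stop"
# B = a token begins here, I = inside a token, O = outside any token.
# The trained component predicts these tags; here they are hard-coded.
tags = ["B", "I", "B", "I", "I", "O", "B", "I", "I", "I"]

tokens = []
start = None
for i, tag in enumerate(tags):
    if tag in ("B", "O"):
        if start is not None:
            tokens.append(text[start:i])
        start = i if tag == "B" else None
if start is not None:
    tokens.append(text[start:])

print(tokens)  # ['Do', "n't", 'stop']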

For lemmatization, we use the new experimental edit tree lemmatizer, which we recently added along with the experimental tokenizer to our new spacy-experimental package, where we plan to provide in-progress features and components while we refine and evaluate them for inclusion in the core spaCy library.
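
The intuition, in a deliberately simplified sketch: rather than generating the lemma character by character as a seq2seq model would, the lemmatizer classifies each token into one of a set of string transformations induced from the training data. Real edit trees are recursive structures that can also rewrite prefixes and infixes; the suffix rules below only illustrate the idea.

Lemmatization as rule classification (simplified)

# Simplified: lemmatization as choosing a learned transformation rule.
rules = {
    "strip_ning": lambda s: s[:-4],      # running -> run
    "ies_to_y": lambda s: s[:-3] + "y",  # studies -> study
    "identity": lambda s: s,             # stop -> stop
}

# In the real component a classifier predicts the rule; here it's given.
for form, rule in [("running", "strip_ning"), ("studies", "ies_to_y"), ("stop", "identity")]:
    print(form, "->", rules[rule](form))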

Multi-word tokens

The other remaining issue for spaCy and Universal Dependencies is multi-word tokens (MWTs), which don’t fit well into spaCy’s Doc objects. In UD, a single surface token can correspond to multiple syntactic words: German zum is annotated as the two words zu and dem, for example. A spaCy Doc aligns each token directly with a span of characters in the input text, and it doesn’t support multiple token texts or multiple tokenizations within the same document. As a result, especially for cases where the UD word forms don’t correspond to the token form in the text, it’s difficult to implement an MWT expander for spaCy, because the word-level annotation can’t be stored easily on the text-based tokens in the Doc.

UD vs. spaCy MWTs

For now, we side-step this mismatch and focus on UD corpora with no or few MWTs, since this gives a more accurate impression of the performance of spaCy’s pipeline components. For the corpora with a small number of MWTs, we use spaCy’s CoNLL-U converter to merge MWTs into single tokens that have the text of the original token with linguistic features merged from the word annotations. The lower “Words” scores do cascade into the remaining evaluation metrics, but you can get a better impression of the performance of spaCy itself from the aligned accuracy scores.
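
This merging is built into spaCy’s CoNLL-U converter through the --merge-subtokens flag (the corpus path below is just an example):

Merge multi-word tokens during conversion

python -m spacy convert da_ddt-ud-train.conllu ./corpus --converter conllu --merge-subtokens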

Configuration and training

The pipeline has the following components, with the relevant UD evaluation metrics noted for each:

  • experimental tokenizer: Tokens, Words
  • tagger: XPOS
  • morphologizer: UPOS, UFeats
  • parser: UAS, LAS
  • senter: Sentences
  • experimental edit tree lemmatizer: Lemmas
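
The sharing works through listener layers: each component’s tok2vec sublayer listens to the single transformer component. Here is an abridged, illustrative config.cfg excerpt using the standard spacy-transformers architectures (the actual configs ship with the benchmark project below; the window and stride values shown are the quickstart defaults):

Pipeline configuration

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "xlm-roberta-base"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.tagger.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.tagger.model.tok2vec.pooling]
@layers = "reduce_mean.v1"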

The tokenizer is trained separately and the remaining components are trained sharing the same transformer component using multi-task learning. The final pipeline is assembled with the senter disabled by default so that sentence boundaries are set by the parser, which is the same design used in spaCy’s trained pipelines. We’ll see in the evaluation where it makes sense to use the senter vs. the parser for sentence segmentation.
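
For example, to switch an installed pipeline (such as the English pipeline introduced below) from parser-based to senter-based sentence segmentation:

Use the senter for sentence segmentation

import spacy

nlp = spacy.load("en_udv25_englishewt_trf")
nlp.enable_pipe("senter")   # re-enable the senter (disabled by default)
nlp.disable_pipe("parser")  # note: this also drops dependency annotation

doc = nlp("This is one sentence. This is another one.")
print([sent.text for sent in doc.sents])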

Benchmarks

We selected 28 UD v2.5 corpora to benchmark using this configuration. The corpora share the following characteristics:

  • 20K+ training tokens
  • whitespace is used to separate tokens
  • no or few multi-word tokens
  • license permitting commercial use

The CoNLL 2018 evaluation metrics for UD v2.5 are shown for Stanza, Trankit and spaCy in the following table.

(Table: CoNLL 2018 evaluation metrics per corpus for Stanza, Trankit and spaCy)
  • The Stanza and Trankit numbers are copied from Trankit’s model performance overview.
  • spaCy’s CoNLL-U converter copies UPOS values to XPOS if XPOS is missing, so XPOS and AllTags are omitted in the averages and in the evaluations for several corpora: Danish-DDT, French-Sequoia, Norwegian-Bokmaal, Norwegian-Nynorsk, Portuguese-Bosque.

In general, spaCy’s performance is very close to Trankit Base for larger corpora and solidly in between Stanza and Trankit Base for smaller corpora. The part-of-speech tags and morphological features are on par with Trankit Base while UAS/LAS are slightly lower.

For smaller corpora and languages with rich morphology, spaCy’s edit tree lemmatizer is slightly worse than Stanza’s seq2seq lemmatizer. For corpora where lemmatization is primarily a segmentation task rather than a generation task (Korean-GSD, Korean-Kaist), the edit tree lemmatizer outperforms Stanza.

For most UD corpora and especially for smaller corpora, spaCy’s separate senter sentence segmenter performs better than the default parser-based segmentation. In general, if sentence boundaries are marked by punctuation, the senter component performs well, requiring much less training data than the parser.

Try out a pipeline

Install any udv25 pipeline from Explosion’s Hugging Face Hub repo. Find the link to install the pipeline at the top right under “Use in spaCy”.

Install the pipeline from the Hugging Face Hub link

python -m pip install https://huggingface.co/explosion/en_udv25_englishewt_trf/resolve/main/en_udv25_englishewt_trf-any-py3-none-any.whl

Be aware that this will additionally install spacy-experimental to provide the experimental tokenizer and lemmatizer. If you haven’t already installed transformers, you might want to have a look at our recommended installation steps.
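
For example, one common route is to install spaCy together with its transformers extra (add a CUDA-specific extra such as cuda113 to match your local setup):

Install spaCy with transformer support

python -m pip install 'spacy[transformers]'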

Once the pipeline is installed, load it like any other spaCy pipeline:

import spacy
spacy.prefer_gpu()
nlp = spacy.load("en_udv25_englishewt_trf")
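
Continuing from the snippet above, the pipeline produces the usual spaCy annotations, with the UD tags, morphological features, lemmas and dependencies on the tokens:

doc = nlp("The fox jumped over the lazy dog.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.morph, token.dep_)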

UD benchmark project

If you would like to train and evaluate the same pipelines yourself, start with the UD benchmark project:

Clone the project

python -m spacy project clone benchmarks/ud_benchmark
cd ud_benchmark

By default, this project trains a pipeline on UD_English-EWT:

Download data, train, assemble and evaluate

python -m spacy project assets
python -m spacy project run all

You can edit project.yml to switch to a different UD corpus or edit the configs to try out different pipeline and training settings. See the full spaCy project docs for more information on working with the project assets, templates and remote storage.

In addition, you can use spacy-huggingface-hub to upload spaCy pipelines to your own repo, complete with model cards generated from the spaCy pipeline metadata.
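
The basic push workflow looks like this (the wheel filename is illustrative; spacy package builds it from a trained pipeline directory):

Push a packaged pipeline to the Hugging Face Hub

python -m pip install spacy-huggingface-hub
huggingface-cli login
python -m spacy huggingface-hub push en_udv25_englishewt_trf-any-py3-none-any.whl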

Notes on use

These pipelines are published for benchmarking purposes and are not intended for production use. In production, a rule-based tokenizer for languages with whitespace or a language-specific word segmenter such as SudachiPy for Japanese is a better choice than the experimental tokenizer, which is not optimized for speed or memory use.

If you’re working with a specific language, you may be able to train a better, smaller model with a language-specific transformer model in place of xlm-roberta-base. spaCy can provide language-specific transformer recommendations with spacy init config --lang lang --gpu config.cfg or in the training quickstart with the GPU option.
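
For example, to generate a transformer-based config with the recommendations for German (substitute any supported language code):

Generate a config with language-specific recommendations

python -m spacy init config config.cfg --lang de --gpu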