spaCy now speaks Chinese, Japanese, Danish, Polish and Romanian! Version 2.3 of the spaCy Natural Language Processing library adds models for five new languages. We’ve also updated all 15 model families with new word vectors and improved accuracy, while decreasing model size and loading times for models with vectors.
This is the last major release of v2, by the way. We’ve been working hard on spaCy v3, which comes with a lot of cool improvements, especially for training, configuration and custom modeling. We’ll start making prereleases on spacy-nightly soon, so stay tuned.
New languages
spaCy v2.3 provides new model families for five languages: Chinese, Danish, Japanese, Polish and Romanian. The Chinese and Japanese language models are the first provided models that use external libraries for word segmentation rather than spaCy’s tokenizer.
Chinese
The new Chinese models use pkuseg for word segmentation and ship with a custom model trained on OntoNotes with a token accuracy of 94.6%. Users can initialize the tokenizer with both pkuseg and custom models and customize the user dictionary. Details can be found in the Chinese docs. The Chinese tokenizer continues to support jieba as the default word segmenter, along with character-based segmentation as in v2.2.
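Here’s roughly what that looks like in practice. This is a minimal sketch assuming the zh_core_web_sm package is installed; pkuseg_update_user_dict is the user-dictionary method described in the Chinese docs, and the sentence and dictionary entry are just illustrations:

```python
import spacy

# Assumes the model package is installed, e.g. via:
#   python -m spacy download zh_core_web_sm
nlp = spacy.load("zh_core_web_sm")

# Add a domain term to pkuseg's user dictionary so it segments as a
# single token (the entry here is illustrative)
nlp.tokenizer.pkuseg_update_user_dict(["自然语言处理"])

doc = nlp("spaCy现在支持自然语言处理。")
print([token.text for token in doc])
```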
Japanese
The updated Japanese language class switches to SudachiPy for word segmentation and part-of-speech tagging. Using SudachiPy greatly simplifies installing spaCy for Japanese, which is now possible with a single command: pip install spacy[ja]. More details are in the Japanese docs.
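Once it’s installed, the new model loads like any other spaCy model. A quick sketch (the example sentence is just an illustration):

```python
import spacy

# Download the model first: python -m spacy download ja_core_news_sm
nlp = spacy.load("ja_core_news_sm")

doc = nlp("これは日本語の文章です。")
for token in doc:
    # SudachiPy provides both the segmentation and the part-of-speech tags
    print(token.text, token.pos_, token.tag_)
```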
Model performance
Following our usual convention, the sm, md and lg models differ in their word vectors. The lg models include one word vector for most words in the training data, while the md models prune the vectors table to include entries only for the 20,000 most common words, mapping less frequent words to the most similar vector in the reduced table. The sm models do not use pretrained vectors.
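You can see the difference directly in the packaged vector tables. A minimal sketch, using the Danish models from the table below as an arbitrary example:

```python
import spacy

# Danish is an arbitrary example; any md/lg pair behaves the same way
nlp_md = spacy.load("da_core_news_md")
nlp_lg = spacy.load("da_core_news_lg")

# md keeps a pruned table (~20k rows); lg keeps ~500k unique vectors
print(nlp_md.vocab.vectors.shape)
print(nlp_lg.vocab.vectors.shape)

# In the pruned md table, a less frequent word maps to the most similar
# remaining vector, so it still comes back with a (shared) vector
token = nlp_md("hovedbanegård")[0]
print(token.has_vector, token.vector_norm)
```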
| Language | Model | Size | TAG | UAS | LAS | ENTS F |
| --- | --- | --- | --- | --- | --- | --- |
| Chinese | zh_core_web_sm | 45 MB | 89.63 | 68.55 | 63.21 | 66.57 |
| | zh_core_web_md | 75 MB | 90.23 | 69.39 | 64.43 | 68.52 |
| | zh_core_web_lg | 575 MB | 90.55 | 69.77 | 64.99 | 69.33 |
| Danish | da_core_news_sm | 16 MB | 92.79 | 80.48 | 75.65 | 72.79 |
| | da_core_news_md | 46 MB | 94.13 | 82.71 | 78.98 | 81.45 |
| | da_core_news_lg | 546 MB | 94.95 | 82.53 | 78.99 | 82.73 |
| Japanese | ja_core_news_sm | 7 MB | 97.30 | 88.68 | 86.87 | 59.93 |
| | ja_core_news_md | 37 MB | 97.30 | 89.26 | 87.76 | 67.68 |
| | ja_core_news_lg | 526 MB | 97.30 | 88.94 | 87.55 | 70.48 |
| Polish | pl_core_news_sm | 46 MB | 98.03 | 85.61 | 78.09 | 81.32 |
| | pl_core_news_md | 76 MB | 98.28 | 90.41 | 84.47 | 84.68 |
| | pl_core_news_lg | 576 MB | 98.45 | 90.80 | 85.52 | 85.67 |
| Romanian | ro_core_news_sm | 13 MB | 95.65 | 87.20 | 79.79 | 71.05 |
| | ro_core_news_md | 43 MB | 96.32 | 88.69 | 81.77 | 75.42 |
| | ro_core_news_lg | 545 MB | 96.78 | 88.87 | 82.05 | 76.71 |
The training data for Danish, Japanese and Romanian is relatively small, so the pretrained word vectors improve accuracy quite a lot, in particular for NER. The Chinese model uses a larger training corpus, but word segmentation errors may make the word vectors less effective. Word segmentation accuracy also explains some of the lower scores for Chinese, as the model has to get the word segmentation correct before it can be scored as accurate on any of the subsequent tasks.
Word vectors for all model families
All model families now include medium and large models with 20k and 500k unique vectors respectively. For most languages, spaCy v2.3 introduces custom word vectors trained using spaCy’s language-specific tokenizers on data from OSCAR and Wikipedia. The vectors are trained with FastText using the same settings as the official FastText word vectors (CBOW, 300 dimensions, character n-grams of length 5).
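To make those settings concrete, here’s a rough sketch of the equivalent training call with the fasttext Python package, followed by spaCy’s v2 CLI command for importing the vectors into a model base. The corpus path is a placeholder, and this approximates the recipe described above rather than reproducing our exact pipeline:

```python
import fasttext

# CBOW, 300 dimensions, character n-grams of length 5, matching the
# settings above. "corpus.txt" is a placeholder for text pre-tokenized
# with spaCy's language-specific tokenizer.
model = fasttext.train_unsupervised(
    "corpus.txt", model="cbow", dim=300, minn=5, maxn=5
)

# Export a .vec file that spaCy's v2 CLI can import
with open("vectors.vec", "w", encoding="utf8") as f:
    words = model.get_words()
    f.write(f"{len(words)} {model.get_dimension()}\n")
    for word in words:
        vec = " ".join(f"{v:.4f}" for v in model.get_word_vector(word))
        f.write(f"{word} {vec}\n")

# Then, e.g. for Danish:
#   python -m spacy init-model da ./da_model --vectors-loc vectors.vec
```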
In particular for languages with smaller training corpora, the addition of word vectors greatly improves model accuracy. For example, the Lithuanian tagger accuracy increases from 81.7% for the small model (no vectors) to 89.3% for the large model. The parser improves by a similar margin, and the NER F-score increases from 66.0% to 70.1%. For German, updating the word vectors increases the scores of the medium model by 1.5 percentage points across the board.
Updated training data
All spaCy training corpora based on Universal Dependencies corpora have been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish). The updated data improves the quality and size of the training corpora, increasing the tagger and parser accuracy for all provided models. For example, the Dutch training data is extended to include both UD Dutch Alpino and LassySmall, which improves the tagger and parser scores for the small models by 3%, and the addition of the new word vectors improves the scores by a further 3-5%.
Fine-grained POS tags
As a result of the updates, many of the fine-grained part-of-speech tag sets differ from the v2.2 models. The coarse-grained tag sets remain the same, although there are some minor differences in how they are calculated from the fine-grained tags.
For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech tag sets contain new merged tags related to contracted forms, such as ADP_DET for French "au", which maps to UPOS ADP based on the head "à". This increases the accuracy of the models by improving the alignment between spaCy’s tokenization and the Universal Dependencies multi-word tokens used for contractions.
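You can inspect the merged tags directly via token.tag_ and token.pos_. A small sketch, assuming the updated fr_core_news_sm model (the sentence is illustrative):

```python
import spacy

nlp = spacy.load("fr_core_news_sm")
doc = nlp("Il est allé au marché.")
for token in doc:
    # tag_ is the fine-grained tag, pos_ the coarse-grained UPOS
    print(token.text, token.tag_, token.pos_)

# "au" should show the merged fine-grained tag ADP_DET, with the
# coarse-grained UPOS ADP derived from the head "à"
```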
Smaller models and faster loading times
The medium model packages with 20k vectors are at least 2× smaller than in v2.2, the large English model is 120MB smaller, and loading times are 2-4× faster for all models with vectors. To achieve this, models no longer store derivable lexeme attributes such as lower and is_alpha, and the remaining lexeme attributes (norm, cluster and prob) have been moved to spacy-lookups-data.
If you’re training new models, you’ll probably want to install spacy-lookups-data for the normalization and lemmatization tables! The provided models include the norm lookup tables for use with the core pipeline components, but the optional cluster and prob features are now only available through spacy-lookups-data.
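In practice, that looks something like the sketch below. The English model and example words are illustrative, and the cluster and prob lookups only resolve if spacy-lookups-data is installed:

```python
# pip install spacy[lookups]   # pulls in spacy-lookups-data
import spacy

nlp = spacy.load("en_core_web_sm")

# Derivable attributes like lower_ and is_alpha are now computed on the
# fly rather than stored in the model package
lex = nlp.vocab["Apple"]
print(lex.lower_, lex.is_alpha)

# norm tables ship with the provided models; cluster and prob are only
# filled in when spacy-lookups-data is installed
print(nlp.vocab["apple"].prob)
```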
Free online course and tutorials
We’re also proud to announce updates and translations of our online course, “Advanced NLP with spaCy”. We’ve made a few small updates to the English version, including new videos to go with the interactive exercises. It’s really the translations we’re excited about though. We have translations into Japanese, German and Spanish, with Chinese, French and Russian soon to come.
Speaking of videos, you should also check out Sofie’s tutorial on training a custom entity linking model with spaCy. You can find the code and data in our growing projects repository.
Another cool video to check out is the new episode of Vincent Warmerdam’s “Intro to NLP with spaCy”. The series lets you sit beside Vincent as he works through an example data science project using spaCy. In episode 5, “Rules vs. Machine Learning”, Vincent uses spaCy’s rule-based matcher to probe the decisions of the NER model he trained previously, using the rules to understand the model’s behavior and figure out how to improve the training data to get better results.
What’s next?
spaCy v2.3 is the last big release of v2. We’ve been working hard on v3 and expect to start publishing prereleases in the next few weeks. spaCy v3 comes with a lot of cool improvements, especially for training, configuration and custom modeling. The training and data formats are the main things we’ve taken the opportunity to fix, so v3 will have some breaking changes, but don’t worry: it’s nothing like the big transformations seen in libraries like TensorFlow or Angular. It should be pretty easy to upgrade, and we’ve still tried to backport as much as possible into v2.3, so you can use it right away. We’ll also continue to make maintenance releases of v2.3 with bug fixes as they come in.
We also have a big release of our annotation tool Prodigy pretty much ready to go. In addition to the spaCy v2.3 update (giving you all the new models), Prodigy v1.10 comes with a new annotation interface for tasks like relation extraction and coreference resolution, full-featured audio and video annotation (including recipes using pyannote.audio models in the loop), a new and improved manual image UI, more options for NER annotation, new recipe callbacks, and lots more. To get notified when it’s ready, follow us on Twitter!