Version 2.1 of the spaCy Natural Language Processing library includes a huge number of features, improvements and bug fixes. In this post, we highlight some of the things we’re especially pleased with, and explain some of the most challenging parts of preparing this big release.
spaCy is an open-source library for industrial-strength natural language
processing in Python. It’s widely used in production and research systems for
extracting information from text, developing smarter user-facing features, and
preprocessing text for deep learning. We’ve been publishing alpha releases to
spacy-nightly
for months now, and encouraging users to try out the new
version. Today we’re excited to finally publish spaCy v2.1.0. We’ve fixed almost
every outstanding bug on the tracker, given the docs a
huge makeover, improved both speed and accuracy, made installation significantly
easier and faster, and developed some exciting new features. Check out the
release notes for a
full overview.
Language model pretraining
By far the biggest news in NLP research over 2018 was the success of language model pretraining. The basic intuition behind this has been obvious for a very long time. There’s never been much doubt that NLP models need to somehow import knowledge from raw text, as labelled training corpora tend to be too small to represent long-tailed knowledge about word meanings and usage. In 2011, deep learning methods were proving successful for NLP, and techniques for pretraining word representations were already in use. A range of techniques for pretraining further layers of the network were proposed over the years, as the deep learning hype took hold. However, no one objective for the pretraining seemed to be a knockout success on a wide range of tasks.
In 2018, a number of papers showed that a simple language modelling objective worked well for LSTM models. Devlin et al. then presented a neat modification that allowed bidirectional models to be pretrained as well. One of the major themes throughout these results was that pretraining allowed extremely large models to be used, even when the labelled data is fairly small. A team from OpenAI took this one step further, training an even larger transformer language model, GPT-2, and showing that it performs well on long-form text generation.
While these large models provide convincing demonstrations, they’re not suitable for spaCy’s main use-cases. The performance target we’ve set for ourselves is 10,000 words per second per CPU core. The v2.1 models currently run at around 8,000 words per second, so we’re already slightly behind. Clearly, we couldn’t use a model such as BERT or GPT-2 directly. But the same principle of pretraining should still apply, so long as we could find a way to scale it down.
Scaling down these language models to the sizes we use in spaCy posed an interesting research challenge. Language models typically use a large output layer, with one neuron per word in the vocabulary. If you’re predicting over a 10,000 word vocabulary, this means you’re predicting a vector with 10,000 elements. spaCy v2.1’s token vectors are 96 elements wide, so a naive softmax approach would be unlikely to work: we’d be trying to predict 100 elements of output for every 1 element of input. We could make the vocabulary somewhat smaller, but every word that’s out of vocabulary is a word the pretraining process will be unable to learn. Stepping back a little, the problem of so-called “one hot” representations posing representational issues for neural networks is actually quite familiar. This is exactly what algorithms like word2vec, GloVe and FastText set out to solve. Instead of a binary vector with one dimension per entry in the vocabulary, we can have a much denser real-valued representation of the same information.
The spacy pretrain
command requires a
word vectors model as part of the input, which it uses as the target output for
each token. Instead of predicting a token’s ID as a classification problem, we
learn to predict the token’s word vector. Inspired by names such as ELMo and
BERT, we’ve termed this trick Language Modelling with Approximate Outputs
(LMAO). Our first implementation is probably a good way to get acquainted with
the idea –
it’s extremely short.
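If you just want the shape of the idea, here's a minimal sketch in plain numpy (illustrative only – spaCy's real implementation lives in Thinc and differs in detail):

import numpy as np

# Sketch of Language Modelling with Approximate Outputs (LMAO):
# instead of predicting a token ID over the whole vocabulary, the network
# predicts the token's pretrained word vector, and we train with an L2 loss.
def lmao_l2_loss(predicted, target_vectors):
    # predicted: (n_tokens, width) array output by the tok2vec model
    # target_vectors: (n_tokens, width) pretrained vectors for the same tokens
    diff = predicted - target_vectors
    loss = (diff ** 2).sum()
    d_predicted = 2 * diff  # gradient passed back into the tok2vec layers
    return loss, d_predicted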
As is often the case in research, it seems that LMAO is an idea whose time had come. Several other researchers have been working on related ideas independently. So far we’ve been using L2 loss in our experiments, but Kumar and Tsvetkov (2018), who were simultaneously working on a similar idea for machine translation, have developed a novel probabilistic loss using the von Mises-Fisher distribution, which they show performs significantly better than L2 in their experiments. Even more recently, Li et al. (2019) report experiments using an LMAO objective in place of the softmax layer in the ELMo pretraining system, with promising results. In our own preliminary experiments, we’ve found pretraining especially effective when limited training data is available. It helps most for text categorization and parsing, but is less effective for named entity recognition. We expect the pretraining to be increasingly important as we add more abstract semantic prediction models to spaCy, for tasks such as semantic role labelling, coreference resolution and named entity linking.
Example: 100,000 Reddit comments
As a small example, we ran spacy pretrain
for the English
sm
and lg
models using 100,000 comments from
the Reddit comments corpus:
Pretraining examples
# Pretrain for the en_core_web_sm model. The sm model doesn't require the word vectors
# at runtime, while the lg model does.
python -m spacy pretrain /input/reddit-100k.jsonl en_vectors_web_lg /output

# Pretrain for the en_core_web_lg model
python -m spacy pretrain /input/reddit-100k.jsonl en_vectors_web_lg /output --use-vectors
We ran both pretraining jobs simultaneously on a Tesla V100, with each task
training at around 50,000 tokens per second. We pretrained for 3 billion words
(making several passes over the 100k comments), which took around 17 hours. The
total cost of both jobs came out to about
$40.00 on Google Compute Engine. We haven’t implemented resume logic yet; once we have, it should decrease the cost of large-scale jobs further by allowing the use of pre-emptible instances. This would take pretraining costs down to around $4
per billion words of training. The
spacy pretrain
command saves out a
weights file after each pass over the data. To use the pretrained weights, we
can simply pass them as an argument to
spacy train
:
python -m spacy train en /models/ \
    /corpora/PTB_SD_3_3_0/train.gold.json \
    /corpora/PTB_SD_3_3_0/dev.gold.json \
    --n-examples 100 --pipeline parser \
    --init-tok2vec pretrain-nv-model999.bin
Example: Norwegian core model
We’re also pleased to report our first independent positive result for the spaCy
pretrain command. Jari Bakken and
Ole Henrik Skogstrøm have been working on
Norwegian Bokmål support for spaCy, using NER annotations produced by the
University of Oslo. Even with a small amount of pretraining using default
settings, the spacy pretrain
command resulted in much better performance for
all three components: the tagger, parser and entity recognizer.
Pretraining | POS | UAS | LAS | NER P | NER R | NER F |
---|---|---|---|---|---|---|
❌ no | 94.60 | 88.59 | 86.10 | 71.96 | 70.54 | 71.24 |
✅ yes | 95.07 | 90.14 | 87.82 | 78.92 | 78.69 | 78.81 |
Improved rule-based components
Over the years, the
rule-based Matcher
has become
one of spaCy’s most popular features. Statistical models are great at generalizing
based on context and beyond specific examples – but they can’t always beat
large terminology lists and application-specific rules. Rule-based systems are
especially powerful when they can leverage statistical predictions, like
part-of-speech tags, syntactic dependencies or entity labels.
spaCy v2.1 ships with a new matcher engine, rewritten from scratch. It resolves
various issues around the use of operators and quantifiers like "OP": "?"
to
make a token optional. The API also introduces
new predicates
to express set membership or rich comparison. The following pattern matches a
sequence of two tokens: a pronoun whose lowercase form isn’t “i” or “it”,
followed by a verb with the base form “like” or “love”:
pattern = [ {"POS": "PRON", "LOWER": {"NOT_IN": ["i", "it"]}}, {"POS": "VERB", "LEMMA": {"IN": ["like", "love"]}}, ]
The new match pattern API now also supports a "_"
key, allowing patterns to
specify custom extension attribute values to match on. In this case, the pattern
matches a token if token._.number
is greater than or equal to 20:
pattern = [{"_": {"number": {">=": 20}}}]
Rule-based entity recognition
When we introduced custom pipeline
components in v2.0, many users took advantage of them to build their own
rule-based entity recognizers powered by the Matcher
. Whether it’s cities,
gene names or units for oil drilling, many entity types can be expressed pretty
unambiguously with terminology lists and token-based rules.
The EntityRuler
is
a useful new component that can do all of this out-of-the-box. If it’s added
before the entity recognizer in the pipeline, the entities it sets directly
influence the model’s predictions. The statistical entity recognizer will
respect pre-defined entity spans and take them into account when predicting the
entity tags for the remaining tokens, which can potentially give you a nice
boost in accuracy. If the entity ruler is added after the statistical entity
recognizer, it can “fill in the blanks” and catch entities that the model
missed, or optionally overwrite existing predictions.
Using the entity ruler
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")
weights_pattern = [
    {"LIKE_NUM": True},
    {"LOWER": {"IN": ["g", "kg", "grams", "kilograms", "lb", "lbs", "pounds"]}}
]
patterns = [{"label": "QUANTITY", "pattern": weights_pattern}]
ruler = EntityRuler(nlp, patterns=patterns)
nlp.add_pipe(ruler, before="ner")

doc = nlp("U.S. average was 2 lbs.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('U.S.', 'GPE'), ('2 lbs', 'QUANTITY')]
A pattern can either be a list of dictionaries describing the individual tokens, or an exact string match. If you’ve been using our annotation tool Prodigy, you might recognize this format from the pattern files you can load in to bootstrap new entity types and text categories. The formats are fully compatible, so you’ll be able to use your Prodigy patterns with the entity ruler, and vice versa.
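For instance, phrase patterns (exact strings) and token patterns can live side by side in the same ruler – a quick sketch with made-up example patterns:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp)
ruler.add_patterns([
    # Phrase pattern: an exact string match
    {"label": "ORG", "pattern": "Explosion AI"},
    # Token pattern: a list of dicts describing each token
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])
nlp.add_pipe(ruler, before="ner")

doc = nlp("Explosion AI is based in Berlin, not San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])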
The EntityRuler
is also fully serializable, making it easy to package entity
rules with your spaCy models. Patterns will be saved out to the model directory
as a .jsonl
file (newline-delimited JSON) and loaded back in when you load the
model. We’re hoping that this component can be used to power models that rely on
large domain-specific terminology lists.
Serializing the entity ruler
nlp = spacy.load("en_core_web_sm")ruler = EntityRuler(nlp, patterns=lots_of_patterns)nlp.add_pipe(ruler, before="ner")nlp.to_disk("/path/to/model-with-rules")
Retokenization
spaCy has always supported merging spans of several tokens into single tokens –
for example, to merge a noun phrase into one word. However, the existing
Doc.merge
and Span.merge
implementations were inefficient when merging in
bulk, because the array had to be resized each time. On top of that, it was
difficult to keep track of changing token indices, and easy to end up with
incorrectly merged spans.
The new
Doc.retokenize
context manager
is specifically optimized for bulk processing. Merges are collected and
performed when the context manager exits.
Retokenization with merging
doc = nlp("I moved from New York to Los Angeles")with doc.retokenize() as retokenizer:retokenizer.merge(doc[3:5], attrs={"LEMMA": "New York"})retokenizer.merge(doc[6:8], attrs={"LEMMA": "Los Angeles"})
In addition to merging, Doc.retokenize
can also split one token into several.
The process requires more settings, because you need to specify the text of the
individual tokens, optional per-token attributes and how the new tokens should
be attached to the existing syntax tree. To prevent mismatches, the heads can be
provided as tokens, or (token, subtoken)
tuples if the newly split token
should be attached to another subtoken.
Retokenization with splitting
doc = nlp("I live in NewYork")with doc.retokenize() as retokenizer:heads = [(doc[3], 1), doc[2]]attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
With better splitting and merging, we’re also well set up for better support
for statistical tokenization. Tokenizing languages like English and German
works fine with a rule-based approach, but for languages like Chinese, Vietnamese
and Japanese, statistical models are definitely required. The v2.1 release also
has some quiet improvements that will help set the stage for better support for
these languages. The GoldParse
class is now able to calculate many-to-many
alignments between the tokenization in a Doc
object and the gold-standard.
When the Doc
object over-segments, the parser can now learn to predict a
special label that can mark the tokens for merging by a later component. With
this approach, spaCy’s parser is now capable of jointly predicting
tokenization, sentence segmentation and parsing, which should be very helpful
for languages or genres with high mutual information between these problems.
Better, faster tokenization
v2.1 also fixes long-standing performance problems in the tokenizer's regular expressions. The biggest issue was variable-width lookbehinds, caused by character classes
which actually consist of multiple-character tokens (like ’’
). This meant that
they needed to be grouped into disjunctive expressions using |
, rather than
character classes using [
] which only require a set lookup. Variable-width
lookbehinds introduce a serious performance problem, because they can’t be
recognized by a finite-state automaton. Essentially, you’re no longer dealing
with “regular expressions” once you have these. Russ Cox (who wrote re2
) has a
very comprehensive overview of
these issues.
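A toy illustration of the difference (not one of spaCy's actual tokenizer expressions; assumes the third-party regex package is installed):

import re
import regex

# A fixed-width lookbehind compiles in both libraries and can be matched
# with a simple set lookup.
re.compile(r"(?<=[ab])c")

# A variable-width lookbehind (alternatives of different lengths) is rejected
# outright by the built-in re module...
try:
    re.compile(r"(?<=a|bb)c")
except re.error as err:
    print(err)  # e.g. "look-behind requires fixed-width pattern"

# ...while regex accepts it, but can no longer handle it as a plain
# finite-state match.
regex.compile(r"(?<=a|bb)c")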
The variable-width expressions crept in over time, once we had switched over to
the regex
library in order to make use of its better unicode support,
especially for Python 2. Performance got worse bit by bit, as many of the
regular expressions were adapted across a number of contributions that widened
support to new languages and fixed specific problems. By the time we noticed the
efficiency problems, refactoring the tokenization rules to remove the
variable-width look-behinds had become a significant project, requiring focussed
attention.
The problem was finally solved by
Sofie Van Landeghem, in the first of what we hope
will be many consulting projects for spaCy. The changes improve the efficiency
of the tokenizer by two to three times, with equivalent accuracy when
evaluated against the Universal Dependencies corpora. This also allows us to
avoid depending on the regex
module, and instead switch back to Python’s
built-in re
library.
A matrix multiplication odyssey
The first versions of spaCy used models trained with the Averaged Perceptron algorithm, one of the simplest machine learning models. This meant that prior to v2.0, we had no real need for a maths library – the performance bottlenecks were in the hash table and feature extraction.
All that changed in v2.0, when we switched over to neural network models. In a neural network model, the performance bottleneck is matrix multiplication. Not all matrix multiplication solutions are created equal. Using an implementation that’s well optimized for your hardware can deliver an order of magnitude better performance than a generic implementation. Worse still, different implementations have different bugs.
The upshot of all this is that if you have two numpy arrays and you write
A @ B
, you might find that your code runs 20 times slower on your server than
your desktop, that performance with pip is dramatically different from performance
with conda, and your colleagues report intermittent crashes when running the
code on their laptops, but only in some modules, and not in others.
We were pretty dissatisfied with that, so we set out to fix it. Our humble goal was to make sure that when spaCy multiplied its matrices together, it always called into the same library – regardless of your choice of operating system, and regardless of whether you installed spaCy using pip or conda.
Our other humble aim was to make sure that spaCy jobs don’t launch a bunch of unwanted threads. The task of processing a bunch of text with spaCy is embarrassingly parallel, so you want to make sure you’re parallelizing the outermost loop possible. Nested parallelism is inefficient, which means the matrix multiplication library must not launch threads. This is something Accelerate, OpenBLAS and MKL all get terribly wrong.
Achieving these two humble aims turned out to be an enormous year-long struggle. First of all, what happens if we just do nothing, and use numpy? Well, numpy will delegate its matrix multiplications to a system library. The choice of system library depends on the state of your system during installation, and whether you installed numpy using pip or conda. On conda, numpy will usually be linked against Intel’s MKL library. On pip, your mileage may vary, but you’ll usually find yourself with a kernel from OpenBLAS if you’re using Windows or Linux, and the native Accelerate library on OSX. On my machines, the vendored OpenBLAS kernel often performs poorly, while the Accelerate kernel can crash when used in combination with Python’s multiprocessing module. People are working hard on all these problems, so the specifics may change within a month or two… But the basic unreliability of this approach will remain. If you can’t easily make your desktop, your colleague’s laptop and your server all run the same code, you’re going to have a bad time.
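If you want to see which of these backends your own numpy installation ended up with, you can inspect its build configuration – a quick check, nothing spaCy-specific:

import numpy as np

# Prints the BLAS/LAPACK libraries numpy was linked against, e.g. MKL with
# conda, OpenBLAS in many pip wheels, or Accelerate on macOS.
np.show_config()

Which environment variable controls the backend's thread pool also varies (OPENBLAS_NUM_THREADS, MKL_NUM_THREADS, OMP_NUM_THREADS, depending on the library) – exactly the sort of unintuitive configuration we wanted to stop requiring.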
Another solution would be to pick a library, require it to be installed into the system, and link against it. On conda, this would work okay, as conda allows you to specify non-Python dependencies. With pip, the user experience from this approach is pretty bad, especially for Windows users. In order to install the system dependency, a Windows user would have to install and configure the correct compiler, and compile the library from source. This is likely to be at least a whole day of yak shaving misery. The specifics will probably also change over time, so the guides we provide will be constantly going out of date.
The only way to make sure that pip install spacy
works correctly is to provide
a self-contained package which includes the necessary matrix multiplication
routines. This also solves the threading problems: we can make sure that no
threads are launched unless we want them, without requiring unintuitive
environment variables to be set.
Preparing this stand-alone package was one of the most joyless programming tasks
imaginable. Many extremely unfun days were spent ensuring the solution worked
successfully on Windows, OSX and Linux. Getting manylinux
wheels built using
the various CI solutions was another extremely frustrating saga.
At the end of it all, we’re relieved to now depend on
our new package cython-blis
. We’re
very grateful to Field Van Zee and the rest of the Blis community for their work
on these linear algebra routines, which we’ve found to offer a great blend of
stability, performance and usability. We’re still waiting for our package to be
merged on conda-forge
, but we’ve been using cython-blis
for months now on
the spacy-nightly
branch, and have had no problems.
No compiler required
spaCy is mostly written in Cython, and it relies on several other packages that
make use of C extensions. In order for pip install spacy
to work, you would
need to make sure a compiler was installed and the Python development headers
were available. If everything was working correctly, installation would then
take a few minutes to complete. Last year, we managed to improve installation
times significantly, by providing wheel installation files for spaCy and our
other packages. To make this happen, we teamed up with
Nathaniel Smith to build
Wheelwright, a more user-friendly
interface into
Matthew Brett’s multibuild
, an
awesome contraption that uses layers of scripts and Docker containers to
convince Travis and Appveyor to build wheel installation files that work on a
wide variety of platforms.
For the v2.1 release, we’ve managed to consolidate or eliminate several of spaCy’s dependencies, allowing us to offer a fully wheeled installation. Here’s how spaCy’s dependencies look now:
requirements.txt
# Our libraries
cymem>=2.0.2,<2.1.0
preshed>=2.0.1,<2.1.0
thinc>=7.0.2,<7.1.0
blis>=0.2.2,<0.3.0
murmurhash>=0.28.0,<1.1.0
wasabi>=0.1.3,<1.1.0
srsly>=0.0.5,<1.1.0

# Third-party dependencies
plac<1.0.0,>=0.9.6
tqdm>=4.10.0,<5.0.0
numpy>=1.15.0
requests>=2.13.0,<3.0.0
jsonschema>=2.6.0,<3.0.0
Both numpy
and requests
are so widely used they’re almost part of the Python
standard library, and plac
, tqdm
and jsonschema
are very small pure Python
packages. All of the other requirements are in-house, allowing us to make sure
wheel installation files are available.
Minimizing our third-party dependencies also greatly increases the library’s stability. Due to Python’s import semantics, only one version of a given package can be installed in an environment at a time. This means that every third-party dependency we add increases the chance that our users will wake up to broken builds.
Let’s say you work on some self-contained project for a couple of months, and
then put it aside. Maybe you’re a grad student working on a paper, maybe you’re
a data scientist working on a prototype. Either way, you do things mostly right:
you create a requirements.txt
file and write the exact version of spaCy you
tested with, like this: spacy==2.0.11
. You write tests, have a script that
repeats your experiment, and write a clear README
with instructions. You test
that everything works, and set it aside. Six months later someone follows your
instructions exactly, and your code doesn’t work. Maybe it doesn’t install.
Maybe it trains for days and at the end of all that the model fails to save.
Either way, people are suddenly grumpy at you and you have to drop what you’re
doing, dig through your old project that you swear used to work, and figure out
what went wrong. This will be a worse day than it should’ve been.
Now, some of you will be saying “Well actually, that’s all your fault. You
should’ve used pip freeze
.” This is sort of true: the reason your code broke
was that you only specified precise versions for your direct dependencies. spaCy
v2.0.11
might have a requirement like msgpack>=0.5.0,<0.6.0
. Then one day,
msgpack
publishes v0.5.9
. If this version causes problems for spaCy, then
pip install spacy==2.0.11
will have new bugs today that it didn’t have
yesterday.
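To make the failure mode concrete, here's the difference in spirit (package names are real, version numbers purely illustrative):

# Hand-written requirements.txt: only the direct dependency is pinned
spacy==2.0.11

# Written by `pip freeze`: transitive dependencies are pinned as well,
# so a new msgpack release can't change what gets installed
spacy==2.0.11
msgpack==0.5.6
msgpack-numpy==0.4.1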
So, it’s true that you could’ve prevented your problems if you’d used
pip freeze
to fully specify all your requirements. But c’mon! If you’re
walking along lost in thought and fall into a gaping hole in the sidewalk, it’s
definitely true that there was an element of user error. You could have just
walked around it. But, also, why the hell is there a gaping hole in the sidewalk
here?
The thing is, pip freeze
isn’t even a fully general solution. If all the
projects pin all their dependencies, users will constantly be faced with
artificial incompatibilities. Every now and then, someone will ask that your
constraints be relaxed, so that your project can be used in combination with
someone else’s. SemVer is supposed to prevent this, but
whether a change is “breaking” can be very subtle. Even updating a single
default value in your API could be a breaking change for someone. But marking
every change as breaking is no better than marking no changes as breaking. If
every release increments the major version, you may as well have just one
number. The other two only help if you draw a distinction.
For v2.1, we’ve managed to remove dependence on the following libraries:
ujson
, dill
, regex
, msgpack
, msgpack-numpy
, cytoolz
, wrapt
and
six
. The biggest improvement came from replacing the serialization libraries
with our own package srsly
. Every
dependency we introduce slightly increases the maintenance burden for all
packages and projects that depend on spaCy. The more quickly that dependency
moves, the bigger the problem. Unmaintained dependencies also pose a potential
problem, as they introduce a security risk. That’s why the changes we’ve made to
spaCy’s dependency tree are some of the improvements we’re most happy with. What
you’ll notice immediately is the improved installation times and reduced
system requirements from fully wheel-compatible installation. But what we’re
looking forward to most is the peace of mind: spaCy should be more stable going
forward, and be more readily compatible with the rest of the Python ecosystem.