Version 2.1 of the spaCy Natural Language Processing library includes a huge number of features, improvements and bug fixes. In this post, we highlight some of the things we’re especially pleased with, and explain some of the most challenging parts of preparing this big release.
spaCy is an open-source library for industrial-strength natural language
processing in Python. It’s widely used in production and research systems for
extracting information from text, developing smarter user-facing features, and
preprocessing text for deep learning. We’ve been publishing alpha releases to
spacy-nightly
for months now, and encouraging users to try out the new
version. Today we’re excited to finally publish spaCy v2.1.0. We’ve fixed almost
every outstanding bug on the tracker, given the docs a
huge makeover, improved both speed and accuracy, made installation significantly
easier and faster, and developed some exciting new features. Check out the
release notes for a
full overview.
Language model pretraining
By far the biggest news in NLP research over 2018 was the success of language model pretraining. The basic intuition behind this has been obvious for a very long time. There’s never been much doubt that NLP models need to somehow import knowledge from raw text, as labelled training corpora tend to be too small to represent long-tailed knowledge about word meanings and usage. In 2011, deep learning methods were proving successful for NLP, and techniques for pretraining word representations were already in use. A range of techniques for pretraining further layers of the network were proposed over the years, as the deep learning hype took hold. However, no one objective for the pretraining seemed to be a knockout success on a wide range of tasks.
In 2018, a number of papers showed that a simple language modelling objective worked well for LSTM models. Devlin et al. then presented a neat modification that allowed bidirectional models to be pretrained as well. One of the major themes throughout these results was that pretraining allowed extremely large models to be used, even when the labelled data is fairly small. A team from OpenAI took this one step further, training an even larger transformer language model, GPT-2, and showing that it performs well on long-form text generation.
While these large models provide convincing demonstrations, they’re not suitable for spaCy’s main use-cases. The performance target we’ve set for ourselves is 10,000 words per second per CPU core. The v2.1 models currently run at around 8,000 words per second, so we’re already slightly behind. Clearly, we couldn’t use a model such as BERT or GPT-2 directly. But the same principle of pretraining should still apply, so long as we could find a way to scale it down.
Scaling down these language models to the sizes we use in spaCy posed an interesting research challenge. Language models typically use a large output layer, with one neuron per word in the vocabulary. If you’re predicting over a 10,000 word vocabulary, this means you’re predicting a vector with 10,000 elements. spaCy v2.1’s token vectors are 96 elements wide, so a naive softmax approach would be unlikely to work: we’d be trying to predict 100 elements of output for every 1 element of input. We could make the vocabulary somewhat smaller, but every word that’s out of vocabulary is a word the pretraining process will be unable to learn. Stepping back a little, the problem of so-called “one hot” representations posing representational issues for neural networks is actually quite familiar. This is exactly what algorithms like word2vec, GloVe and FastText set out to solve. Instead of a binary vector with one dimension per entry in the vocabulary, we can have a much denser real-valued representation of the same information.
The spacy pretrain
command requires a
word vectors model as part of the input, which it uses as the target output for
each token. Instead of predicting a token’s ID as a classification problem, we
learn to predict the token’s word vector. Inspired by names such as ELMo and
BERT, we’ve termed this trick Language Modelling with Approximate Outputs
(LMAO). Our first implementation is probably a good way to get acquainted with
the idea –
it’s extremely short.
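If you just want the shape of the idea, here's a minimal sketch in plain numpy (illustrative only – spaCy's real implementation lives in Thinc and differs in detail):

import numpy as np

# Sketch of Language Modelling with Approximate Outputs (LMAO):
# instead of predicting a token ID over the whole vocabulary, the network
# predicts the token's pretrained word vector, and we train with an L2 loss.
def lmao_l2_loss(predicted, target_vectors):
    # predicted: (n_tokens, width) array output by the tok2vec model
    # target_vectors: (n_tokens, width) pretrained vectors for the same tokens
    diff = predicted - target_vectors
    loss = (diff ** 2).sum()
    d_predicted = 2 * diff  # gradient passed back into the tok2vec layers
    return loss, d_predicted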
As is often the case in research, it seems that LMAO is an idea whose time had come. Several other researchers have been working on related ideas independently. So far we’ve been using L2 loss in our experiments, but Kumar and Tsvetkov (2018), who were simultaneously working on a similar idea for machine translation, have developed a novel probabilistic loss using the von Mises-Fisher distribution, which they show performs significantly better than L2 in their experiments. Even more recently, Li et al. (2019) report experiments using an LMAO objective in place of the softmax layer in the ELMo pretraining system, with promising results. In our own preliminary experiments, we’ve found pretraining especially effective when limited training data is available. It helps most for text categorization and parsing, but is less effective for named entity recognition. We expect the pretraining to be increasingly important as we add more abstract semantic prediction models to spaCy, for tasks such as semantic role labelling, coreference resolution and named entity linking.
Example: 100,000 Reddit comments
As a small example, we ran spacy pretrain
for the English
sm
and lg
models using 100,000 comments from
the Reddit comments corpus:
Pretraining examples
# Pretrain for the en_core_web_sm model. The sm model doesn't require the word vectors
# at runtime, while the lg model does.
python -m spacy pretrain /input/reddit-100k.jsonl en_vectors_web_lg /output

# Pretrain for the en_core_web_lg model
python -m spacy pretrain /input/reddit-100k.jsonl en_vectors_web_lg /output --use-vectors
We ran both pretraining jobs simultaneously on a Tesla V100, with each task
training at around 50,000 tokens per second. We pretrained for 3 billion words
(making several passes over the 100k comments), which took around 17 hours. The
total cost of both jobs came out to about
$40.00 on Google Compute Engine. We haven’t implemented resume logic yet; once we have, it should decrease the cost of large-scale jobs further by allowing the use of pre-emptible instances. This would take pretraining costs down to around $4
per billion words of training. The
spacy pretrain
command saves out a
weights file after each pass over the data. To use the pretrained weights, we
can simply pass them as an argument to
spacy train
:
python -m spacy train en /models/ \
    /corpora/PTB_SD_3_3_0/train.gold.json \
    /corpora/PTB_SD_3_3_0/dev.gold.json \
    --n-examples 100 --pipeline parser \
    --init-tok2vec pretrain-nv-model999.bin
Example: Norwegian core model
We’re also pleased to report our first independent positive result for the spaCy
pretrain command. Jari Bakken and
Ole Henrik Skogstrøm have been working on
Norwegian Bokmål support for spaCy, using NER annotations produced by the
University of Oslo. Even with a small amount of pretraining using default
settings, the spacy pretrain
command resulted in much better performance for
all three components: the tagger, parser and entity recognizer.
Pretraining | POS | UAS | LAS | NER P | NER R | NER F |
---|---|---|---|---|---|---|
❌ no | 94.60 | 88.59 | 86.10 | 71.96 | 70.54 | 71.24 |
✅ yes | 95.07 | 90.14 | 87.82 | 78.92 | 78.69 | 78.81 |
Improved rule-based components
Over the years, the
rule-based Matcher
has become
one of spaCy’s most popular features. Statistical models are great at generalizing
based on context and beyond specific examples – but they can’t always beat
large terminology lists and application-specific rules. Rule-based systems are
especially powerful when they can leverage statistical predictions, like
part-of-speech tags, syntactic dependencies or entity labels.
spaCy v2.1 ships with a new matcher engine, rewritten from scratch. It resolves
various issues around the use of operators and quantifiers like "OP": "?"
to
make a token optional. The API also introduces
new predicates
to express set membership or rich comparison. The following pattern matches a
sequence of two tokens: a pronoun whose lowercase form isn’t “i” or “it”,
followed by a verb with the base form “like” or “love”:
pattern = [ {"POS": "PRON", "LOWER": {"NOT_IN": ["i", "it"]}}, {"POS": "VERB", "LEMMA": {"IN": ["like", "love"]}}, ]
The new match pattern API now also supports a "_"
key, allowing patterns to
specify custom extension attribute values to match on. In this case, the pattern
matches a token if token._.number
is greater than or equal to 20:
pattern = [{"_": {"number": {">=": 20}}}]
Rule-based entity recognition
When we introduced custom pipeline
components in v2.0, many users took advantage of them to build their own
rule-based entity recognizers powered by the Matcher
. Whether it’s cities,
gene names or units for oil drilling, many entity types can be expressed pretty
unambiguously with terminology lists and token-based rules.
The EntityRuler
is
a useful new component that can do all of this out-of-the-box. If it’s added
before the entity recognizer in the pipeline, the entities it sets directly
influence the model’s predictions. The statistical entity recognizer will
respect pre-defined entity spans and take them into account when predicting the
entity tags for the remaining tokens, which can potentially give you a nice
boost in accuracy. If the entity ruler is added after the statistical entity
recognizer, it can “fill in the blanks” and catch entities that the model
missed, or optionally overwrite existing predictions.
Using the entity ruler
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")
weights_pattern = [
    {"LIKE_NUM": True},
    {"LOWER": {"IN": ["g", "kg", "grams", "kilograms", "lb", "lbs", "pounds"]}}
]
patterns = [{"label": "QUANTITY", "pattern": weights_pattern}]
ruler = EntityRuler(nlp, patterns=patterns)
nlp.add_pipe(ruler, before="ner")

doc = nlp("U.S. average was 2 lbs.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('U.S.', 'GPE'), ('2 lbs', 'QUANTITY')]
A pattern can either be a list of dictionaries describing the individual tokens, or an exact string match. If you’ve been using our annotation tool Prodigy, you might recognize this format from the pattern files you can load in to bootstrap new entity types and text categories. The formats are fully compatible, so you’ll be able to use your Prodigy patterns with the entity ruler, and vice versa.
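For instance, phrase patterns (exact strings) and token patterns can live side by side in the same ruler – a quick sketch with made-up example patterns:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp)
ruler.add_patterns([
    # Phrase pattern: an exact string match
    {"label": "ORG", "pattern": "Explosion AI"},
    # Token pattern: a list of dicts describing each token
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])
nlp.add_pipe(ruler, before="ner")

doc = nlp("Explosion AI is based in Berlin, not San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])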
The EntityRuler
is also fully serializable, making it easy to package entity
rules with your spaCy models. Patterns will be saved out to the model directory
as a .jsonl
file (newline-delimited JSON) and loaded back in when you load the
model. We’re hoping that this component can be used to power models that rely on
large domain-specific terminology lists.
Serializing the entity ruler
nlp = spacy.load("en_core_web_sm")ruler = EntityRuler(nlp, patterns=lots_of_patterns)nlp.add_pipe(ruler, before="ner")nlp.to_disk("/path/to/model-with-rules")
Retokenization
spaCy has always supported merging spans of several tokens into single tokens –
for example, to merge a noun phrase into one word. However, the existing
Doc.merge
and Span.merge
implementations were inefficient when merging in
bulk, because the array had to be resized each time. On top of that, it was
difficult to keep track of changing token indices, and easy to end up with
incorrectly merged spans.
The new
Doc.retokenize
context manager
is specifically optimized for bulk processing. Merges are collected and
performed when the context manager exits.
Retokenization with merging
doc = nlp("I moved from New York to Los Angeles")with doc.retokenize() as retokenizer:retokenizer.merge(doc[3:5], attrs={"LEMMA": "New York"})retokenizer.merge(doc[6:8], attrs={"LEMMA": "Los Angeles"})
In addition to merging, Doc.retokenize
can also split one token into several.
The process requires more settings, because you need to specify the text of the
individual tokens, optional per-token attributes and how the new tokens should
be attached to the existing syntax tree. To prevent mismatches, the heads can be
provided as tokens, or (token, subtoken)
tuples if the newly split token
should be attached to another subtoken.
Retokenization with splitting
doc = nlp("I live in NewYork")with doc.retokenize() as retokenizer:heads = [(doc[3], 1), doc[2]]attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
With better splitting and merging, we’re also well set up for better support
for statistical tokenization. Tokenizing languages like English and German
works fine with a rule-based approach, but for languages like Chinese, Vietnamese
and Japanese, statistical models are definitely required. The v2.1 release also
has some quiet improvements that will help set the stage for better support for
these languages. The GoldParse
class is now able to calculate many-to-many
alignments between the tokenization in a Doc
object and the gold-standard.
When the Doc
object over-segments, the parser can now learn to predict a
special label that can mark the tokens for merging by a later component. With
this approach, spaCy’s parser is now capable of jointly predicting
tokenization, sentence segmentation and parsing, which should be very helpful
for languages or genres with high mutual information between these problems.
Better, faster tokenization
v2.1 also fixes long-standing performance problems in the tokenizer's regular expressions. The biggest issue was variable-width lookbehinds, caused by character classes
which actually consist of multiple-character tokens (like ’’
). This meant that
they needed to be grouped into disjunctive expressions using |
, rather than
character classes using [
] which only require a set lookup. Variable-width
lookbehinds introduce a serious performance problem, because they can’t be
recognized by a finite-state automaton. Essentially, you’re no longer dealing
with “regular expressions” once you have these. Russ Cox (who wrote re2
) has a
very comprehensive overview of
these issues.
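A toy illustration of the difference (not one of spaCy's actual tokenizer expressions; assumes the third-party regex package is installed):

import re
import regex

# A fixed-width lookbehind compiles in both libraries and can be matched
# with a simple set lookup.
re.compile(r"(?<=[ab])c")

# A variable-width lookbehind (alternatives of different lengths) is rejected
# outright by the built-in re module...
try:
    re.compile(r"(?<=a|bb)c")
except re.error as err:
    print(err)  # e.g. "look-behind requires fixed-width pattern"

# ...while regex accepts it, but can no longer handle it as a plain
# finite-state match.
regex.compile(r"(?<=a|bb)c")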
The variable-width expressions crept in over time, once we had switched over to
the regex
library in order to make use of its better unicode support,
especially for Python 2. Performance got worse bit by bit, as many of the
regular expressions were adapted across a number of contributions that widened
support to new languages and fixed specific problems. By the time we noticed the
efficiency problems, refactoring the tokenization rules to remove the
variable-width look-behinds had become a significant project, requiring focussed
attention.
The problem was finally solved by
Sofie Van Landeghem, in the first of what we hope
will be many consulting projects for spaCy. The changes improve the efficiency
of the tokenizer by two to three times, with equivalent accuracy when
evaluated against the Universal Dependencies corpora. This also allows us to
avoid depending on the regex
module, and instead switch back to Python’s
built-in re
library.
A matrix multiplication odyssey
The first versions of spaCy used models trained with the Averaged Perceptron algorithm, one of the simplest machine learning models. This meant that prior to v2.0, we had no real need for a maths library – the performance bottlenecks were in the hash table and feature extraction.
All that changed in v2.0, when we switched over to neural network models. In a neural network model, the performance bottleneck is matrix multiplication. Not all matrix multiplication solutions are created equal. Using an implementation that’s well optimized for your hardware can deliver an order of magnitude better performance than a generic implementation. Worse still, different implementations have different bugs.
The upshot of all this is that if you have two numpy arrays and you write
A @ B
, you might find that your code runs 20 times slower on your server than
your desktop, that performance with pip is dramatically different from performance
with conda, and your colleagues report intermittent crashes when running the
code on their laptops, but only in some modules, and not in others.
We were pretty dissatisfied with that, so we set out to fix it. Our humble goal was to make sure that when spaCy multiplied its matrices together, it always called into the same library – regardless of your choice of operating system, and regardless of whether you installed spaCy using pip or conda.
Our other humble aim was to make sure that spaCy jobs don’t launch a bunch of unwanted threads. The task of processing a bunch of text with spaCy is embarrassingly parallel, so you want to make sure you’re parallelizing the outermost loop possible. Nested parallelism is inefficient, which means the matrix multiplication library must not launch threads. This is something Accelerate, OpenBLAS and MKL all get terribly wrong.
Achieving these two humble aims turned out to be an enormous year-long struggle. First of all, what happens if we just do nothing, and use numpy? Well, numpy will delegate its matrix multiplications to a system library. The choice of system library depends on the state of your system during installation, and whether you installed numpy using pip or conda. On conda, numpy will usually be linked against Intel’s MKL library. On pip, your mileage may vary, but you’ll usually find yourself with a kernel from OpenBLAS if you’re using Windows or Linux, and the native Accelerate library on OSX. On my machines, the vendored OpenBLAS kernel often performs poorly, while the Accelerate kernel can crash when used in combination with Python’s multiprocessing module. People are working hard on all these problems, so the specifics may change within a month or two… But the basic unreliability of this approach will remain. If you can’t easily make your desktop, your colleague’s laptop and your server all run the same code, you’re going to have a bad time.
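If you want to see which of these backends your own numpy installation ended up with, you can inspect its build configuration – a quick check, nothing spaCy-specific:

import numpy as np

# Prints the BLAS/LAPACK libraries numpy was linked against, e.g. MKL with
# conda, OpenBLAS in many pip wheels, or Accelerate on macOS.
np.show_config()

Which environment variable controls the backend's thread pool also varies (OPENBLAS_NUM_THREADS, MKL_NUM_THREADS, OMP_NUM_THREADS, depending on the library) – exactly the sort of unintuitive configuration we wanted to stop requiring.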
Another solution would be to pick a library, require it to be installed into the system, and link against it. On conda, this would work okay, as conda allows you to specify non-Python dependencies. With pip, the user experience from this approach is pretty bad, especially for Windows users. In order to install the system dependency, a Windows user would have to install and configure the correct compiler, and compile the library from source. This is likely to be at least a whole day of yak shaving misery. The specifics will probably also change over time, so the guides we provide will be constantly going out of date.
The only way to make sure that pip install spacy
works correctly is to provide
a self-contained package which includes the necessary matrix multiplication
routines. This also solves the threading problems: we can make sure that no
threads are launched unless we want them, without requiring unintuitive
environment variables to be set.
Preparing this stand-alone package was one of the most joyless programming tasks
imaginable. Many extremely unfun days were spent ensuring the solution worked
successfully on Windows, OSX and Linux. Getting manylinux
wheels built using
the various CI solutions was another extremely frustrating saga.
At the end of it all, we’re relieved to now depend on
our new package cython-blis
. We’re
very grateful to Field Van Zee and the rest of the Blis community for their work
on these linear algebra routines, which we’ve found to offer a great blend of
stability, performance and usability. We’re still waiting for our package to be
merged on conda-forge
, but we’ve been using cython-blis
for months now on
the spacy-nightly
branch, and have had no problems.
No compiler required
spaCy is mostly written in Cython, and it relies on several other packages that
make use of C extensions. In order for pip install spacy
to work, you would
need to make sure a compiler was installed and the Python development headers
were available. If everything was working correctly, installation would then
take a few minutes to complete. Last year, we managed to improve installation
times significantly, by providing wheel installation files for spaCy and our
other packages. To make this happen, we teamed up with
Nathaniel Smith to build
Wheelwright, a more user-friendly
interface into
Matthew Brett’s multibuild
, an
awesome contraption that uses layers of scripts and Docker containers to
convince Travis and Appveyor to build wheel installation files that work on a
wide variety of platforms.
For the v2.1 release, we’ve managed to consolidate or eliminate several of spaCy’s dependencies, allowing us to offer a fully wheeled installation. Here’s how spaCy’s dependencies look now:
requirements.txt
# Our libraries
cymem>=2.0.2,<2.1.0
preshed>=2.0.1,<2.1.0
thinc>=7.0.2,<7.1.0
blis>=0.2.2,<0.3.0
murmurhash>=0.28.0,<1.1.0
wasabi>=0.1.3,<1.1.0
srsly>=0.0.5,<1.1.0

# Third-party dependencies
plac<1.0.0,>=0.9.6
tqdm>=4.10.0,<5.0.0
numpy>=1.15.0
requests>=2.13.0,<3.0.0
jsonschema>=2.6.0,<3.0.0
Both numpy
and requests
are so widely used they’re almost part of the Python
standard library, and plac
, tqdm
and jsonschema
are very small pure Python
packages. All of the other requirements are in-house, allowing us to make sure
wheel installation files are available.
Minimizing our third-party dependencies also greatly increases the library’s stability. Due to Python’s import semantics, only one version of a given package can be installed in an environment at a time. This means that every third-party dependency we add increases the chance that our users will wake up to broken builds.
Let’s say you work on some self-contained project for a couple of months, and
then put it aside. Maybe you’re a grad student working on a paper, maybe you’re
a data scientist working on a prototype. Either way, you do things mostly right:
you create a requirements.txt
file and write the exact version of spaCy you
tested with, like this: spacy==2.0.11
. You write tests, have a script that
repeats your experiment, and write a clear README
with instructions. You test
that everything works, and set it aside. Six months later someone follows your
instructions exactly, and your code doesn’t work. Maybe it doesn’t install.
Maybe it trains for days and at the end of all that the model fails to save.
Either way, people are suddenly grumpy at you and you have to drop what you’re
doing, dig through your old project that you swear used to work, and figure out
what went wrong. This will be a worse day than it should’ve been.
Now, some of you will be saying “Well actually, that’s all your fault. You
should’ve used pip freeze
.” This is sort of true: the reason your code broke
was that you only specified precise versions for your direct dependencies. spaCy
v2.0.11
might have a requirement like msgpack>=0.5.0,<0.6.0
. Then one day,
msgpack
publishes v0.5.9
. If this version causes problems for spaCy, then
pip install spacy==2.0.11
will have new bugs today that it didn’t have
yesterday.
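To make the failure mode concrete, here's the difference in spirit (package names are real, version numbers purely illustrative):

# Hand-written requirements.txt: only the direct dependency is pinned
spacy==2.0.11

# Written by `pip freeze`: transitive dependencies are pinned as well,
# so a new msgpack release can't change what gets installed
spacy==2.0.11
msgpack==0.5.6
msgpack-numpy==0.4.1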
So, it’s true that you could’ve prevented your problems if you’d used
pip freeze
to fully specify all your requirements. But c’mon! If you’re
walking along lost in thought and fall into a gaping hole in the sidewalk, it’s
definitely true that there was an element of user error. You could have just
walked around it. But, also, why the hell is there a gaping hole in the sidewalk
here?
The thing is, pip freeze
isn’t even a fully general solution. If all the
projects pin all their dependencies, users will constantly be faced with
artificial incompatibilities. Every now and then, someone will ask that your
constraints be relaxed, so that your project can be used in combination with
someone else’s. SemVer is supposed to prevent this, but
whether a change is “breaking” can be very subtle. Even updating a single
default value in your API could be a breaking change for someone. But marking
every change as breaking is no better than marking no changes as breaking. If
every release increments the major version, you may as well have just one
number. The other two only help if you draw a distinction.
For v2.1, we’ve managed to remove dependence on the following libraries:
ujson
, dill
, regex
, msgpack
, msgpack-numpy
, cytoolz
, wrapt
and
six
. The biggest improvement came from replacing the serialization libraries
with our own package srsly
. Every
dependency we introduce slightly increases the maintenance burden for all
packages and projects that depend on spaCy. The more quickly that dependency
moves, the bigger the problem. Unmaintained dependencies also pose a potential
problem, as they introduce a security risk. That’s why the changes we’ve made to
spaCy’s dependency tree are some of the improvements we’re most happy with. What
you’ll notice immediately is the improved installation times and reduced
system requirements from fully wheel-compatible installation. But what we’re
looking forward to most is the peace of mind: spaCy should be more stable going
forward, and be more readily compatible with the rest of the Python ecosystem.