The following are some hasty preliminary notes on how spaCy works. The short story is, there are no new killer algorithms. The way that the tokenizer works is novel and a bit neat, and the parser has a new feature set, but otherwise the key algorithms are well known in the recent literature.
Some might also wonder how I get Python code to run so fast. I don’t — spaCy is written in Cython, an optionally statically-typed language that compiles to C or C++, which is then loaded as a C extension module. This makes it easy to achieve the performance of native C code, but allows the use of Python language features, via the Python C API. The Python unicode library was particularly useful to me. I think it would have been much more difficult to write spaCy in another language.
Tokenization is the task of splitting a string into meaningful pieces, called tokens, which you can then compute with. In practice, the task is usually to match the tokenization performed in some treebank, or other corpus. If we want to apply a tagger, entity recogniser, parser etc, then we want our run-time text to match the training conventions. If we want to use a model that’s been trained to expect “isn’t” to be split into two tokens, [“is”, “n’t”], then that’s how we need to prepare our data.
In order to train spaCy’s models with the best data available, I therefore tokenize English according to the Penn Treebank scheme. It’s not perfect, but it’s what everybody is using, and it’s good enough.
The Penn Treebank was distributed with a script called tokenizer.sed, which tokenizes ASCII newswire text roughly according to the Penn Treebank standard. Almost all tokenizers are based on these regular expressions, with various updates to account for unicode characters, and the fact that it’s no longer 1986 — today’s text has URLs, emails, emoji, etc.
Usually, the resulting regular expressions are applied in multiple passes, which is quite inefficient. Often no care is taken to preserve indices into the original string. If you lose these indices, it’ll be difficult to calculate mark-up based on your annotations.
spaCy’s tokenizer assumes that no tokens will cross whitespace — there will be no multi-word tokens. If we want these, we can post-process the token-stream later, merging as necessary. This assumption allows us to deal only with small chunks of text. We can cache the processing of these, and simplify our expressions somewhat.
Here is what the outer-loop would look like in Python. (You can see the production implementation, in Cython, here.)
The actual work is performed in
_tokenize_substring. For this, I divide the
tokenization rules into three pieces:
- A prefixes expression, which matches from the start of the string;
- A suffixes expression, which matches from the end of the string;
- A special-cases table, which matches the whole string.
The algorithm then proceeds roughly like this (consider this like pseudo-code; this was written quickly and has not been executed):
This procedure splits off tokens from the start and end of the string, at each point checking whether the remaining string is in our special-cases table. If it is, we stop splitting, and return the tokenization at that point.
The advantage of this design is that the prefixes, suffixes and special-cases can be declared separately, in easy-to-understand files. If a new entry is added to the special-cases, you can be sure that it won’t have some unforeseen consequence to a complicated regular-expression grammar.
As mentioned above, the tokenizer is designed to support easy caching. If all we were caching were the matched substrings, this would not be so advantageous. Instead, what we do is create a struct which houses all of our lexical features, and cache that. The tokens are then simply pointers to these rich lexical types.
In a sample of text, vocabulary size grows exponentially slower than word count. So any computations we can perform over the vocabulary and apply to the word count are efficient.
In 2013, I wrote a blog post describing how to write a good part of speech tagger. My recommendation then was to use greedy decoding with the averaged perceptron. I think this is still the best approach, so it’s what I implemented in spaCy.
The tutorial also recommends the use of Brown cluster features, and case normalization features, as these make the model more robust and domain independent. spaCy’s tagger makes heavy use of these features.
The parser uses the algorithm described in my 2014 blog post. This algorithm, shift-reduce dependency parsing, is becoming widely adopted due to its compelling speed/accuracy trade-off.
Some quick details about spaCy’s take on this, for those who happen to know these models well. I’ll write up a better description shortly.
- I use greedy decoding, not beam search;
- I use the arc-eager transition system;
- I use the Goldberg and Nivre (2012) dynamic oracle.
- I use the non-monotonic update from my CoNLL 2013 paper (Honnibal, Goldberg and Johnson 2013).
So far, this is exactly the configuration from the CoNLL 2013 paper, which scored 91.0. So how have I gotten it to 92.4? The following tweaks:
- I use Brown cluster features — these help a lot;
- I redesigned the feature set. I’ve long known that the Zhang and Nivre (2011) feature set was suboptimal, but a few features don’t make a very compelling publication. Still, they’re important.
- When I do the dynamic oracle training, I also make the upate cost-sensitive: if the oracle determines that the move the parser took has a cost of N, then the weights for the gold class are incremented by +N, and the weights for the predicted class are incremented by -N. This only made a small (0.1-0.2%) difference.
I don’t do anything algorithmically novel to improve the efficiency of the parser. However, I was very careful in the implementation.
A greedy shift-reduce parser with a linear model boils down to the following loop:
The parser makes 2N transitions for a sentence of length N. In order to select the transition, it extracts a vector of K features from the state. Each feature is used as a key into a hash table managed by the model. The features map to a vector of weights, of length C. We then dot product the feature weights to the scores vector we are building for that instance.
The inner-most loop here is not so bad: we only have a few dozen classes, so pit’s just a short dot product. Both of the vectors are in the cache, so this pis a snack to a modern CPU.
The bottle-neck in this algorithm is the 2NK look-ups into the hash-table that we must make, as these almost always have to hit main memory. The feature-set is enormously large, because all of our features are one-hot boolean indicators. Some of the features will be common, so they’ll lurk around in the CPU’s cache hierarchy. But a lot of them won’t be, and accessing main memory takes a lot of cycles.
I used to use the Google densehashmap implementation. This seemed a solid choice: it came from a big brand, it was in C++, and it seemed very complicated. Later, I read Jeff Preshing’s excellent post on open-addressing with linear probing. This really spoke to me. I had assumed that a fast hash table implementation would necessarily be very complicated, but no — this is another situation where the simple strategy wins.
I’ve packaged my Cython implementation separately from spaCy, in the package preshed — for “pre-hashed”, but also as a nod to Preshing. I’ve also taken great care over the feature extraction and perceptron code, which I’m distributing in a package named thinc (since it’s for learning very sparse models with Cython).
By the way: from comparing notes with a few people, it seems common to implement linear models in a way that’s suboptimal for multi-class classification. The mistake is to store in the hash-table one weight per (feature, class) pair, rather than mapping the feature to a vector of weights, for all of the classes. This is bad because it means you need to hit the table C times, one per class, as you always need to evaluate a feature against all of the classes. In the case of the parser, this means the hash table is accessed 2NKC times, instead of the 2NK times if you have a weights vector. You should also be careful to store the weights contiguously in memory — you don’t want a linked list here. I use a block-sparse format, because my problems tend to have a few dozen classes.
I guess if I had to summarize my experience, I’d say that the efficiency of these models is really all about the data structures. We want to stay small, and stay contiguous. Minimize redundancy and minimize pointer chasing. That’s why Cython is so well suited to this: we get to lay out our data structures, and manage the memory ourselves, with full C-level control.