In 2016 we trained a sense2vec model on the 2015 portion of the Reddit comments corpus, leading to a useful library and one of our most popular demos. That work is now due for an update. In this post, we present a new version of the library, new vectors, new evaluation recipes, and a demo NER project that we trained to usable accuracy in just a few hours.
sense2vec (Trask et. al, 2015) is a twist on
the word2vec family of algorithms that lets you learn more interesting word
vectors. Before training the model, the text is preprocessed with linguistic
annotations, to let you learn vectors for more precise concepts.
Part-of-speech tags are particularly helpful: many words have very different
senses depending on their part of speech, so it’s useful to be able to query for
the synonyms of
duck|NOUN separately. Named entity annotations
and noun phrases can also help, by letting you learn vectors for multi-word
We’ve loved this idea for a long time, so it’s now time to revisit it and take it a little bit further. First, we trained a new sense2vec model on the 2019 Reddit comments, which makes for an interesting contrast to the previous 2015 vectors. We’ve also updated the interactive demo and given the sense2vec library an overdue update, taking advantage of spaCy v2’s pipeline component and extension attribute systems. Finally, we prepared an end-to-end named entity recognition use-case for the technique, to show sense2vec’s practical applications. We used our annotation tool Prodigy to quickly bootstrap a terminology list using the sense2vec vectors, which we then used to produce a rule-based baseline for a new NER task. We then used Prodigy to quickly annotate training and evaluation data, which we then combine with spaCy’s transfer-learning functionality to train a model that achieves 82.1% accuracy.
We’ve had a lot of fun exploring our previous sense2vec model. The annotations make the results much more informative, especially the results for named entities or niche concepts. Contrasting the results between the 2015 and 2019 models provides an interesting new avenue to explore. Four years is a relatively short time on the scale of language change, but it’s easy to find examples of words, topics and entities that had very different associations in 2015.
|Query||Most similar 2015||Most similar 2019|
|Billy Ray Cyrus 1||Madonna, Kurt Cobain, Michael Jackson, Eminem, Tupac||Lil Nas X, Old Town Road, Post Malone, Lil Pump, Miley Cyrus|
|Harvey Weinstein 2||Martin Scorsese, Adam Sandler, Quentin Tarantino, Johnny Depp||Kevin Spacey, Bill Cosby, R. Kelly, Bryan Singer|
|AOC 3||BenQ, Monitor, 21.5, 60Hz, monitor-i2269vw, ViewSonic||Alexandria Ocasio-Cortez, Bernie Sanders, Nancy Pelosi, Elizabeth Warren|
|flex 4||grip, stiffen, lift, tighten||internet muscles, showoff, brag|
|ghost 5||stealth, insta, ether, blast, teleport||unmatch, mutual fade, flake, friendzone|
- Since Lil Nas X’s “Old Town Road” remix, country singer Billy Ray Cyrus shows up more similar to hip young rappers.
- In 2019, Harvey Weinstein was mostly discussed in similar contexts to other famous men accused or convicted of sexual assault.
- In 2015, “AOC” mostly referred to “active optical cable” or a Taiwanese electronics company as opposed to U.S. Representative Alexandria Ocasio-Cortez.
- “to flex” has become slang for showing off, which is reflected in the vectors.
- Interestingly, “to ghost” wasn’t very common in 2015. In 2019, it’s mostly used in the context of cutting off communication by “ghosting”.
Examples like Billy Ray Cyrus show how artists can take on new associations as they reinvent themselves over time. The results also show new word senses, and new public figures such as Alexandria Ocasio-Cortez. Social and political events can also cast public figures in a very different light. Many of the crimes discussed by the #MeToo movement were widely known in 2015, but even figures with several decades of credible accusations were discussed primarily in association with their work, rather than their crimes. The two vector models show several examples where that has changed. It will be interesting to see how the associations for these figures evolve further. Music, films and other art will always prompt new discussion, so the associations may shift back over time.
sense2vec is a Python package to
load and query vectors of words and multi-word phrases based on
part-of-speech tags and entity labels. It can be used as a standalone
library, or as a GitHub repo spaCy
pipeline component] with convenient custom attributes and methods for accessing
all relevant phrases in a
Doc object, or querying vectors and most similar
entries for tokens, noun phrases or entity spans. For more examples and the full
API documentation, see the
Word vectors are often evaluated with a mix of small quantitative test sets, and informal qualitative review. Common test sets include tasks such as analogical reasoning, or human similarity judgments, using a Likert scale. However, established test sets often don’t correspond well to the data being used, or the definition of similarity that the application requires. In practice, most data scientists fall back on the qualitative evaluation: they look at a few example queries, and ship the model if things mostly look okay. While few data scientists would endorse this as best practice, the qualitative evaluation does have important advantages. There are many different definitions of similarity, so whether one model is “better” than another is subjective. However, even if the evaluation is subjective, it’s still good to be a bit more rigorous. With better methodology, you can get more reliable results for the same amount of time spent inspecting the vectors.
Our evaluation workflows are implemented as Prodigy recipes, but you can also implement your own version as a command line script or using a static file. We initially experimented with an evaluation that would pick three random phrases from the most frequent entries and ask whether A was more similar to B or C. If you mostly agree with the model here, we could conclude that the trained vectors are reasonable and representative. It was a nice idea in theory. However, the lack of constraints and randomness produced some very interesting questions that were funny, but ultimately not very useful:
To solve this, we needed to limit the random selection to entries with the same
sense and limit the included senses so you’re not just evaluating the similarity
of punctuation (is
; more similar to
!?) or arbitrary interjections
rofl more similar to
lmao?). Other strategies we implemented
only select candidates from the most similar phrases of the target phrase or the
first candidate phrase. You can try out the different workflows in Prodigy using
--strategy option specifying the candidate selection strategy
Instead of evaluating similarity triples, you can also opt for a more direct
evaluation strategy. In the
recipe, you’re shown a query word and its nearest neighbors. You can then mark
which of the neighbors you consider to be undesirable results. This approach is
especially helpful for lower quality models, where there are some results that
stand out as clearly in error.
performs an A/B evaluation of two sense2vec vector models by comparing the most
similar entries they return for a random phrase. The UI shows two randomized
options with the most similar entries of each model and highlights the phrases
that differ. At the end of the annotation session the overall stats and
preferred model are shown. This is sort of like a blind taste-test: you
don’t have to know ahead of time what exactly you’re looking for from the
similarity results. It takes little time to compare two models reliably, and you
or another person can revisit the judgments later to check agreement.
We found the A/B evaluation to be the most useful approach when we were developing the vectors for this post. We used it to compare the 2019 and 2015 vector models, and were able to reach a conclusion after only 20 minutes of evaluation. We found that we preferred the 2019 output in 33 cases, the 2015 output in 17 cases, and were indifferent in an additional 51 cases. We had earlier performed a similar evaluation with a draft vectors model trained only on the January 2019 comments, in which we found we decisively preferred the 2015 outputs. We also ran evaluations to decide the choice of algorithm (skipgram with negative samples vs GloVe, preferring the skipgrams output), the window size (5 vs. 15, preferring the narrower window), and character vs word features (choosing to disable FastText’s character ngrams).
Let’s say you want to build an application that can detect fashion brands in online comments. Some brand names, like “adidas” or “Louis Vuitton” are pretty unambiguous. Others, like “New Balance” or “The North Face” are trickier, especially if you can’t rely on proper capitalization or correct spelling. To solve this problem, you could train a named entity recognizer to predict fashion brands in context, using labelled examples for training and evaluation and maybe pretrained representations to initialize the model.
The sense2vec library includes Prodigy workflows to speed up your data collection process with sense2vec vectors. Prodigy is a modern, fully-scriptable annotation tool for creating training data for machine learning models. For this example, we used pretrained sense2vec vectors to bootstrap an NER model. In under 2 hours, we created a dataset of 785 training examples and 500 evaluation examples from scratch and trained a model that achieved 82.1% accuracy on the task.
Even if your end goal is to train a neural network model, it’s often useful to start off with a rule-based baseline that you can evaluate the model against later. Maybe you’re training a model and are getting an accuracy of 85% on your task. That’s not bad. But if you know that a simple gazetteer or a few regular expressions will get you to 92% accuracy on the same dataset, the model looks much less impressive. You probably wouldn’t want to be shipping that as the only end-to-end solution.
One way to quickly set up a rule-based baseline or proof of concept is to create
a list of some of the most frequent instances of the entities and then match
them in the data. If you have a few examples of the entities you’re looking for,
sense2vec can help you find similar terms. All you have to do is select the
terms you want to keep and update the query vector with the already selected
recipe implements this as Prodigy workflow:
Here’s how it works under the hood. For each seed term, we can call
Sense2vec.get_best_sense to find the best-matching sense with the highest
Sense2vec.most_similar, we can then query the vectors for the
most similar entries compared to the average of the seed vectors:
As we annotate, the answers are sent back to the server and stored in the database. With each new batch of yes/no decisions, we can update our list of query phrases used in the similarity calculation, so the suggestions we see stay consistent with our previous decisions. Once you get into a good rhythm, annotation is super quick – in about 10 minutes, we were able to annotate 145 phrases in total and create a word list of 100 fashion brands.
To extract brand names from our raw data, we can now convert the word list to
match patterns. If you’ve used spaCy’s
Matcher] before, you’re probably familiar with the syntax: each pattern
consists of a list of dictionaries describing the individual tokens and their
helper converts the previously created dataset to a case-insensitive (or
optionally case-sensitive) JSONL file.
the most convenient way to use match patterns for rule-based named entity
recognition. It can be loaded from a JSONL and added to the processing pipeline.
The matched spans are then added to the
Depending on your use case, you might even choose to stop right here, or simply continue to build a larger terminology list. Many entity recognition problems have a fairly limited set of entities, with perhaps a few thousand different names. Gathering these entities as a list is a fine strategy to recognise them in running text – you don’t necessarily need a statistical model. Even if the entities are potentially ambiguous, like “Amazon”, you can often train a text classifier to recognise your domain, and then apply your rule-based entity approach. This is often both cheaper to annotate and more accurate, as you’re basing the statistical decision on a much wider context. However, with social media text like Reddit, rule-based approaches will usually hit a lower accuracy limit. In cases like this, the initial rule-based approach can be used as part of an annotation process, to quickly collect training and evaluation data so you can move forward with a statistical approach.
The entity ruler and its patterns serialize with the
nlp object when you call
nlp.to_disk. So you can export and package the model to use it in your
application, or load it into Prodigy’s
ner.make-gold recipe, which
streams in text, pre-highlights the
doc.ents and lets you correct them. You
could also use the
EntityRuler to pre-label data in spaCy, export it and then
correct it using any other tool or process. Even if your rules only get 50% of
the entities right, that’s still a lot less work for you.
In about 2 hours we worked through 2,516 texts and created 1,735 annotations (1,235 training examples, 500 evaluation examples). As this was a quick project, and we didn’t collect separate development and evaluation sets, we performed only a limited set of experiments, and avoided detailed hyperparameter searches. We’ve limited our experiments to spaCy, but you can use the annotations in any other NER system instead. If you run the experiments, please let us know!
|Accuracy||Words / Sec 1||Data|
|spaCy NER: blank||65.7||77.3||57.1||13k||72k||1235||~2h|
|spaCy NER: vectors4||73.4||81.5||66.8||13k||72k||1235||~2.5h|
|spaCy NER: vectors4 + tok2vec5||82.1||83.5||80.7||5k||68k||1235||~2.5h|
- Words per second measured on 300k words.
- Number of examples used for training.
- Total time spent on annotation and training, excl. pretraining vectors and representations.
- Common Crawl GloVe vectors.
- Trained on 1 billion words from Reddit comments using
spacy pretrainpredicting the GloVe vectors (~8 hours on GPU).
With default settings and no transfer learning, spaCy achieves an F-score of
65.7. On a small dataset like this, we recommend using pretrained word
vectors, and using them in combination with the
spacy pretrain command. spaCy makes both
techniques quite easy to use, and together they improve the F-score to 82.05 on
this data. This matches up well with our general experience: we’ve found that
spacy pretrain usually improves accuracy by about as much as pretrained word
vectors. When the training corpus is large (over one million words), the impact
from both techniques is usually small – less than 1 F-score. But on smaller
datasets, it’s not uncommon to see the 50% error reduction demonstrated here.
spacy pretrain command implements a cloze-style language modelling
objective, similar to the one introduced by
Devlin et al. (2018). However, we have an
additional innovation to make the objective work well for spaCy’s small
architectures: instead of predicting the exact word ID, we predict the word’s
vector, using a static embeddings table (we use the GloVe vectors for this,
with a cosine loss). We ran
spacy pretrain for one epoch on about 1 billion
words of Reddit comments, which completed in roughly 8 hours on a single GTX
2080 Ti. spaCy’s default token-vector encoding settings are a depth 4
convolutional neural network with width 96, and hash embeddings with 2000 rows.
These settings result in a very small model: the trainable weights are only 3.8
MB in total, including the word vectors. To take better advantage of the
pretraining, we increased these slightly, to depth 8, width 128 and 10,000 rows.
The larger model is still quite small: only 17.8 MB. Here’s the exact command we
We hope that the resources we’ve presented in this post will be useful to both researchers and practictioners, especially the library and the trained vector models. There’s also a lot further the idea could be taken. We mostly replicated the preprocessing logic from our 2016 post, to keep the comparison simple – but all preprocessing steps can now be easily customized. There are many improvements we’re looking forward to trying out, including more precise detection of non-compositional noun phrases, and handling other types of multi-word expression, especially phrasal verbs. Lemmatization might also be a useful preprocess, especially when we apply the same idea to languages other than English.