sense2vec reloaded: contextually-keyed word vectors

· by Matthew Honnibal & Ines Montani · ~18 min. read

In 2016 we trained a sense2vec model on the 2015 portion of the Reddit comments corpus, leading to a useful library and one of our most popular demos. That work is now due for an update. In this post, we present a new version of the library, new vectors, new evaluation recipes, and a demo NER project that we trained to usable accuracy in just a few hours.

sense2vec (Trask et. al, 2015) is a twist on the word2vec family of algorithms that lets you learn more interesting word vectors. Before training the model, the text is preprocessed with linguistic annotations, to let you learn vectors for more precise concepts. Part-of-speech tags are particularly helpful: many words have very different senses depending on their part of speech, so it’s useful to be able to query for the synonyms of duck|VERB and duck|NOUN separately. Named entity annotations and noun phrases can also help, by letting you learn vectors for multi-word expressions.

We’ve loved this idea for a long time, so it’s now time to revisit it and take it a little bit further. First, we trained a new sense2vec model on the 2019 Reddit comments, which makes for an interesting contrast to the previous 2015 vectors. We’ve also updated the interactive demo and given the sense2vec library an overdue update, taking advantage of spaCy v2’s pipeline component and extension attribute systems. Finally, we prepared an end-to-end named entity recognition use-case for the technique, to show sense2vec’s practical applications. We used our annotation tool Prodigy to quickly bootstrap a terminology list using the sense2vec vectors, which we then used to produce a rule-based baseline for a new NER task. We then used Prodigy to quickly annotate training and evaluation data, which we then combine with spaCy’s transfer-learning functionality to train a model that achieves 82.1% accuracy.

Exploring the vectors

We’ve had a lot of fun exploring our previous sense2vec model. The annotations make the results much more informative, especially the results for named entities or niche concepts. Contrasting the results between the 2015 and 2019 models provides an interesting new avenue to explore. Four years is a relatively short time on the scale of language change, but it’s easy to find examples of words, topics and entities that had very different associations in 2015.

QueryMost similar 2015Most similar 2019
Billy Ray Cyrus 1Madonna, Kurt Cobain, Michael Jackson, Eminem, TupacLil Nas X, Old Town Road, Post Malone, Lil Pump, Miley Cyrus
Harvey Weinstein 2Martin Scorsese, Adam Sandler, Quentin Tarantino, Johnny DeppKevin Spacey, Bill Cosby, R. Kelly, Bryan Singer
AOC 3BenQ, Monitor, 21.5, 60Hz, monitor-i2269vw, ViewSonicAlexandria Ocasio-Cortez, Bernie Sanders, Nancy Pelosi, Elizabeth Warren
flex 4grip, stiffen, lift, tighteninternet muscles, showoff, brag
ghost 5stealth, insta, ether, blast, teleportunmatch, mutual fade, flake, friendzone
  1. Since Lil Nas X’s “Old Town Road” remix, country singer Billy Ray Cyrus shows up more similar to hip young rappers.
  2. In 2019, Harvey Weinstein was mostly discussed in similar contexts to other famous men accused or convicted of sexual assault.
  3. In 2015, “AOC” mostly referred to “active optical cable” or a Taiwanese electronics company as opposed to U.S. Representative Alexandria Ocasio-Cortez.
  4. “to flex” has become slang for showing off, which is reflected in the vectors.
  5. Interestingly, “to ghost” wasn’t very common in 2015. In 2019, it’s mostly used in the context of cutting off communication by “ghosting”.

Examples like Billy Ray Cyrus show how artists can take on new associations as they reinvent themselves over time. The results also show new word senses, and new public figures such as Alexandria Ocasio-Cortez. Social and political events can also cast public figures in a very different light. Many of the crimes discussed by the #MeToo movement were widely known in 2015, but even figures with several decades of credible accusations were discussed primarily in association with their work, rather than their crimes. The two vector models show several examples where that has changed. It will be interesting to see how the associations for these figures evolve further. Music, films and other art will always prompt new discussion, so the associations may shift back over time.

sense2vec reloaded: the updated library

sense2vec is a Python package to load and query vectors of words and multi-word phrases based on part-of-speech tags and entity labels. It can be used as a standalone library, or as a GitHub repo spaCy pipeline component] with convenient custom attributes and methods for accessing all relevant phrases in a Doc object, or querying vectors and most similar entries for tokens, noun phrases or entity spans. For more examples and the full API documentation, see the GitHub repo.

Standalone usagefrom sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")
query = "natural_language_processing|NOUN"
assert query in s2v
vector = s2v[query]
freq = s2v.get_freq(query)
most_similar = s2v.most_similar(query, n=3)
# [('machine_learning|NOUN', 0.8986967),
#  ('computer_vision|NOUN', 0.8636297),
#  ('deep_learning|NOUN', 0.8573361)]
Usage as a spaCy componentimport spacy
from sense2vec import Sense2VecComponent

nlp = spacy.load("en_core_web_sm")
s2v = Sense2VecComponent(nlp.vocab).from_disk("/path/to/s2v_reddit_2015_md")

doc = nlp("A sentence about natural language processing.")
assert doc[3:6].text == "natural language processing"
freq = doc[3:6]._.s2v_freq
vector = doc[3:6]._.s2v_vec
most_similar = doc[3:6]._.s2v_most_similar(3)
# [(('machine learning', 'NOUN'), 0.8986967),
#  (('computer vision', 'NOUN'), 0.8636297),
#  (('deep learning', 'NOUN'), 0.8573361)]

Evaluating the vectors

Word vectors are often evaluated with a mix of small quantitative test sets, and informal qualitative review. Common test sets include tasks such as analogical reasoning, or human similarity judgments, using a Likert scale. However, established test sets often don’t correspond well to the data being used, or the definition of similarity that the application requires. In practice, most data scientists fall back on the qualitative evaluation: they look at a few example queries, and ship the model if things mostly look okay. While few data scientists would endorse this as best practice, the qualitative evaluation does have important advantages. There are many different definitions of similarity, so whether one model is “better” than another is subjective. However, even if the evaluation is subjective, it’s still good to be a bit more rigorous. With better methodology, you can get more reliable results for the same amount of time spent inspecting the vectors.

Our evaluation workflows are implemented as Prodigy recipes, but you can also implement your own version as a command line script or using a static file. We initially experimented with an evaluation that would pick three random phrases from the most frequent entries and ask whether A was more similar to B or C. If you mostly agree with the model here, we could conclude that the trained vectors are reasonable and representative. It was a nice idea in theory. However, the lack of constraints and randomness produced some very interesting questions that were funny, but ultimately not very useful:

Which one is more similar? Lauren Cohen vs. less cargo vs. GTA5. * vs. : vs. =
Interesting philosophical questions, but not very useful for evaluating vectors

To solve this, we needed to limit the random selection to entries with the same sense and limit the included senses so you’re not just evaluating the similarity of punctuation (is ; more similar to ? or !?) or arbitrary interjections (is rofl more similar to loool or lmao?). Other strategies we implemented only select candidates from the most similar phrases of the target phrase or the first candidate phrase. You can try out the different workflows in Prodigy using the sense2vec.eval recipe with the --strategy option specifying the candidate selection strategy random, most_similar or most_least_similar.

Which one is more similar? Youtube vs. Netflix vs. Reddit. Iran Contra vs. Watergate vs. the War on Drugs.
More useful evaluation limited to candidates of the same and most relevant sense

Instead of evaluating similarity triples, you can also opt for a more direct evaluation strategy. In the sense2vec.eval-most-similar recipe, you’re shown a query word and its nearest neighbors. You can then mark which of the neighbors you consider to be undesirable results. This approach is especially helpful for lower quality models, where there are some results that stand out as clearly in error.

Screenshot of evaluating most similar entries

A/B evaluation

The sense2vec.eval-ab workflow performs an A/B evaluation of two sense2vec vector models by comparing the most similar entries they return for a random phrase. The UI shows two randomized options with the most similar entries of each model and highlights the phrases that differ. At the end of the annotation session the overall stats and preferred model are shown. This is sort of like a blind taste-test: you don’t have to know ahead of time what exactly you’re looking for from the similarity results. It takes little time to compare two models reliably, and you or another person can revisit the judgments later to check agreement.

Screenshot of A/B evaluation
Evaluating the most similar entries of two vector models (differences are highlighted)

We found the A/B evaluation to be the most useful approach when we were developing the vectors for this post. We used it to compare the 2019 and 2015 vector models, and were able to reach a conclusion after only 20 minutes of evaluation. We found that we preferred the 2019 output in 33 cases, the 2015 output in 17 cases, and were indifferent in an additional 51 cases. We had earlier performed a similar evaluation with a draft vectors model trained only on the January 2019 comments, in which we found we decisively preferred the 2015 outputs. We also ran evaluations to decide the choice of algorithm (skipgram with negative samples vs GloVe, preferring the skipgrams output), the window size (5 vs. 15, preferring the narrower window), and character vs word features (choosing to disable FastText’s character ngrams).

Using sense2vec to bootstrap NER models

Let’s say you want to build an application that can detect fashion brands in online comments. Some brand names, like “adidas” or “Louis Vuitton” are pretty unambiguous. Others, like “New Balance” or “The North Face” are trickier, especially if you can’t rely on proper capitalization or correct spelling. To solve this problem, you could train a named entity recognizer to predict fashion brands in context, using labelled examples for training and evaluation and maybe pretrained representations to initialize the model.

The sense2vec library includes Prodigy workflows to speed up your data collection process with sense2vec vectors. Prodigy is a modern, fully-scriptable annotation tool for creating training data for machine learning models. For this example, we used pretrained sense2vec vectors to bootstrap an NER model. In under 2 hours, we created a dataset of 785 training examples and 500 evaluation examples from scratch and trained a model that achieved 82.1% accuracy on the task.

Download code and data

The pattern files, datasets and training and evaluation scripts created for this blog post are available for download in our new projects repo.

View project

Creating a terminology list and rule-based baseline

Even if your end goal is to train a neural network model, it’s often useful to start off with a rule-based baseline that you can evaluate the model against later. Maybe you’re training a model and are getting an accuracy of 85% on your task. That’s not bad. But if you know that a simple gazetteer or a few regular expressions will get you to 92% accuracy on the same dataset, the model looks much less impressive. You probably wouldn’t want to be shipping that as the only end-to-end solution.

One way to quickly set up a rule-based baseline or proof of concept is to create a list of some of the most frequent instances of the entities and then match them in the data. If you have a few examples of the entities you’re looking for, sense2vec can help you find similar terms. All you have to do is select the terms you want to keep and update the query vector with the already selected terms. The sense2vec.teach recipe implements this as Prodigy workflow:

prodigy sense2vec.teach s2v_fashion_brands ./s2v_reddit_md --seeds "adidas, Louis Vuitton, BOSS, New Balance"

Annotation strategy: The brands suggested here are what the vectors model considers most similar to our seed terms. We can then accept, reject or irgnore them. To move through the phrases more quickly, we ignore everything we don’t immediately know – we can always go through them again and add more brands later.The first selection of seed terms produced a very specific subset of brands, so we did a second round with seeds "adidas, Chanel, UGG, Betsey Johnson".

Here’s how it works under the hood. For each seed term, we can call Sense2vec.get_best_sense to find the best-matching sense with the highest frequency. Using Sense2vec.most_similar, we can then query the vectors for the most similar entries compared to the average of the seed vectors:

    s2v = Sense2Vec().from_disk("./s2v_reddit_md")
    seed_keys = [s2v.get_best_sense(seed) for seed in seeds]
    most_similar = s2v.most_similar(seed_keys, n=10)

As we annotate, the answers are sent back to the server and stored in the database. With each new batch of yes/no decisions, we can update our list of query phrases used in the similarity calculation, so the suggestions we see stay consistent with our previous decisions. Once you get into a good rhythm, annotation is super quick – in about 10 minutes, we were able to annotate 145 phrases in total and create a word list of 100 fashion brands.

To extract brand names from our raw data, we can now convert the word list to match patterns. If you’ve used spaCy’s rule-based Matcher] before, you’re probably familiar with the syntax: each pattern consists of a list of dictionaries describing the individual tokens and their attributes. The helper converts the previously created dataset to a case-insensitive (or optionally case-sensitive) JSONL file.

fashion_brand_patterns.jsonl (excerpt){"label":"FASHION_BRAND","pattern":[{"lower":"fruit"},{"lower":"of"},{"lower":"the"},{"lower":"loom"}]}

The EntityRuler is the most convenient way to use match patterns for rule-based named entity recognition. It can be loaded from a JSONL and added to the processing pipeline. The matched spans are then added to the doc.ents.

    import spacy
    from spacy.pipeline import EntityRuler

    nlp = spacy.blank("en")
    ruler = EntityRuler(nlp).from_disk("./fashion_brands_patterns.jsonl")

    doc = nlp("I like Ralph Lauren and UNIQLO.")
    print([(ent.text, ent.label_) for ent in doc.ents])
    # [('Ralph Lauren', 'FASHION_BRAND'), ('UNIQLO', 'FASHION_BRAND')]

Depending on your use case, you might even choose to stop right here, or simply continue to build a larger terminology list. Many entity recognition problems have a fairly limited set of entities, with perhaps a few thousand different names. Gathering these entities as a list is a fine strategy to recognise them in running text – you don’t necessarily need a statistical model. Even if the entities are potentially ambiguous, like “Amazon”, you can often train a text classifier to recognise your domain, and then apply your rule-based entity approach. This is often both cheaper to annotate and more accurate, as you’re basing the statistical decision on a much wider context. However, with social media text like Reddit, rule-based approaches will usually hit a lower accuracy limit. In cases like this, the initial rule-based approach can be used as part of an annotation process, to quickly collect training and evaluation data so you can move forward with a statistical approach.

Bootstrapping annotation with patterns

The entity ruler and its patterns serialize with the nlp object when you call nlp.to_disk. So you can export and package the model to use it in your application, or load it into Prodigy’s ner.make-gold recipe, which streams in text, pre-highlights the doc.ents and lets you correct them. You could also use the EntityRuler to pre-label data in spaCy, export it and then correct it using any other tool or process. Even if your rules only get 50% of the entities right, that’s still a lot less work for you.

The text in raw_data.jsonl was extracted from Reddit and includes comments from two popular fashion subreddits, r/MaleFashionAdvice and r/FemaleFashionAdvice.

prodigy ner.make-gold ner_fashion_brands ./rule-based-model ./raw-data.jsonl --loader reddit --label FASHION_BRAND --unsegmented

Annotation strategy: The incoming texts are processed by the rule-based model and all spans present in the doc.ents (in this case, pattern matches) are pre-highlighted in the UI. We can then correct them manually by adding brands our rules missed, or removing false positives.Many comments do not contain any brands, but they should still accepted and added them to the dataset. It’s just as important for the model to learn what’s not an entity.We’re ignoring auto-replies and bot messages (e.g. notice about a deleted comment) to exclude those texts from the data.


In about 2 hours we worked through 2,516 texts and created 1,735 annotations (1,235 training examples, 500 evaluation examples). As this was a quick project, and we didn’t collect separate development and evaluation sets, we performed only a limited set of experiments, and avoided detailed hyperparameter searches. We’ve limited our experiments to spaCy, but you can use the annotations in any other NER system instead. If you run the experiments, please let us know!

AccuracyWords / Sec 1Data
Rule-based baseline48.496.332.4130k130k0~15m
spaCy NER: blank65.777.357.113k72k1235~2h
spaCy NER: vectors473.481.566.813k72k1235~2.5h
spaCy NER: vectors4 + tok2vec582.183.580.75k68k1235~2.5h
  1. Words per second measured on 300k words.
  2. Number of examples used for training.
  3. Total time spent on annotation and training, excl. pretraining vectors and representations.
  4. Common Crawl GloVe vectors.
  5. Trained on 1 billion words from Reddit comments using spacy pretrain predicting the GloVe vectors (~8 hours on GPU).

With default settings and no transfer learning, spaCy achieves an F-score of 65.7. On a small dataset like this, we recommend using pretrained word vectors, and using them in combination with the spacy pretrain command. spaCy makes both techniques quite easy to use, and together they improve the F-score to 82.05 on this data. This matches up well with our general experience: we’ve found that spacy pretrain usually improves accuracy by about as much as pretrained word vectors. When the training corpus is large (over one million words), the impact from both techniques is usually small – less than 1 F-score. But on smaller datasets, it’s not uncommon to see the 50% error reduction demonstrated here.

The spacy pretrain command implements a cloze-style language modelling objective, similar to the one introduced by Devlin et al. (2018). However, we have an additional innovation to make the objective work well for spaCy’s small architectures: instead of predicting the exact word ID, we predict the word’s vector, using a static embeddings table (we use the GloVe vectors for this, with a cosine loss). We ran spacy pretrain for one epoch on about 1 billion words of Reddit comments, which completed in roughly 8 hours on a single GTX 2080 Ti. spaCy’s default token-vector encoding settings are a depth 4 convolutional neural network with width 96, and hash embeddings with 2000 rows. These settings result in a very small model: the trainable weights are only 3.8 MB in total, including the word vectors. To take better advantage of the pretraining, we increased these slightly, to depth 8, width 128 and 10,000 rows. The larger model is still quite small: only 17.8 MB. Here’s the exact command we ran:

spacy pretrain /path/to/data.jsonl en_vectors_web_lg /path/to/output
--batch-size 3000 --max-length 256 --depth 8 --embed-rows 10000
--width 128 --use-vectors -gpu 0

We hope that the resources we’ve presented in this post will be useful to both researchers and practictioners, especially the library and the trained vector models. There’s also a lot further the idea could be taken. We mostly replicated the preprocessing logic from our 2016 post, to keep the comparison simple – but all preprocessing steps can now be easily customized. There are many improvements we’re looking forward to trying out, including more precise detection of non-compositional noun phrases, and handling other types of multi-word expression, especially phrasal verbs. Lemmatization might also be a useful preprocess, especially when we apply the same idea to languages other than English.

GitHub repo sense2vec demo Prodigy

About the authors

  • Matthew Honnibal CTO, Founder

    • Berlin, Germany
  • Ines Montani CEO, Founder

    • Berlin, Germany