Semantic Analysis of the Reddit Hivemind

We parsed every comment posted to Reddit in 2015 and 2019, and trained a separate word2vec model for each year. A lot has happened in the four years between them, so many words, people and events have picked up different associations. You can also search for a phrase that's more than the sum of its parts to see what the model thinks it means. Try your favourite band, slang words, technical terms, or something totally random.

How does this work?

We used spaCy to tag and parse comments posted to Reddit in 2015 and 2019, then trained word vectors on words and phrases combined with their part-of-speech tags and entity labels, which gives each entry a more precise context. This makes it possible to query synonyms of duck|VERB and duck|NOUN separately, and to get meaningful vectors for multi-word expressions.
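To make the key format concrete, here is a minimal sketch of how a word or phrase and its tag could be joined into a single lookup key such as duck|VERB or natural_language_processing|NOUN. The helper names make_key and split_key are hypothetical and written for illustration; they only mirror the convention visible in the keys on this page and are not taken from the library's source.

```python
import re
from typing import Tuple


def make_key(phrase: str, tag: str) -> str:
    """Join a word or phrase with its tag, e.g. ("duck", "VERB") -> "duck|VERB"."""
    # Assumption: whitespace inside multi-word expressions becomes underscores,
    # matching keys like "natural_language_processing|NOUN".
    return re.sub(r"\s+", "_", phrase.strip()) + "|" + tag


def split_key(key: str) -> Tuple[str, str]:
    """Split a key back into its phrase and tag."""
    # Split on the last "|" so that only the trailing tag is separated off.
    phrase, tag = key.rsplit("|", 1)
    return phrase.replace("_", " "), tag


if __name__ == "__main__":
    print(make_key("duck", "VERB"))
    print(make_key("natural language processing", "NOUN"))
    print(split_key("duck|NOUN"))
```

Because the tag is part of the key, the verb and noun senses of "duck" end up as two distinct entries, each with its own vector.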

Read the blog post

Try sense2vec

The sense2vec library is a Python implementation for loading and querying sense2vec models. It can be used as a standalone module, or as a spaCy pipeline component.

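from sense2vec import Sense2Vec

# Load the pretrained 2015 vectors from a local directory
s2v = Sense2Vec().from_disk("./s2v_reddit_2015_md")
# Look up the vector for a multi-word phrase tagged as a noun
vector = s2v["natural_language_processing|NOUN"]
# Find the 10 entries most similar to the verb sense of "duck"
most_similar = s2v.most_similar("duck|VERB", n=10)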
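For intuition about what a similarity query does, the sketch below ranks entries in a tiny hand-made vector table, assuming cosine similarity as the ranking measure. The vectors, the TOY_VECTORS table and the toy_most_similar function are all invented for illustration; this is not the library's implementation, and the real models have far more entries and dimensions.

```python
import math

# Toy 3-dimensional vectors, made up for this example only.
TOY_VECTORS = {
    "duck|NOUN": [0.9, 0.1, 0.0],
    "goose|NOUN": [0.8, 0.2, 0.1],
    "duck|VERB": [0.0, 0.9, 0.3],
    "dodge|VERB": [0.1, 0.8, 0.4],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_most_similar(key, n=10):
    """Return up to n (key, score) pairs, most similar first, excluding the query."""
    query = TOY_VECTORS[key]
    scored = [
        (other, cosine(query, vec))
        for other, vec in TOY_VECTORS.items()
        if other != key
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


if __name__ == "__main__":
    for other, score in toy_most_similar("duck|VERB", n=2):
        print(other, round(score, 3))
```

In this toy table the verb sense of "duck" lands closest to "dodge|VERB" and the noun sense closest to "goose|NOUN", which is the kind of separation that tagging each word with its part of speech is meant to provide.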