A high-coverage word embedding table will usually be quite large. One million 32-bit floats occupies 4MB of memory, so one million 300-dimension vectors will be 1.2GB in size. Such a large model size is at least annoying for many applications, while for others it’s completely prohibitive.
There are three obvious approaches to reducing the size of the embedding table:
- Reduce the number of words in the vocabulary.
- Reduce the number of dimensions per vector.
- Reduce the number of bits per dimension.
While all three of these options can be effective, there’s also a less obvious solution:
- Cheat, using a probabilistic data structure.
Probabilistic data structures are a natural fit for machine learning models, so they’re quite widely used. However, they’re definitely unintuitive, which is why we refer to this solution as a “cheat”. We’ll start by introducing the full algorithm, without dwelling too long on why it works. We’ll then go back and fill in more of the intuition, and then describe how we use it in practice in Thinc, spaCy and floret.
The Bloom embeddings algorithm
In a normal embedding table, each word-string is mapped to a distinct ID. Usually these IDs will be sequential, so if you have a vocabulary of 100 words, your words will be mapped to the numbers `range(100)`. The sequential IDs can then be used as indices into an embedding table: if you have 100 words in your vocabulary, you have 100 rows in the table, and each word receives its own vector.
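As a minimal sketch with a hypothetical three-word vocabulary (the names here are ours, purely for illustration), the conventional lookup is just indexing:

```python
import numpy

vocab = {"the": 0, "cat": 1, "sat": 2}  # hypothetical tiny vocabulary
table = numpy.random.uniform(-0.1, 0.1, (len(vocab), 4))

def embed(word):
    # The sequential ID doubles as the row index; unseen words raise a KeyError.
    return table[vocab[word]]
```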
However, there’s no limit to the number of unique words that might occur in a sample of text, while we definitely want a limited number of rows in our embedding table. Some of the rows in our table will therefore need to be shared between multiple words in our vocabulary. One obvious solution is to set aside a single vector in the table. Words 0-98 will each receive their own vector, while all other words are assigned to vector 99.
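As a sketch (the helper name is ours), that mapping looks like:

```python
def get_row_single_oov(word_id, number_vector=100):
    # Words 0-98 keep their own row; everything else shares row 99.
    return word_id if word_id < number_vector - 1 else number_vector - 1
```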
However, this asks vector 99 to do a lot of work. What if we gave more vectors to the unknown words?
Instead of only one unknown word vector, we could have 10: words 0–89 would be assigned their own vector as before. However, all other words would be randomly mapped to the remaining 10 vectors:
```python
def get_row(word_id, number_vector=100, number_oov=10):
    number_known = number_vector - number_oov
    if word_id < number_known:
        return word_id
    else:
        return number_known + (word_id % number_oov)
```
This gives the model a little more resolution for the unknown words. If all out-of-vocabulary words are assigned the same vector, then they’ll all look identical to the model. Even if the training data actually includes information that shows two different out-of-vocabulary words have important, different implications – for instance, if one word is a strong indicator of positive sentiment, while the other is a strong indicator of negative sentiment – the model won’t be able to tell them apart. However, if we have 10 buckets for the unknown words, we might get lucky, and assign these words to different buckets. If so, the model would be able to learn that one of the unknown-word vectors makes positive sentiment more likely, while the other vector makes negative sentiment more likely.
If this is good, then why not do more of it? Bloom embeddings are like an extreme version, where every word is handled like the unknown words above: there are 100 vectors for the “unknown” portion, and 0 for the “known” portion.
So far, this approach seems weird, but not necessarily good. The part that makes it unfairly effective is the next step: by simply doing the same thing multiple times, we can greatly improve the resolution, and have unique representations for far more words than we have vectors. The code in full:
```python
import numpy
import mmh3


def allocate(n_vectors, n_dimensions):
    # Start from a small random table, just like a normal embedding table.
    table = numpy.zeros((n_vectors, n_dimensions), dtype="f")
    table += numpy.random.uniform(-0.1, 0.1, table.size).reshape(table.shape)
    return table


def get_vector(table, word):
    # Hash the word twice with different seeds and sum the two rows.
    hash1 = mmh3.hash(word, seed=0)
    hash2 = mmh3.hash(word, seed=1)
    row1 = hash1 % table.shape[0]
    row2 = hash2 % table.shape[0]
    return table[row1] + table[row2]


def update_vector(table, word, d_vector):
    # Both rows receive the word's gradient.
    hash1 = mmh3.hash(word, seed=0)
    hash2 = mmh3.hash(word, seed=1)
    row1 = hash1 % table.shape[0]
    row2 = hash2 % table.shape[0]
    table[row1] -= 0.001 * d_vector
    table[row2] -= 0.001 * d_vector
```
In this example, we’ve used two keys, assigned from two random hash functions. It’s unlikely that two words will collide on both keys, so by simply summing the vectors together, we’ll assign most words a unique representation.
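For instance, the functions above can be exercised like this (the gradient here is just a made-up placeholder):

```python
table = allocate(1000, 32)
vector = get_vector(table, "strawberry")
d_vector = numpy.ones(32, dtype="f")  # stand-in for a real gradient
update_vector(table, "strawberry", d_vector)
```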
For the sake of illustration, let’s step through a very small example, explicitly.
Let’s say we have this vocabulary of 20 words:
```python
vocab = ['apple', 'strawberry', 'orange', 'juice', 'drink', 'smoothie',
         'eat', 'fruit', 'health', 'wellness', 'steak', 'fries', 'ketchup',
         'burger', 'chips', 'lobster', 'caviar', 'service', 'waiter', 'chef']
```
We’ll embed these into two dimensions. Normally this would give us a table of `(20, 2)` floats, which we would randomly initialise. With the hashing trick, we can make the table smaller. Let’s give it 15 vectors:
```python
normal = numpy.random.uniform(-0.1, 0.1, (20, 2))
hashed = numpy.random.uniform(-0.1, 0.1, (15, 2))
```
In the normal table, we want to map each word in our vocabulary to its own vector:
```python
word2id = {}

def get_normal_vector(word, table):
    if word not in word2id:
        word2id[word] = len(word2id)
    return table[word2id[word]]
```
The hashed table only has 15 rows, so some words will have to share. We’ll handle this by mapping each word to an arbitrary integer, called a “hash value”. The hash function will return an arbitrary integer, which we’ll mod into the range `(0, 15)`. Importantly, we need to be able to compute multiple, distinct hash values for each key, so Python’s built-in hash function is inconvenient. We’ll therefore use MurmurHash.
Let’s see what keys we get for our 20 vocabulary items, using MurmurHash:
```python
hashes1 = [mmh3.hash(w, 1) % 15 for w in vocab]
assert hashes1 == [3, 6, 4, 13, 8, 3, 13, 1, 9, 12, 11, 4, 2, 13, 5, 10, 0, 2, 10, 13]
```
As you can see, some keys are shared between multiple words, while 2/15 keys are unoccupied. This is obviously not ideal! If multiple words have the same key, they’ll map to the same vector: as far as the model is concerned, “apple” and “smoothie” will be indistinguishable. It won’t be clear which word was used; they have the same representation.
To address this, we simply hash the words again, this time using a different seed – so that we get a different set of arbitrary keys:
```python
from collections import Counter

hashes2 = [mmh3.hash(w, 2) % 15 for w in vocab]
assert len(Counter(hashes2).most_common()) == 12
```
This one’s even worse – 3 keys unoccupied! But our strategy is not to keep drawing until we get a favorable seed. Instead, consider this:
```python
assert len(Counter(zip(hashes1, hashes2))) == 20
```
By combining the results from the two hashes, our 20 words distribute perfectly, into 20 unique combinations. This makes sense: we expect to have some words overlapping on one of the keys, but we’d have to be very unlucky for a pair of words to overlap on both keys.
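To make the “very unlucky” part concrete: with 15 rows per key, any particular pair of words has roughly a 1-in-225 chance of sharing both keys. We can spell out that none of our 190 pairs does, which is just another view of the assert above:

```python
from itertools import combinations

keys = list(zip(hashes1, hashes2))
double_collisions = [
    (vocab[i], vocab[j])
    for i, j in combinations(range(len(vocab)), 2)
    if keys[i] == keys[j]
]
assert double_collisions == []  # no pair shares both keys
```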
This means that if we simply add the two vectors together, each word once more has a unique representation:
```python
for word in vocab:
    key1 = mmh3.hash(word, 0) % 15
    key2 = mmh3.hash(word, 1) % 15
    vector = hashed[key1] + hashed[key2]
    print(word, '%.3f %.3f' % tuple(vector))
```
```
apple 0.161 0.163
strawberry 0.128 -0.024
orange 0.157 -0.047
juice -0.017 -0.023
drink 0.097 -0.124
smoothie 0.085 0.024
eat 0.000 -0.105
fruit -0.060 -0.053
health 0.166 0.103
wellness 0.011 0.065
steak 0.155 -0.039
fries 0.157 -0.106
ketchup 0.076 0.127
burger 0.045 -0.084
chips 0.082 -0.037
lobster 0.138 0.067
caviar 0.066 0.098
service -0.017 -0.023
waiter 0.039 0.001
chef -0.016 0.030
```
We now have a function that maps our 20 words to 20 unique vectors – but we’re storing weights for only 15 vectors in memory. Now the question is: will we be able to find values for these weights that let us actually map words to useful vectors?
Let’s do a quick experiment to see how this works. We’ll assign “true” values for our little vocabulary, and see how well we can approximate them with our compressed table. To get the “true” values, we could put the “science” in data science, and drag the words around into reasonable-looking clusters. But for our purposes, the actual “true” values don’t matter. We’ll therefore just do a simulation: we’ll assign random vectors as the “true” state, and see if we can learn values for the hash embeddings that match them.
The learning procedure will be a simple stochastic gradient descent:
```python
import numpy
import mmh3

numpy.random.seed(0)

nb_epoch = 20
learn_rate = 0.001
nr_hash_vector = 15

words = [str(i) for i in range(20)]
true_vectors = numpy.random.uniform(-0.1, 0.1, (len(words), 2))
hash_vectors = numpy.random.uniform(-0.1, 0.1, (nr_hash_vector, 2))
examples = list(zip(words, true_vectors))

for epoch in range(nb_epoch):
    numpy.random.shuffle(examples)
    loss = 0.0
    for word, truth in examples:
        key1 = mmh3.hash(word, 0) % nr_hash_vector
        key2 = mmh3.hash(word, 1) % nr_hash_vector
        hash_vector = hash_vectors[key1] + hash_vectors[key2]
        diff = hash_vector - truth
        hash_vectors[key1] -= learn_rate * diff
        hash_vectors[key2] -= learn_rate * diff
        loss += (diff**2).sum()
    print(epoch, loss)
```
It’s worth taking some time to play with this simulation. You can start by doing some sanity checks:

- How does the loss change with `nr_hash_vector`? (See the sketch after this list for one way to check.)
- If you remove `key2`, does the loss go up?
- What happens if you add more hash keys?
- What happens as the vocabulary size increases?
- What happens when more dimensions are added?
- How sensitive are the hash embeddings to the initial conditions? If we change the random seed, do we ever get unlucky?
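For instance, one way to explore the first question is to wrap the simulation above in a function and sweep the table size. The `final_loss` helper below is just a throwaway sketch of ours, not part of any library:

```python
import numpy
import mmh3

def final_loss(nr_hash_vector, nr_words=20, nr_dim=2, nb_epoch=20,
               learn_rate=0.001, seed=0):
    # Re-run the simulation above for a given table size and return the
    # summed squared error from the last epoch.
    numpy.random.seed(seed)
    words = [str(i) for i in range(nr_words)]
    true_vectors = numpy.random.uniform(-0.1, 0.1, (nr_words, nr_dim))
    hash_vectors = numpy.random.uniform(-0.1, 0.1, (nr_hash_vector, nr_dim))
    examples = list(zip(words, true_vectors))
    loss = 0.0
    for epoch in range(nb_epoch):
        numpy.random.shuffle(examples)
        loss = 0.0
        for word, truth in examples:
            key1 = mmh3.hash(word, 0) % nr_hash_vector
            key2 = mmh3.hash(word, 1) % nr_hash_vector
            diff = (hash_vectors[key1] + hash_vectors[key2]) - truth
            hash_vectors[key1] -= learn_rate * diff
            hash_vectors[key2] -= learn_rate * diff
            loss += (diff**2).sum()
    return loss

for size in (5, 10, 15, 20, 50):
    print(size, final_loss(size))
```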
If you play with the simulation for a while, you’ll start to get a good feel for the dynamics, and hopefully you’ll have a clear idea of why the technique works.
HashEmbed: Bloom embeddings in Thinc and spaCy
spaCy’s `MultiHashEmbed` and `HashEmbedCNN` use the `HashEmbed` layer from Thinc to construct small vector tables with Bloom embeddings for spaCy’s CNN pipelines like `en_core_web_sm`.

`HashEmbed` uses MurmurHash to hash a 64-bit key, which is typically a value from the `StringStore`, to four rows in a small hash table. The final embedding is the sum of the four rows from the table.
HashEmbed Vectors
```python
import numpy
from thinc.api import get_current_ops
from spacy.strings import StringStore

vocab = ['apple', 'strawberry', 'orange', 'juice']
seed = 0
nr_hash_vector = 15

numpy.random.seed(seed)
hash_vectors = numpy.random.uniform(-0.1, 0.1, (nr_hash_vector, 2))

strings = StringStore()
orths = numpy.asarray([strings[w] for w in vocab], dtype="uint64")
ops = get_current_ops()
# Ops.hash(): four hashes per item using MurmurHash
rows = ops.hash(orths, seed) % nr_hash_vector
vocab_vectors = hash_vectors[rows].sum(axis=1)
```
| Word | ORTH (`uint64`) | Hashed Rows | Vectors |
| --- | --- | --- | --- |
| apple | 8566208034543834098 | 6, 4, 11, 14 | 0.103 0.101 |
| strawberry | 11202628424926476707 | 5, 3, 2, 11 | 0.023 0.169 |
| orange | 2208928596161743350 | 5, 6, 4, 11 | 0.157 0.124 |
| juice | 5041695539596503283 | 14, 6, 5, 9 | 0.132 0.148 |
If there are multiple attributes such as `[ORTH, PREFIX, SUFFIX, SHAPE]`, `MultiHashEmbed` uses a different seed for each attribute and then concatenates the vectors for each individual attribute to create the context-independent `Tok2Vec` embeddings.
As an example, let’s take a look at the config for `MultiHashEmbed`:

```ini
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["ORTH","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
```
There are four attributes, with 5000 rows for representing `ORTH`, 2500 for `PREFIX`, etc. The entire embedding table contains 12,500 96-dimension vectors, which takes up less than 5MB.
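As a quick back-of-the-envelope check of that size, assuming 4-byte float32 values:

```python
# 12,500 vectors x 96 dimensions x 4 bytes per float32
rows = [5000, 2500, 2500, 2500]
width = 96
print(sum(rows) * width * 4 / 1e6)  # 4.8 (MB)
```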
And in terms of the model size, there’s yet another advantage to relying on hash functions to map strings to rows: given the same hash function, the same string always maps to the same hash, so the model does not even need to store a list of known vocab items like it would for a traditional sequentially-numbered vocab. Any arbitrary string can be mapped into this table, and the model does not require a stored vocab.
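For example, reusing the toy `allocate` and `get_vector` functions from earlier, any string at all can be embedded without consulting a vocabulary:

```python
table = allocate(n_vectors=100, n_dimensions=32)
# No lookup table and no unknown-word handling: the hash function stands in
# for the vocab, so unseen strings still get a representation.
vector = get_vector(table, "a-string-the-model-has-never-seen")
```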
floret: Bloom embeddings for fastText
floret is an extended version of fastText that uses Bloom embeddings to create compact vector tables with both word and subword information.
fastText uses character n-gram subwords to be able to provide vectors for any possible word. A word’s vector is the average of the vector for the full word (if available) and the vectors for all of its subwords. For example, the vector for `apple` with 4-gram subwords would be the average of the vectors for the following strings (`<` and `>` are added as word boundary characters):
```
<apple>
<app appl pple ple>
```
fastText also supports a range of n-gram sizes, so with 4-grams through 6-grams, you’d have:
```
<apple>
<app appl pple ple>
<appl apple pple>
<apple apple>
```
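For reference, a small helper of our own (not fastText’s internal code) reproduces these subword lists; the full word `<apple>` is handled separately:

```python
def char_ngrams(word, minn, maxn):
    # fastText-style boundary markers around the word.
    bounded = f"<{word}>"
    return [
        bounded[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(bounded) - n + 1)
    ]

print(char_ngrams("apple", 4, 6))
# ['<app', 'appl', 'pple', 'ple>', '<appl', 'apple', 'pple>', '<apple', 'apple>']
```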
By using subwords, fastText models can provide useful vectors for previously unseen tokens like `appletrees` by using subwords like `<appl` and `tree`. Instead of having a single `UNK` vector, fastText models with subwords can produce better representations for infrequent, novel or noisy words.
Internally, fastText stores the word and subword vectors in two separate tables. The word table contains a fixed vocab of words corresponding to tokens above a minimum count in the training data, typically 1–2M words. For subwords, fastText runs into the same issue as we did for our original vector example: it wants to store vectors for a very large number of subwords in a fixed-size table. fastText uses the same trick described above, treating the whole table as the “unknown” portion: each subword is hashed into a large table. Since there are typically a very large number of possible subwords (for just 26 lowercase letters there are nearly 12M possible 5-grams), the default hash table size is relatively large at 2M rows. (As a side note, the huge subword table is why fastText `.bin` files are so large even for small vocabs.)
The exported `.vec` table that you can import into a spaCy pipeline only includes the known words, so while the fastText training takes subwords into account, the spaCy pipeline vectors still end up restricted to a fixed list of known vocab words. spaCy could potentially support both the word and subword tables, but with 2M word vectors and 2M subword vectors at 300 dimensions, you’re looking at 4GB+ of data, which is definitely prohibitive in many cases.
Here is where Bloom embeddings come into play: with the hashing trick, it’s possible to both greatly reduce the size of the vector table and also support subwords. We can have performant vectors with subword support in under 100MB with just two small changes to fastText:

- store both word and subword vectors in the same hash table
- hash each entry into more than one row
floret extends fastText to implement these two options. In floret mode, the hashing algorithm switches from fastText’s default hash function to MurmurHash for easy integration with Thinc and spaCy. Because each token can be broken down into a large number of subwords, using four hashes per entry as in `HashEmbed` would mean that a long word might correspond to 100+ rows in the table, which can become slow and unwieldy. In practice, just two hashes perform well in medium-sized tables (50–200K rows), and they’re faster in both training and inference.
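Putting the two changes together, here is a minimal sketch of the floret-style lookup (ours, not floret’s actual implementation): words and subwords share one table, each entry sums two hashed rows, and a word’s vector averages the full word with its subwords.

```python
import mmh3
import numpy

n_rows, n_dims = 50_000, 300
table = numpy.random.uniform(-0.1, 0.1, (n_rows, n_dims))

def entry_vector(entry, n_hashes=2):
    # Each entry (word or subword) is the sum of two hashed rows, Bloom-style.
    rows = [mmh3.hash(entry, seed) % n_rows for seed in range(n_hashes)]
    return table[rows].sum(axis=0)

def word_vector(word, minn=4, maxn=6):
    # Average the full word with its character n-gram subwords, all drawn
    # from the same shared table.
    bounded = f"<{word}>"
    subwords = [
        bounded[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(bounded) - n + 1)
        if n < len(bounded)
    ]
    pieces = [bounded] + subwords
    return numpy.mean([entry_vector(p) for p in pieces], axis=0)

vector = word_vector("appletrees")
```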
With subword support in spaCy vectors, you can see improvements for misspellings and noisy data, for long compounds in languages like German, and across the board for agglutinative languages like Finnish, Korean or Turkish. Even with a floret vector table with 50K entries, which is trained on a relatively small amount of text, you can see large improvements in performance.
fastText vs. floret vectors for Korean
A demo example for Korean with 3GB training text, 50K floret vectors vs. 800K keys for the standard vectors:
| Vectors | TAG | POS | DEP UAS | DEP LAS |
| --- | --- | --- | --- | --- |
| none | 72.5 | 85.3 | 74.0 | 65.0 |
| standard (pruned: 50K vectors for 800K keys) | 77.3 | 89.1 | 78.2 | 72.2 |
| standard (unpruned: 800K vectors/keys) | 79.0 | 90.3 | 79.4 | 73.9 |
| floret (minn 2, maxn 3; 50K vectors, no OOV) | 82.8 | 94.1 | 83.5 | 80.5 |
You can try out floret in several demo projects:
- `pipelines/floret_vectors_demo`: train and import toy English vectors
- `pipelines/floret_wiki_oscar_vectors`: train vectors on Wikipedia and OSCAR for any supported language
- `pipelines/floret_fi_core_demo`: train floret vectors on OSCAR and compare standard vectors vs. floret vectors on UD Finnish TDT and the Turku NER Corpus
- `pipelines/floret_ko_ud_demo`: train floret vectors on OSCAR and compare standard vectors vs. floret vectors on UD Korean Kaist
Coming soon: exploring the advantages of floret for noisy data, novel words, rich morphology and more!