
The Tale of Bloom Embeddings and Unseen Entities

Bloom embeddings provide a powerful way to represent a large number of tokens in a memory-efficient way. To put them to the test, we ran an apples-to-apples comparison between them and traditional embeddings on a number of named entity recognition data sets in multiple languages. We wrote a technical report about our experiments, and in this blog post we’ll highlight some of the results, with a special focus on the elephant in the NER room: entities not present in the training set, which we refer to as unseen entities.

If you are mainly here in search of goodies, you can go ahead and check out the span-labeling-datasets. It provides commands to download and preprocess a bunch of data sets to be used with the ner and spancat components. From there, you can take a look at the ner-embeddings project that lets you run all the experiments we did for the technical report with ner! We’ll go through the features in these projects in this post.

Bloom Embeddings Warmup

To start, we’ll quickly introduce the embeddings architecture in spaCy. For a more in-depth explanation, you can check out our blogpost, or if you’re already familiar with bloom embeddings, feel free to continue onto the next section.

Traditional embeddings dedicate a vector to each unique symbol in a vocabulary. Typically, the vocabulary is something like a Python dictionary Dict[str, int] that maps tokens to integers. These integers are used to index into a vector table E that has len(vocabulary) rows, one for each word in the vocabulary. All tokens we do not have a vector for are mapped to the same unknown vector, for example, the 0-th one: E[vocabulary.get(token, 0)].
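To make this concrete, here is a minimal sketch of a traditional embedding lookup with a toy vocabulary (the tokens and the table width are made up for illustration):

```python
import numpy as np

# Toy vocabulary: index 0 is reserved for the unknown token.
vocabulary = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
width = 4
E = np.random.random((len(vocabulary), width))

def embed(token: str) -> np.ndarray:
    # Every out-of-vocabulary token falls back to the 0-th vector.
    return E[vocabulary.get(token, 0)]

in_vocab = embed("cat")      # row 2 of E
out_of_vocab = embed("dog")  # same vector as embed("<unk>")
```

Note that every unseen token collapses onto the same unknown vector, which is exactly the limitation the hashing trick below works around.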

The goal of Bloom embeddings is to reduce the number of rows in E, thereby decreasing the memory footprint of the embedding matrix. To do so, we borrow a trick from Bloom filters (hence the name 🌸). Instead of looking up a single vector, we hash each symbol four times and sum the resulting embeddings. Assuming an embedding matrix B, the Bloom embedding for a token is computed as:

height, width = B.shape
result = np.zeros((width,))
for hash_func in hash_funcs:
    # Each hash function maps the token to one of B's rows.
    idx = hash_func(token) % height
    result += B[idx]

That’s it! This way, we reduce the number of rows in the table while still having a very high probability of computing a unique vector for each token. For an analysis of the collision probabilities, please check out Section 3.2 in the technical report.
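The loop above can be turned into a self-contained sketch. The four hash functions here are derived from hashlib with different salts purely for illustration; spaCy itself uses MurmurHash, and the table size below is arbitrary:

```python
import hashlib
import numpy as np

def make_hash(seed: int):
    # Derive a deterministic 64-bit hash function with a per-seed salt.
    def h(token: str) -> int:
        digest = hashlib.md5(f"{seed}:{token}".encode()).digest()
        return int.from_bytes(digest[:8], "little")
    return h

hash_funcs = [make_hash(seed) for seed in range(4)]
height, width = 1000, 96  # far fewer rows than unique tokens
B = np.random.random((height, width))

def bloom_embed(token: str) -> np.ndarray:
    # Sum one row of B per hash function.
    result = np.zeros((width,))
    for h in hash_funcs:
        result += B[h(token) % height]
    return result
```

With four independent hashes, two tokens only share an embedding if all four row indices collide, which is why the table can stay small.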

Normal vs. hashed embeddings

Besides the hashing trick, another peculiarity of the embedding layers in spaCy is that they do not embed the raw orthographic forms of the tokens themselves, but rather the combination of their normalized form (NORM), first character (PREFIX), last four characters (SUFFIX) and shape features (SHAPE). The full embedding architecture is called MultiHashEmbed in spaCy.
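As a rough illustration, these features can be approximated in plain Python. These are only approximations: spaCy computes the real attributes internally, NORM additionally uses language-specific lookup tables, and the real SHAPE truncates long character runs:

```python
import re

def orthographic_features(token: str) -> dict:
    # Rough approximations of spaCy's NORM/PREFIX/SUFFIX/SHAPE attributes.
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return {
        "NORM": token.lower(),  # spaCy also applies lookup tables
        "PREFIX": token[:1],    # first character
        "SUFFIX": token[-4:],   # last four characters
        "SHAPE": shape,         # character-class sketch of the token
    }

print(orthographic_features("spaCy"))
# {'NORM': 'spacy', 'PREFIX': 's', 'SUFFIX': 'paCy', 'SHAPE': 'xxxXx'}
```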

MultiHashEmbed embeddings

Custom embedding layers

For an apples-to-apples comparison, the ner-embeddings project includes the MultiEmbed architecture: a version of MultiHashEmbed that keeps all of its orthographic features but replaces the Bloom embeddings with traditional embeddings.

If you are interested in trying them out, you first have to run the make-tables command over your data sets to create the lookup tables. It is part of the setup workflow, which you can run with python -m spacy project run setup. Once the tables are created, MultiEmbed can be included in a pipeline by adding it as the embedder in the config:

@architectures = "spacy.MultiEmbed.v1"
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
width = ${components.tok2vec.model.encode.width}
include_static_vectors = true
unk = 0

You also have to tell spaCy to include the mapping tables in the model before initializing it:

@callbacks = "set_attr"
path = ${paths.tables}
component = "tok2vec"
layer = "multiembed"
attr = "tables"

The set_attr callback is defined in set_attr.py and can be used to inject data into spaCy components after their creation but before they get initialized.

The ner-embeddings project also includes yet another embedding layer variant called MultiFewerHashEmbed, which we used to run experiments varying the number of hash functions available to the embedding layer. We do not include these results here in the post, but you can check them out in Section 5.4 of the technical report.

Running experiments conveniently

All performance metrics reported here and in the technical report are averages of runs with three random seeds. The randomness enters the training process of deep learning architectures through random initialization, random generation of dropout masks, random ordering of the data and more. We’ve found the variance to be quite low. Nevertheless, to increase the robustness of the results, we recommend using multiple seeds when possible.
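Aggregating over seeds is simple to do yourself; here is a minimal sketch with hypothetical F1 scores from three runs:

```python
import statistics

# Hypothetical F1 scores from three runs with different random seeds.
f1_scores = [0.831, 0.827, 0.834]
mean = statistics.mean(f1_scores)
spread = statistics.stdev(f1_scores)
print(f"F1: {mean:.3f} ± {spread:.3f}")
```

Reporting the standard deviation alongside the mean makes it easy to spot when a difference between two configurations is within run-to-run noise.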

For convenience, we implemented run_experiments.py and collate_results.py in the ner-embeddings project. They let you run multiple architectures with multiple hyperparameter combinations on multiple datasets across multiple random seeds with a single script, and collate all the results afterwards for inspection. We recommend that you check out these utilities and hope they can help you gain better insights into the performance of your pipelines through precise experimentation!

As a side note: we might add a deterministic initialization scheme to Thinc in the future, but research in this direction is still young. For a fun read on the effect of randomness in computer vision, the cheeky “Torch.manual_seed(3407) is all you need” paper goes well with a morning coffee.

How low can you go?

To see how much we can push the Bloom embeddings to save memory, we’ll look at MultiHashEmbed head-to-head with MultiEmbed on the Spanish CoNLL 2002 and Dutch Archeology NER data sets under memory constraints. We report the results for different setups:

  1. The “Bloom” column uses the default number of rows in spaCy.
  2. The “Traditional” column uses the traditional embeddings with one vector for each symbol that appeared at least 10 times in the corpus.
  3. In the “Bloom 10%” and “Bloom 20%” columns, we use only the reported percentage of the number of vectors available for the Traditional embeddings.

The number of vectors for each feature on each data set for the different architectures is shown in the table below. The first two rows show the number of rows in the vector tables of MultiEmbed for the two data sets. The MultiHashEmbed row shows the default number of rows in spaCy 3.4, which we used for the technical report.

| | NORM | PREFIX | SUFFIX | SHAPE |
| --- | --- | --- | --- | --- |
| Spanish CoNLL | 2635 | 80 | 1147 | 88 |
| Dutch Archeology | 3132 | 104 | 1500 | 174 |
| MultiHashEmbed | 5000 | 1000 | 2500 | 2500 |

The following table shows the F1 scores of the same ner pipeline trained while varying the embedding architecture and the number of embeddings:

| Dataset | Bloom | Traditional | Bloom 20% | Bloom 10% |
| --- | --- | --- | --- | --- |
| Spanish CoNLL | 0.77 | 0.79 | 0.78 | 0.78 |
| Dutch Archeology | 0.83 | 0.83 | 0.82 | 0.80 |

What we see is that on Spanish CoNLL the results are more or less unaffected, while on Dutch Archeology the model does incur a slight performance degradation when using only 10% of the vectors. Overall, NER pipelines built on Bloom embeddings remain competitive with those using traditional embeddings, even with ten times fewer vectors. This result is in line with our previous findings when comparing floret with fastText vectors. Turns out you can go pretty low!

Even though the memory usage is tiny at such small vocabulary sizes, let’s go through the exercise of calculating it for completeness. When training on GPU we use float32, and on CPU float64. The width of the embedding table is set to 96 by default. We can use numpy to calculate how many megabytes the arrays take up. For example, for MultiEmbed on Dutch Archeology we have:

import numpy

rows = 3132 + 104 + 1500 + 174
width = 96
embed64 = numpy.random.random((rows, width))
embed32 = embed64.astype(numpy.float32)
print(embed64.nbytes / 1e6)  # 3.77088
print(embed32.nbytes / 1e6)  # 1.88544

The table below reports the memory usage of the various architectures on the Dutch Archeology data set:

| Architecture | float32 (MB) | float64 (MB) |
| --- | --- | --- |
| Bloom 20% | 0.377 | 0.754 |
| Bloom 10% | 0.189 | 0.377 |
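The numbers in the table can be reproduced with a few lines of numpy, assuming “Bloom N%” means keeping N% of the 4,910 rows MultiEmbed uses on this data set:

```python
# Total MultiEmbed rows on Dutch Archeology (NORM + PREFIX + SUFFIX + SHAPE).
rows_total = 3132 + 104 + 1500 + 174
width = 96

def megabytes(rows: int, bytes_per_float: int) -> float:
    # Memory of a rows x width table at the given float size.
    return rows * width * bytes_per_float / 1e6

for fraction in (0.2, 0.1):
    rows = int(rows_total * fraction)
    print(f"Bloom {fraction:.0%}: "
          f"{megabytes(rows, 4):.3f} MB (float32), "
          f"{megabytes(rows, 8):.3f} MB (float64)")
```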

Saving a couple of megabytes is not a big deal here, but using the same technology in floret allows us to reduce the memory consumption of fastText vectors from around 3 GB down to about 300 MB.

However, one relevant detail we noticed during our ner experiments is that the default number of rows in MultiHashEmbed is too high for the data sets and languages we considered. Since Bloom embeddings seem robust to the number of rows, as of spaCy v3.5 the default tok2vec architecture HashEmbedCNN simplifies the setup and uses 2000 rows for each of the four features. This might still seem like too much, but languages like Chinese have a large number of prefixes, and data sets based on social media can have a large number of shapes: the WNUT 2017 benchmark, for example, has 2103.

Warning: Unseen Entities!

The standard way of evaluating NER pipelines is to create a large data set and randomly split it into training, development and test sets. However, this scenario might overestimate the true generalization capabilities, especially on unseen entities, i.e. entities not present in the training set. This is an important aspect of evaluating ner pipelines because the primary goal of many real-world named entity recognition systems is to identify novel entities.

To get a better picture of the ner performance on various data sets, we took the original test sets and created a seen portion containing only entities that appear in the training set, along with the complementary unseen portion. The table below shows a subset of our results (F1-score) for the Bloom embedding-based pipeline with additional pre-trained vectors from the en_core_web_lg pipeline:

| Dataset | Seen | Unseen |
| --- | --- | --- |
| Dutch CoNLL | 0.83 | 0.70 |

Across the board, we find a dramatic decrease in F1-scores on unseen entities for Bloom embeddings, and we observe the same pattern for traditional embeddings:

| Dataset | Seen | Unseen |
| --- | --- | --- |
| Dutch CoNLL | 0.84 | 0.73 |

This is a pervasive pattern in NER systems: even in the age of BioBERT, biomedical named entity recognizers are much better at memorization than generalization (paper). However, it is worth noting that on some data sets humans also seem to struggle to correctly classify entities when relying only on contextual cues (paper).

To help avoid surprises and better evaluate ner pipelines, we included the generate-unseen command in the span-labeling-datasets project, which we used to create the seen and unseen diagnostic sets.
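The idea behind the split can be sketched in a few lines. The helper below is a hypothetical simplification that represents entities as plain strings, whereas the actual generate-unseen command operates on annotated .spacy files:

```python
def split_seen_unseen(train_entities, test_examples):
    """Partition test examples by whether all their entities occur in training.

    train_entities: iterable of entity strings from the training set.
    test_examples: list of (text, entities) pairs.
    """
    seen_set = set(train_entities)
    seen, unseen = [], []
    for text, entities in test_examples:
        if all(ent in seen_set for ent in entities):
            seen.append((text, entities))
        else:
            unseen.append((text, entities))
    return seen, unseen

train = {"Amsterdam", "Explosion"}
test = [
    ("Explosion is in Amsterdam", ["Explosion", "Amsterdam"]),
    ("Utrecht is nearby", ["Utrecht"]),
]
seen, unseen = split_seen_unseen(train, test)
# The first example lands in the seen portion, the second in unseen.
```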

Orthographic Features

We were interested in whether the addition of multiple orthographic features translates into gains in named entity recognition performance, especially on unseen entities. Here we take the Dutch CoNLL data set as an example and report the relative increase in error as we take features away. This is a common way to get a sense of each feature's contribution by comparing a full model (first row) with ablated models (the remaining rows). We compute the F1 score for each model and report -(ablated - full) / (1 - full). For example, if the full model scores 83% F1 and an ablated model achieves only 73% F1, then the relative increase in error is roughly 59%. The “All” column reports the result on the full data set, while the “Seen” and “Unseen” columns cover only the entities seen or unseen during training.


The first row in the table above is the default setup of the embedding layer of spaCy, utilizing all four features. The subsequent three rows remove one feature at a time, while the final row uses only the raw orthographic surface forms of the tokens.

What we find is that the SHAPE information seems to be mainly beneficial for unseen entities, while the rest of the features are crucial for both seen and unseen entities. In general, we do see more increase in error for seen entities as we take away features because those are what the models tend to capture in the first place.
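The relative-increase-in-error formula from the ablation discussion is easy to check directly:

```python
def relative_error_increase(full_f1: float, ablated_f1: float) -> float:
    # -(ablated - full) / (1 - full): the fraction of the full model's
    # remaining error headroom that the ablation gives up.
    return -(ablated_f1 - full_f1) / (1 - full_f1)

increase = relative_error_increase(0.83, 0.73)
print(f"{increase:.1%}")  # 58.8%
```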

The value of pre-trained vectors

Let us examine how much we can expect pre-trained vectors to mitigate the unseen-entity issue. Here is what we’ve found in our experiments (F1-score):

| Dataset | All Entities | All Entities + lg | Unseen | Unseen + lg |
| --- | --- | --- | --- | --- |
| Dutch CoNLL | 73% | 83% | 59% | 70% |
| Spanish CoNLL | 77% | 82% | 60% | 74% |
| Dutch Archeology | 82% | 83% | 35% | 40% |

Across the board we observe the benefit of including the en_core_web_lg vectors in the pipelines. The difference in performance is much more pronounced when considering unseen entities, especially for the larger Dutch Archeology and OntoNotes data sets. As such, we recommend using pre-trained embeddings for NER pipelines when possible.

Before you go

If you are interested in learning more, you can find the rest of the results in our technical report. You can also try out your own pipelines on the data sets we used through the span-labeling-datasets project. It includes utilities to preprocess all data sets into .spacy format and to create the seen/unseen diagnostic splits. To get a bit more information about those data sets beyond what debug data currently provides, you can check out the analyze.py script. Unfortunately, we could not include the OntoNotes data set from our paper, as its license does not permit redistribution. Instead, we added a data set we did not use in the technical report: MIT Restaurant Reviews. We are curious what you find out on it! The ner-embeddings project implements all the additional architectures we used for comparison, as well as the scripts that helped us run experiments in bulk.

To scrutinize the robustness of named entity recognizers beyond unseen entities, our colleague Lj has run experiments creating challenging splits and data perturbations and has written a blog post about the results. The corresponding vs-split library is a work in progress!

We hope you find our report and the additional tools useful when developing your own pipelines or when familiarizing yourself with the world of NER. We wish you a nice rest of your day, and watch out for those unseen entities!