Neural edit-tree lemmatization for spaCy

· by Daniël de Kok · ~12 min. read

We are happy to introduce a new, experimental, machine learning-based lemmatizer that posts accuracies above 95% for many languages. This lemmatizer learns to predict lemmatization rules from a corpus of examples and removes the need to write an exhaustive set of per-language lemmatization rules.

spaCy provides a Lemmatizer component for assigning base forms (lemmas) to tokens. For example, it lemmatizes the sentence

The kids bought treats from various stores.

to its base forms:

the kid buy treat from various store.

Lemmas are useful in many applications. For example, a search engine could use lemmas to match all inflections of a base form. In this way, a query like buy could match its inflections buy, buys, buying, and bought.

[Figure: Lemma-based query example]
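As a minimal sketch of this idea (assuming the en_core_web_sm pipeline has been downloaded), you could match on lemmas like this:

import spacy

# Load an English pipeline that includes a lemmatizer.
nlp = spacy.load("en_core_web_sm")

doc = nlp("The kids bought treats from various stores.")
query = nlp("buy")
query_lemma = query[0].lemma_

# Match every token whose lemma equals the query's lemma.
matches = [token.text for token in doc if token.lemma_ == query_lemma]
print(matches)  # ['bought']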

English is one of the few languages with a relatively simple inflectional morphology, so ignoring morphology or using a crude approximation like stemming can work decently well for many applications, such as search engines. But for most languages, you need good lemmatization just to get a sensible list of term frequencies.

For most languages, the spaCy lemmatizer uses two mechanisms:

  1. A lookup table that maps inflections to their lemmas. For example, the table could specify that buys is lemmatized as buy. The Lemmatizer component also supports lookup tables that are indexed by form and part-of-speech, which allows the same orthographic form to be lemmatized differently depending on its word class. For example, the verbal form chartered in they chartered a plane should be lemmatized as charter, whereas the adjective chartered in a chartered plane should be lemmatized as chartered.
  2. A rule set that rewrites a token to its lemma in certain constrained ways. For example, one rule could specify that a token ending in the suffix -ed with the part-of-speech tag VERB is lemmatized by removing that suffix. The rules can only operate over the suffix of the token, so they are only suitable for simple morphological systems that are mostly concatenative, such as English.

These mechanisms can also be combined. For instance, a lookup table could be used for irregular forms and a set of rules for regular forms.
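To make the two mechanisms concrete, here is a minimal sketch with a toy lookup table and toy suffix rules (illustrative data and a simplified rule format, not spaCy's actual tables):

# Toy lookup table indexed by (form, part-of-speech); irregular forms go here.
LOOKUP = {
    ("bought", "VERB"): "buy",
    ("chartered", "VERB"): "charter",
    ("chartered", "ADJ"): "chartered",
}

# Toy suffix rules per part-of-speech: (old_suffix, new_suffix).
RULES = {
    "VERB": [("ed", ""), ("ing", ""), ("s", "")],
    "NOUN": [("s", "")],
}

def lemmatize(form: str, pos: str) -> str:
    # 1. Irregular forms: exact lookup by form and part-of-speech.
    if (form, pos) in LOOKUP:
        return LOOKUP[(form, pos)]
    # 2. Regular forms: apply the first matching suffix rewrite.
    for old, new in RULES.get(pos, []):
        if form.endswith(old):
            return form[: len(form) - len(old)] + new
    # 3. Fall back to the surface form.
    return form

print(lemmatize("bought", "VERB"))    # buy
print(lemmatize("treats", "NOUN"))    # treat
print(lemmatize("chartered", "ADJ"))  # chartered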

The accuracy of the Lemmatizer component on a particular language depends on how comprehensive the lookup table and rule set for that language are. Developing a comprehensive rule set requires a fair amount of labor, even for linguists who are familiar with the language.

Edit trees

Since corpora with lemma annotations are available for many languages, it would be more convenient if a lemmatizer could infer lemmatization rules automatically from a set of examples. Consider for example the Dutch past participle form gepakt and its lemma pakken (to take). It is fairly straightforward to come up with a rule for lemmatizing gepakt:

  1. Find the longest common substring (LCS) of the inflected form and its lemma; for gepakt and pakken, the LCS is pak. The longest common substring often captures the stem of the words.
  2. Split the inflected form and the lemma into three parts: the prefix, the LCS, and the suffix.
  3. Find the changes that need to be made to the prefix and suffix to go from the inflected form to the lemma:
    a. Replace the prefix ge- by the empty string ε.
    b. Replace the suffix -t by the string -ken.

Steps 3a and 3b would then together form a single lemmatization rule that works for (most) regularly inflected Dutch past participles of the general form ge- [stem-ending-in-k] -t, such as gepakt or gelekt.

In practice, the rule-finding algorithm is a bit more complex, since there may be multiple shared substrings. For example, the Dutch verb afpakken (to take away) contains the separable verb prefix af-. Its past participle is afgepakt, so the past participle and the lemma share two substrings: af and pak. This is accounted for by using a recursive version of the algorithm above: rather than simply replacing the string afge by af, we apply the algorithm to these two substrings as well.
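The sketch below is a simplified reconstruction of this recursive rule construction (not spaCy's actual implementation): find the LCS, keep it as a shared infix, and recurse on the prefix pair and the suffix pair:

def longest_common_substring(a: str, b: str) -> tuple[int, int, int]:
    """Return (start_a, start_b, length) of the longest common substring."""
    best = (0, 0, 0)
    # Simple brute-force search; fine for word-length strings.
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best[2]:
                best = (i, j, k)
    return best

def build_edit_tree(form: str, lemma: str):
    start_f, start_l, length = longest_common_substring(form, lemma)
    if length == 0:
        # Leaf node: rewrite the remaining form substring to the
        # remaining lemma substring (either may be the empty string).
        return ("replace", form, lemma)
    # Interior node: keep the shared infix, recurse on the prefix
    # and suffix pairs on either side of it.
    left = build_edit_tree(form[:start_f], lemma[:start_l])
    right = build_edit_tree(form[start_f + length:], lemma[start_l + length:])
    return ("interior", start_f, len(form) - start_f - length, left, right)

# The tree for afgepakt -> afpakken keeps the infixes "af" and "pak"
# and rewrites "ge" to ε and "t" to "ken".
print(build_edit_tree("afgepakt", "afpakken"))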

This recursive algorithm and the corresponding rule representation were proposed in Joint Lemmatization and Morphological Tagging with Lemming (Thomas Müller et al., 2015). The recursive data structure that the algorithm produces is a so-called edit tree. Edit trees have two types of nodes:

[Figure: Edit tree node legend]

You could see these two types of nodes as small functions:

  • Interior node: splits a string into three parts: 1. a prefix of length n; 2. an infix; and 3. a suffix of length m. Then it applies its left child to the prefix and its right child to the suffix. Finally, it returns the concatenation of the transformed prefix, the infix, and the transformed suffix.
  • Leaf node: checks that the input string is s (otherwise, the tree is not applicable) and if so, returns t.

These two node types can be combined into a tree, which recursively rewrites string prefixes and suffixes, while retaining infixes (which are substrings shared by the token and its lemma). Below, you will find the edit tree that is the result of applying the rule construction algorithm to the pair afgepakt and afpakken.

[Figure: Edit tree example]

The grey nodes represent the edit tree itself. The purple and orange edges show the prefixes and suffixes that are the inputs to the tree nodes when the tree is applied to afgepakt. The black edges show the outputs of the tree nodes.

One nice property of edit trees is that they leave out as much of the surface form as possible. For this reason, the edit tree also generalizes to other verbs with the same inflectional pattern, such as afgeplakt (taped) or even opgepakt (picked up or arrested).
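To make this concrete, the following sketch applies the tuple-encoded tree from the construction sketch above (again a simplified reconstruction, not spaCy's implementation). Note how the tree built from afgepakt and afpakken also lemmatizes afgeplakt and opgepakt:

def apply_edit_tree(tree, s: str):
    """Apply an edit tree to string s; return None if it is not applicable."""
    if tree[0] == "replace":
        # Leaf node: the input must be exactly the expected substring.
        _, expected, replacement = tree
        return replacement if s == expected else None
    # Interior node: split off a prefix of length n and a suffix of
    # length m, transform them with the child trees, keep the infix.
    _, n, m, left, right = tree
    if n + m > len(s):
        return None
    prefix = apply_edit_tree(left, s[:n])
    suffix = apply_edit_tree(right, s[len(s) - m:])
    if prefix is None or suffix is None:
        return None
    return prefix + s[n:len(s) - m] + suffix

# The tree produced by build_edit_tree("afgepakt", "afpakken") above:
tree = (
    "interior", 4, 1,
    ("interior", 0, 2, ("replace", "", ""), ("replace", "ge", "")),
    ("replace", "t", "ken"),
)
print(apply_edit_tree(tree, "afgepakt"))   # afpakken
print(apply_edit_tree(tree, "afgeplakt"))  # afplakken
print(apply_edit_tree(tree, "opgepakt"))   # oppakken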

Learning to predict edit trees

Given a large corpus where tokens are annotated with their lemmas, we can use the algorithm discussed earlier to extract an edit tree for each token-lemma pair. This typically results in hundreds or thousands of unique edit trees for a reasonably-sized corpus. The number of edit trees is much smaller than the number of types (unique words), since most words are inflected following regular patterns. However, how do we know which edit tree to apply when we are asked to lemmatize a token?

Treating the task of picking the right edit tree as a classification task turns out to work surprisingly well. In this approach, each edit tree is considered to be a class, and we use a softmax layer to compute a probability distribution over all trees for a particular token. We can then apply the most probable edit tree to lemmatize the token. If the most probable tree cannot be applied, there is the option to back off to the next most probable tree.
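Sketched in a few lines (hypothetical names, reusing apply_edit_tree from the sketch above), this decoding step could look like:

def lemmatize_token(form, tree_probs, trees, top_k=1):
    # Indices of the trees, sorted by descending probability.
    ranked = sorted(range(len(tree_probs)), key=lambda i: -tree_probs[i])
    for tree_id in ranked[:top_k]:
        lemma = apply_edit_tree(trees[tree_id], form)
        if lemma is not None:  # the tree was applicable to this form
            return lemma
    # Back off: return the orthographic form unchanged.
    return form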

The quality of the predictions is largely determined by the hidden representations that are provided to the softmax layer. These representations should encode both subword and contextual information:

  • Subword information is relevant for choosing a tree that is applicable to the surface form. For instance, it does not make sense to apply the edit tree that was discussed above to tokens without the infix -ge-, the suffix -t, or a two-letter separable verb particle such as af.
  • Contextual information is needed to disambiguate surface forms. In many languages, the inflectional affixes are specific to a part-of-speech. So, in order to pick the correct edit tree for a token, a model also needs to infer its part-of-speech. For instance, walking in She is walking should be lemmatized as walk, whereas walking in I bought new walking shoes has walking as its lemma.
  • Sometimes it is also necessary to disambiguate the word sense in order to choose the correct edit tree. For example, axes can either be the plural of the noun axis or the plural of the noun axe. In order to pick the correct lemmatization, a model would first need to infer from the context which sense of axes was used.

Luckily, the venerable HashEmbedCNN layer provides both types of information to the classifier: word and subword representations through the MultiHashEmbed layer and contextual information through the MaxoutWindowEncoder layer. Another good option for integrating both types of information is to use the transformer models provided through spacy-transformers.
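For reference, a tok2vec layer of this kind is configured along these lines (an illustrative configuration; the exact widths and depths should be tuned for your pipeline):

[components.tok2vec.model]
@architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true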

How well does it work?

We have created a new experimental_edit_tree_lemmatizer component that combines the techniques discussed in this post. We have also run experiments on several languages to gauge how well this lemmatizer works. In these experiments, we trained pipelines with the tok2vec, tagger (where applicable), and morphologizer components, combined with either the default spaCy lemmatizer or the new edit tree lemmatizer. The accuracies, as well as the CPU prediction speeds in words per second (WPS), are shown in the table below:

| Language | Vectors | Lemmatizer Accuracy | Lemmatizer Speed¹ | Edit Tree Lemmatizer Accuracy | Edit Tree Lemmatizer Speed¹ |
| --- | --- | --- | --- | --- | --- |
| nl | xlm-roberta-base (transformer) | 0.86 | 1,772 | 0.98 | 1,712 |
| pl | xlm-roberta-base (transformer) | 0.88 | 1,631 | 0.97 | 1,554 |

  1. Speeds are in words per second (WPS), measured on the test set using three evaluation passes as warmup.

For the tested languages, the edit tree lemmatizer provides considerable improvements, generally posting accuracies above 95%.

We configured the edit tree lemmatizer to share the same token representations as the other components in the pipeline, which means the benefits of the edit tree lemmatizer are especially clear if you’re using a transformer model. Transformers take longer to run, so the edit tree lemmatizer adds proportionally less to the pipeline’s total runtime. Transformers also supply more informative token representations, increasing the edit tree lemmatizer’s accuracy advantage over the rule-based lemmatizer.

Trying out the edit tree lemmatizer

We should emphasize that the edit tree lemmatizer component is currently still experimental. However, thanks to the function registry support in spaCy v3, it is easy to try out the new lemmatizer in your own pipelines. First install the spacy-experimental Python package:

Installing the spacy-experimental package:

pip install -U pip setuptools wheel
pip install spacy-experimental==0.2.0

You can then use the experimental_edit_tree_lemmatizer component factory:

Basic edit tree lemmatizer configuration:

[components.experimental_edit_tree_lemmatizer]
factory = "experimental_edit_tree_lemmatizer"

That’s all! Of course, we encourage you to experiment with more than the default model. First of all, you can change the behavior of the edit tree lemmatizer using the options described in the table below:

| Option | Description |
| --- | --- |
| backoff | The token attribute that is used when the lemmatizer fails to find an applicable edit tree. The default is to use the orth attribute to get the orthographic form. |
| min_tree_freq | The required minimum frequency of an edit tree in the training data for the tree to be included in the model. |
| top_k | The number of most probable trees that should be tried for lemmatization before resorting to the backoff attribute. |
| overwrite | If enabled, the lemmatizer will overwrite lemmas set by previous components in the pipeline. |
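As a quick way to try these options outside of a full training config, you can also add the component to a pipeline in Python. This is a minimal sketch; the component still needs to be trained before it produces useful lemmas:

import spacy
import spacy_experimental  # noqa: F401 -- ensures the experimental factories are registered

nlp = spacy.blank("nl")

# Override some of the options from the table above.
nlp.add_pipe(
    "experimental_edit_tree_lemmatizer",
    config={"backoff": "orth", "min_tree_freq": 3, "top_k": 2, "overwrite": False},
)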

Secondly, you can also share hidden representations between the edit tree lemmatizer and other components by using Tok2VecListener, as shown in the example below. In many cases, joint training with components that perform morphosyntactic annotation, such as Tagger or Morphologizer, can improve the accuracy of the lemmatizer.

Edit tree lemmatizer configuration that uses a shared tok2vec component:

[components.experimental_edit_tree_lemmatizer]
factory = "experimental_edit_tree_lemmatizer"
backoff = "orth"
min_tree_freq = 3
overwrite = false
top_k = 1

[components.experimental_edit_tree_lemmatizer.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.experimental_edit_tree_lemmatizer.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "tok2vec"

Sample project

If you would rather start from a ready-to-use example, you can use the example project for the edit tree lemmatizer. You can fetch this project with spaCy’s project command and install the necessary dependencies:

Get the sample project and install dependencies:

python -m spacy project clone projects/edit_tree_lemmatizer
cd edit_tree_lemmatizer
pip install -r requirements.txt

The training and evaluation data can be downloaded with the project assets command, and the lemmatizer can then be trained and evaluated using the run all workflow. By default, the project uses the Dutch Alpino treebank as provided by the Universal Dependencies project, so the following commands will train and evaluate a Dutch lemmatizer:

Fetch data and train a lemmatization model:

python -m spacy project assets
python -m spacy project run all

You can edit the config to try out different settings or adapt the pipeline to your requirements, edit the project.yml file to use different data or add preprocessing steps, and use spacy project push and spacy project pull to persist intermediate results to remote storage and share them with your team.

You can help!

We have made this new lemmatizer available through spacy-experimental, our package with experimental spaCy components. In the future, we would like to move the functionality of the edit tree lemmatizer into spaCy. You can help make this happen by trying out the edit tree lemmatizer and posting your experiences and feedback to the spaCy discussion forums.

About the author

  • Daniël de Kok, Machine Learning Engineer and spaCy Core Developer, Groningen, Netherlands