We are happy to introduce a new, experimental, machine learning-based lemmatizer that posts accuracies above 95% for many languages. This lemmatizer learns to predict lemmatization rules from a corpus of examples and removes the need to write an exhaustive set of per-language lemmatization rules.
spaCy provides a Lemmatizer component for assigning base forms (lemmas) to tokens. For example, it lemmatizes the sentence "The kids bought treats from various stores." to its base forms: "the kid buy treat from various store."
Lemmas are useful in many applications. For example, a search engine could use lemmas to match all inflections of a base form. In this way, a query for buy could match buy itself as well as its inflected forms buys, buying, and bought.
English is one of the few languages with a relatively simple inflectional morphology, so ignoring morphology or using a crude approximation like stemming can work decently well for many applications, such as search engines. But for most languages, you need good lemmatization just to get a sensible list of term frequencies.
The spaCy lemmatizer uses two mechanisms for lemmatization for most languages:
- A lookup table that maps inflections to their lemmas. For example, the table could specify that buys is lemmatized as buy. The Lemmatizer component also supports lookup tables that are indexed by form and part-of-speech. This allows different lemmatization of the same orthographic form depending on its word class. For example, the verb chartered in they chartered a plane should be lemmatized as charter, whereas the adjective chartered in a chartered plane should be lemmatized as chartered.
- A rule set that rewrites a token to its lemma in certain constrained ways. For example, one rule could specify that a token ending in the suffix -ed with the part-of-speech tag VERB is lemmatized by removing that suffix. The rules can only operate on the suffix of the token, so they are only suitable for simple morphological systems that are mostly concatenative, such as English.
These mechanisms can also be combined. For instance, a lookup table could be used for irregular forms and a set of rules for regular forms.
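To see this lookup- and rule-based lemmatization in action, here is a minimal sketch that runs the example sentence from above through a trained English pipeline. It assumes you have downloaded en_core_web_sm; the expected output is taken from the example above rather than guaranteed by us:

```python
import spacy

# Assumes the small English pipeline has been downloaded with:
#   python -m spacy download en_core_web_sm
# Its lemmatizer combines lookup tables and suffix rules with part-of-speech
# information produced earlier in the pipeline.
nlp = spacy.load("en_core_web_sm")

doc = nlp("The kids bought treats from various stores.")
print(" ".join(token.lemma_ for token in doc))
# Per the example above: "the kid buy treat from various store ."
```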
The accuracy of the Lemmatizer component for a particular language depends on how comprehensive that language's lookup table and rule set are. Developing a comprehensive rule set requires a fair amount of labor, even for linguists who are familiar with the language.
Edit trees
Since corpora with lemma annotations are available for many languages, it would be more convenient if a lemmatizer could infer lemmatization rules automatically from a set of examples. Consider for example the Dutch past participle form gepakt and its lemma pakken (to take). It is fairly straightforward to come up with a rule for lemmatizing gepakt:
1. Find the longest common substring (LCS) of the inflected form and its lemma: gepakt and pakken share the substring pak. The longest common substring often captures the stem of the word.
2. Split the inflected form and the lemma into three parts: the prefix, the LCS, and the suffix.
3. Find the changes that need to be made to the prefix and the suffix to go from the inflected form to the lemma:
   a. Replace the prefix ge- by the empty string ε
   b. Replace the suffix -t by the string -ken

3a and 3b together form a single lemmatization rule that works for (most) regularly inflected Dutch past participles with the general form ge- [stem ending in k] -t, such as gepakt or gelekt.
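As a small, self-contained illustration of these three steps (not spaCy's implementation), the snippet below uses Python's difflib to find the longest common substring of gepakt and pakken and derives the prefix and suffix replacements:

```python
from difflib import SequenceMatcher

form, lemma = "gepakt", "pakken"

# Step 1: find the longest common substring of the inflected form and the lemma.
m = SequenceMatcher(None, form, lemma).find_longest_match(0, len(form), 0, len(lemma))
lcs = form[m.a : m.a + m.size]

# Step 2: split both strings into prefix, LCS, and suffix.
form_prefix, form_suffix = form[: m.a], form[m.a + m.size :]
lemma_prefix, lemma_suffix = lemma[: m.b], lemma[m.b + m.size :]

# Step 3: the prefix and suffix replacements form the lemmatization rule.
print(lcs)                                             # pak
print(f"prefix: {form_prefix!r} -> {lemma_prefix!r}")  # 'ge' -> ''
print(f"suffix: {form_suffix!r} -> {lemma_suffix!r}")  # 't'  -> 'ken'
```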
In practice, the rule-finding algorithm is a bit more complex, since there may be multiple shared substrings. For example, the Dutch verb afpakken (to take away) contains the separable verb prefix af-. Its past participle is afgepakt, so the past participle and the lemma have two shared substrings, af and pak. This is accounted for by using a recursive version of the algorithm above: rather than simply replacing the string afge by af in one opaque step, we apply the same algorithm to this prefix pair (and to the corresponding suffix pair) as well.
This recursive algorithm and the corresponding rule representation were proposed in Joint Lemmatization and Morphological Tagging with Lemming (Thomas Müller et al., 2015). The recursive data structure that the algorithm produces is a so-called edit tree. Edit trees have two types of nodes, which you can think of as small functions:
- Interior node: splits a string into three parts: a prefix of length n, an infix, and a suffix of length m. It then applies its left child to the prefix and its right child to the suffix, and returns the concatenation of the transformed prefix, the infix, and the transformed suffix.
- Leaf node: checks that the input string is s (otherwise, the tree is not applicable) and, if so, returns t.
These two node types can be combined into a tree, which recursively rewrites string prefixes and suffixes, while retaining infixes (which are substrings shared by the token and its lemma). Below, you will find the edit tree that is the result of applying the rule construction algorithm to the pair afgepakt and afpakken.
The grey nodes represent the edit tree itself. The purple and orange edges show the prefixes and suffixes that are the inputs to the tree nodes when the tree is applied to afgepakt. The black edges show the outputs of the tree nodes.
One nice property of edit trees is that they leave out as much of the surface form as possible. For this reason, the edit tree also generalizes to other verbs with the same inflectional pattern, such as afgeplakt (taped) or even opgepakt (picked up or arrested).
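To make the data structure more concrete, here is a small sketch of edit tree construction and application in plain Python. The names (Leaf, Interior, build_tree, apply_tree) are illustrative and are not spaCy's internal API, and the longest common substring is approximated with difflib's SequenceMatcher:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Union


@dataclass
class Leaf:
    # Replace the full input string `orig` by `subst`; only applicable if the
    # input is exactly `orig`.
    orig: str
    subst: str


@dataclass
class Interior:
    # Split the input into a prefix of length `prefix_len`, an infix, and a
    # suffix of length `suffix_len`; rewrite prefix and suffix with the children.
    prefix_len: int
    suffix_len: int
    left: Optional["EditTree"]
    right: Optional["EditTree"]


EditTree = Union[Leaf, Interior]


def build_tree(form: str, lemma: str) -> EditTree:
    """Recursively construct an edit tree that rewrites `form` into `lemma`."""
    m = SequenceMatcher(None, form, lemma).find_longest_match(0, len(form), 0, len(lemma))
    if m.size == 0:
        # No shared substring left: replace the whole string.
        return Leaf(orig=form, subst=lemma)
    # Recurse into the parts before and after the longest common substring.
    left = build_tree(form[: m.a], lemma[: m.b]) if m.a or m.b else None
    right_form, right_lemma = form[m.a + m.size :], lemma[m.b + m.size :]
    right = build_tree(right_form, right_lemma) if right_form or right_lemma else None
    return Interior(prefix_len=m.a, suffix_len=len(form) - m.a - m.size, left=left, right=right)


def apply_tree(tree: Optional[EditTree], s: str) -> Optional[str]:
    """Apply an edit tree to a string, or return None if it is not applicable."""
    if tree is None:
        return s
    if isinstance(tree, Leaf):
        return tree.subst if s == tree.orig else None
    if tree.prefix_len + tree.suffix_len > len(s):
        return None
    end = len(s) - tree.suffix_len
    prefix, infix, suffix = s[: tree.prefix_len], s[tree.prefix_len : end], s[end:]
    new_prefix = apply_tree(tree.left, prefix)
    new_suffix = apply_tree(tree.right, suffix)
    if new_prefix is None or new_suffix is None:
        return None
    return new_prefix + infix + new_suffix


tree = build_tree("afgepakt", "afpakken")
print(apply_tree(tree, "afgepakt"))   # afpakken
print(apply_tree(tree, "afgeplakt"))  # afplakken: the tree generalizes
print(apply_tree(tree, "opgepakt"))   # oppakken
```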
Learning to predict edit trees
Given a large corpus in which tokens are annotated with their lemmas, we can use the algorithm discussed earlier to extract an edit tree for each token and its lemma. This typically results in hundreds or thousands of unique edit trees for a reasonably sized corpus. The number of edit trees is much smaller than the number of types (unique words), since most words are inflected following regular patterns. But how do we know which edit tree to apply when we are asked to lemmatize a token?
Treating the task of picking the right edit tree as a classification task turns out to work surprisingly well. In this approach, each edit tree is considered to be a class, and we use a Softmax layer to compute a probability distribution over all trees for a particular token. We can then apply the most probable edit tree to lemmatize the token. If the most probable tree cannot be applied, there is the option to back off to the next most probable tree.
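As a toy illustration of this classification-with-backoff idea (again, not spaCy's implementation), the snippet below reuses build_tree and apply_tree from the sketch above, turns some made-up classifier scores into a probability distribution with a softmax, and backs off to the surface form when none of the top-k trees applies:

```python
import numpy as np

# Candidate edit trees extracted from a tiny, made-up set of examples.
trees = [
    build_tree("afgepakt", "afpakken"),  # ge-…-t participle with a separable particle
    build_tree("gepakt", "pakken"),      # ge-…-t participle without a particle
    build_tree("loopt", "lopen"),        # a present-tense pattern that won't apply here
]

# Made-up classifier scores for the token "afgeplakt"; a real model would
# compute these from subword and contextual features.
scores = np.array([2.1, 0.3, 1.2])
probs = np.exp(scores) / np.exp(scores).sum()  # softmax


def lemmatize(form: str, top_k: int = 1) -> str:
    # Try the top_k most probable trees; back off to the surface form itself.
    for idx in np.argsort(-probs)[:top_k]:
        lemma = apply_tree(trees[idx], form)
        if lemma is not None:
            return lemma
    return form


print(lemmatize("afgeplakt", top_k=2))  # afplakken
```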
The quality of the predictions is largely determined by the hidden representations that are provided to the softmax layer. These representations should encode both subword and contextual information:
- Subword information is relevant for choosing a tree that is applicable to the surface form. For instance, it does not make sense to apply the edit tree that was discussed above to tokens without the infix -ge-, the suffix -t, or a two-letter separable verb particle such as af.
- Contextual information is needed to disambiguate surface forms. In many languages, the inflectional affixes are specific to a part-of-speech. So, in order to pick the correct edit tree for a token, a model also needs to infer its part-of-speech. For instance, walking in She is walking should be lemmatized as walk, whereas walking in I bought new walking shoes has walking as its lemma.
- Sometimes it is also necessary to disambiguate the word sense in order to choose the correct edit tree. For example, axes can either be the plural of the noun axis or the plural of the noun axe. In order to pick the correct lemmatization, a model would first need to infer from the context which sense of axes was used.
Luckily, the venerable HashEmbedCNN layer provides both types of information to the classifier: word and subword representations through the MultiHashEmbed layer, and contextual information through the MaxoutWindowEncoder layer. Another good option for integrating both types of information is a transformer model, provided through spacy-transformers.
How well does it work?
We have created a new experimental_edit_tree_lemmatizer component that combines the techniques discussed in this post. We have also run experiments on several languages to gauge how well this lemmatizer works. In these experiments, we trained pipelines with the tok2vec, tagger (where applicable), and morphologizer components, together with either the default spaCy lemmatizer or the new edit tree lemmatizer. The accuracies, as well as the CPU prediction speeds in words per second (WPS), are shown in the table below:
Language | Vectors | Lemmatizer Accuracy | Lemmatizer Speed¹ | Edit Tree Lemmatizer Accuracy | Edit Tree Lemmatizer Speed¹ |
---|---|---|---|---|---|
de | de_core_news_lg | 0.70 | 39,567 | 0.97 | 31,043 |
es | es_core_news_lg | 0.98 | 46,388 | 0.99 | 39,018 |
it | it_core_news_lg | 0.86 | 43,397 | 0.97 | 33,419 |
nl | nl_core_news_lg | 0.86 | 51,395 | 0.96 | 40,421 |
pl | pl_core_news_lg | 0.87 | 17,920 | 0.94 | 15,429 |
pt | pt_core_news_lg | 0.76 | 45,097 | 0.97 | 39,783 |
nl | xlm-roberta-base (transformer) | 0.86 | 1,772 | 0.98 | 1,712 |
pl | xlm-roberta-base (transformer) | 0.88 | 1,631 | 0.97 | 1,554 |
¹ Speeds are in words per second (WPS), measured on the test set using three evaluation passes as warmup.
For the tested languages, the edit tree lemmatizer provides considerable improvements, generally posting accuracies above 95%.
We configured the edit tree lemmatizer to share its token representations with the other components in the pipeline, which means the benefits of the edit tree lemmatizer are especially clear if you’re using a transformer model. Since transformers take longer to run, the edit tree lemmatizer adds proportionally little to the total runtime of the pipeline. Transformers also supply more informative token representations, increasing the edit tree lemmatizer’s accuracy advantage over the rule-based lemmatizer.
Trying out the edit tree lemmatizer
We should emphasize that the edit tree lemmatizer component is currently still experimental. However, thanks to the function registry support in spaCy v3, it is easy to try out the new lemmatizer in your own pipelines. First install the spacy-experimental Python package:
Installing the spacy-experimental package
pip install -U pip setuptools wheel
pip install spacy-experimental==0.4.0
You can then use the experimental_edit_tree_lemmatizer
component factory:
Basic edit tree lemmatizer configuration
[components.experimental_edit_tree_lemmatizer]
factory = "experimental_edit_tree_lemmatizer"
That’s all! Of course, we encourage you to experiment with more than the default model. First of all, you can change the behavior of the edit tree lemmatizer using the options described in the table below:
Option | Description |
---|---|
backoff | The token attribute to use when the lemmatizer fails to find an applicable edit tree. The default is the orth attribute, i.e. the orthographic form. |
min_tree_freq | The minimum frequency of an edit tree in the training data for it to be included in the model. |
top_k | The number of most probable trees to try for lemmatization before resorting to the backoff attribute. |
overwrite | If enabled, the lemmatizer will overwrite lemmas set by previous components in the pipeline. |
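For example, here is a minimal Python sketch that adds the component to a blank Dutch pipeline and overrides some of these options. It assumes spacy-experimental is installed, which registers the component factory via its entry points:

```python
import spacy

# Create a blank Dutch pipeline and add the experimental lemmatizer.
# The component still needs to be trained before it can assign lemmas.
nlp = spacy.blank("nl")
nlp.add_pipe(
    "experimental_edit_tree_lemmatizer",
    config={"backoff": "orth", "min_tree_freq": 3, "top_k": 1, "overwrite": False},
)
print(nlp.pipe_names)  # ['experimental_edit_tree_lemmatizer']
```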
Secondly, you can also share hidden representations between the edit tree lemmatizer and other components by using Tok2VecListener, as shown in the example below. In many cases, joint training with components that perform morphosyntactic annotation, such as Tagger or Morphologizer, can improve the accuracy of the lemmatizer.
Edit tree lemmatizer configuration that uses a shared tok2vec component
[components.experimental_edit_tree_lemmatizer]
factory = "experimental_edit_tree_lemmatizer"
backoff = "orth"
min_tree_freq = 3
overwrite = false
top_k = 1

[components.experimental_edit_tree_lemmatizer.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.experimental_edit_tree_lemmatizer.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "tok2vec"
Sample project
If you would rather start out with a ready-to-use example, you can use the example project for the edit tree lemmatizer. You can fetch this project with spaCy’s project command and install the necessary dependencies:
Get the sample project and install dependencies
python -m spacy project clone projects/edit_tree_lemmatizer \
  --repo https://github.com/explosion/spacy-experimental \
  --branch v0.4.0
cd edit_tree_lemmatizer
pip install spacy-experimental==0.4.0
The training and evaluation data can be downloaded with the project assets command. The lemmatizer can then be trained and evaluated using the run all workflow. By default, the project uses the Dutch Alpino treebank as provided by the Universal Dependencies project. So, the following commands will train and evaluate a Dutch lemmatizer:
Fetch data and train a lemmatization model
python -m spacy project assetspython -m spacy project run all
You can edit the config to try out different settings or change the pipeline to your requirements, edit the project.yml file to use different data or add preprocessing steps, and use spacy project push and spacy project pull to persist intermediate results to remote storage and share them amongst your team.
You can help!
We have made this new lemmatizer available through spacy-experimental, our package with experimental spaCy components. In the future, we would like to move the functionality of the edit tree lemmatizer into spaCy. You can help make this happen by trying out the edit tree lemmatizer and posting your experiences and feedback to the spaCy discussion forums.