Many people have asked us to make spaCy available for their language. Since we’re based in Berlin, German was an obvious choice for our first second language. Now spaCy can do all the cool things you use for processing English on German text too. But more importantly, teaching spaCy to speak German required us to drop some comfortable but English-specific assumptions about how language works, and made spaCy fit to learn more languages in the future.
The current release features high-accuracy syntactic dependency parsing, named entity recognition, part-of-speech tagging, token and sentence segmentation, and noun phrase chunking. It also comes with word vector representations, produced with word2vec. As you’ll see below, installation and usage work much the same for both German and English. However, there are some small differences that follow from the two languages’ differing linguistic structure.
On the evolutionary tree of languages, German and English are close cousins, on the Germanic branch of the Indo-European family. They share a relatively recent common ancestor, so they’re structurally similar. And where they differ, it’s mostly English that’s weird, not German. The algorithmic changes needed to process German are an important step towards processing many other languages.
English has very simple rules for word formation (aka morphology), and very strict rules for word order. This means that an English-only NLP system can get away with some very useful simplifying assumptions. German is the perfect language to unwind these. German word order and morphology are still relatively restricted, so we can make the necessary algorithmic changes without being overwhelmed by the additional complexity.
When Germans learn English in school, one of the first things they are taught to memorize is Subject-Verb-Object, or SVO. In English, the subject comes first, then the verb, then the object. If you change that order, the meaning of the sentence changes. "The dog bites the man" means something different from "The man bites the dog", even though both sentences use the exact same words. In German, as in many other languages, this is not the case. German allows for any order of subject and object in a sentence and only restricts the position of the verb. In German, you can say "Der Hund beißt den Mann" and "Den Mann beißt der Hund", and both sentences mean "The dog bites the man".
One of the more difficult things for people who learn German as a second language is figuring out where to put the verb. German verbs usually come at the end of a sentence, but under certain circumstances the verb or a part of it moves to the first or second position. For instance, compare the English sentence in the example below to its German counterpart. While all the parts of the English verb stay together, the German verb is distributed over the sentence. The main part (the one carrying the meaning) is at the end of the sentence and the other part (the auxiliary verb) comes in second position after the subject.
The fact that German verbs come at the end of the sentence, or are split as in the example above, has some implications for language understanding technologies. So far, the syntactic structures that spaCy predicted for English sentences were always projective, which means that you could draw them without ever having to cross two arcs. In order to accommodate languages with less restrictive word order than English — for example German — the parser now also predicts non-projective structures, i.e., structures where arcs may cross.
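To make the idea of crossing arcs concrete, here is a minimal sketch of a projectivity check. It is not spaCy's implementation. The encoding (one head index per token, `-1` for the root), the function names, and the English example with its head annotations are assumptions chosen for illustration. The root is attached to a virtual node to the left of the sentence, a common convention that lets the check catch arcs passing over the root.

```python
from itertools import combinations


def arcs_from_heads(heads):
    """Turn a head list into (left, right) arc spans.

    heads[i] is the index of token i's head; the root has head -1, which
    acts as a virtual node sitting to the left of the first token.
    """
    return [(min(head, dep), max(head, dep)) for dep, head in enumerate(heads)]


def arcs_cross(arc_a, arc_b):
    """Two arcs cross if exactly one endpoint of one lies strictly inside the other."""
    (l1, r1), (l2, r2) = arc_a, arc_b
    return l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1


def is_projective(heads):
    """A tree is projective if no pair of its arcs crosses."""
    return not any(arcs_cross(a, b) for a, b in combinations(arcs_from_heads(heads), 2))


# Illustrative analysis (an assumption, not taken from a treebank):
# "A hearing on the issue is scheduled today"
projective_heads = [1, 5, 1, 4, 2, -1, 5, 6]
# "A hearing is scheduled on the issue today"
# ("on" still attaches to "hearing", "today" to "scheduled")
nonprojective_heads = [1, 2, -1, 2, 1, 6, 4, 3]

print(is_projective(projective_heads))
print(is_projective(nonprojective_heads))
```

In the second sentence the arc from "hearing" to "on" crosses the arc from "scheduled" to "today", so the check reports a non-projective tree.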
To illustrate the difference, consider the example below. We want the syntactic structure to represent the fact that it is the flight that was booked the day before, hence we want the parser to predict an arc between "flight" and "booked". And of course we want the parser to predict the same arc for the German counterpart, in this case between "Flug" and "gebucht habe". However, because the German verb comes last, there can be crossing arcs in the German structure. This is not the only type of German construction that leaves us with a non-projective parse, but it is a frequent one. Moreover, unlike some cases where you could change your linguistic theory to avoid the crossing arcs, this one is well motivated by data and very difficult to avoid without losing information.
To summarize the above, we need non-projective trees to represent the information that we are interested in when parsing natural language. Okay, so we want crossing arcs. What’s the problem? The problem is that crossing arcs force us to give up a very useful constraint. The set of possible non-projective trees is considerably larger than the set of possible projective trees. What’s more, the algorithm spaCy uses to search for a projective tree is both simpler and more efficient than the equivalent non-projective algorithms, so restricting spaCy to projective dependency parsing has given us a win on two fronts: we’ve been able to do less work computationally, while also encoding important prior knowledge about the problem space into the system.
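The size difference between the two search spaces can be checked by brute force for very short sentences. The sketch below enumerates every single-rooted tree over `n` tokens and counts how many are projective. The head-list encoding (`-1` marks the root, treated as a virtual node left of the sentence) and the function names are assumptions made for this illustration, and the enumeration is only feasible for tiny `n`.

```python
from itertools import combinations, product


def is_tree(heads):
    """True if `heads` encodes a single-rooted tree (root marked with -1)."""
    if sum(head == -1 for head in heads) != 1:
        return False
    for tok in range(len(heads)):
        seen = set()
        while tok != -1:      # walk up towards the root marker
            if tok in seen:   # revisiting a token means there is a cycle
                return False
            seen.add(tok)
            tok = heads[tok]
    return True


def is_projective(heads):
    """True if no two arcs cross (the root arc starts at virtual position -1)."""
    arcs = [(min(head, dep), max(head, dep)) for dep, head in enumerate(heads)]
    return not any(
        a < c < b < d or c < a < d < b
        for (a, b), (c, d) in combinations(arcs, 2)
    )


def count_trees(n):
    """Return (number of all trees, number of projective trees) over n tokens."""
    total = projective = 0
    for heads in product(range(-1, n), repeat=n):
        if is_tree(heads):
            total += 1
            projective += is_projective(heads)
    return total, projective


for n in range(1, 6):
    print(n, count_trees(n))
```

For three tokens, 7 of the 9 possible trees are projective, and the gap widens as the sentence gets longer.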
Unfortunately, this “prior knowledge” was never quite true. It’s a simplifying assumption. In the same way that a physicist might assume a frictionless surface or a spherical cow, sometimes it’s useful for computational linguists to assume projective trees and context-free grammars. For English, projective trees are a good-value simplification — the cow of English is not quite a perfect sphere, but it’s close. The cow of German is considerably less round, and we can make our model more accurate by taking this into account.
Luckily for us, the problem of predicting non-projective structures has received a lot of attention over the last decade. One observation that was made early on is that these non-projective arcs are rare. Usually, only a few percent of the arcs in linguistic treebanks are non-projective, even for languages with unrestrictive word order. This means that we can afford to use approaches with a higher worst-case complexity because the worst case basically never occurs and therefore has virtually no impact on the efficiency of our NLP systems.
Several methods have been proposed for dealing with non-projective arcs. Most change the parsing algorithm to search through the full space of possible structures directly, or at least a large part of it. In spaCy, we opted for a more indirect approach. Nivre and Nilsson (2005) propose a simple procedure they call pseudo-projective parsing. The parser is trained on projective structures that are produced from non-projective structures by reattaching the non-projective arcs higher in the tree until they are projective. The original attachment site is encoded in the label of the respective arc. The parser thus learns to predict projective structures with specially decorated arc labels. The output of the parser is then post-processed to reattach decorated arcs to their proper syntactic head according to their arc label, thereby re-introducing non-projective arcs.
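The procedure described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not spaCy's code: trees are head lists with `-1` for a single root, the `||` separator and the label names are made up for the example, and only the "label of the original head" decoration is shown. The reverse step searches breadth-first below the lifted head, which is a heuristic and will not recover every possible tree.

```python
from collections import deque

DELIM = "||"  # illustrative separator between own label and original head's label


def is_descendant(tok, ancestor, heads):
    """Walk up from tok; True if we pass through ancestor before the root marker."""
    while tok != -1:
        if tok == ancestor:
            return True
        tok = heads[tok]
    return False


def is_nonprojective_arc(dep, heads):
    """An arc is non-projective if a token between head and dependent
    is not a descendant of the head."""
    head = heads[dep]
    if head == -1:
        return False
    lo, hi = sorted((head, dep))
    return any(not is_descendant(k, head, heads) for k in range(lo + 1, hi))


def projectivize(heads, labels):
    """Lift non-projective arcs (shortest first) until the tree is projective,
    then record the original head's label on every lifted arc."""
    lifted = list(heads)
    while True:
        bad = [d for d in range(len(lifted)) if is_nonprojective_arc(d, lifted)]
        if not bad:
            break
        dep = min(bad, key=lambda d: abs(d - lifted[d]))
        lifted[dep] = lifted[lifted[dep]]  # reattach to the head's head
    decorated = [
        label if lifted[d] == heads[d] else label + DELIM + labels[heads[d]]
        for d, label in enumerate(labels)
    ]
    return lifted, decorated


def deprojectivize(heads, labels):
    """Reattach decorated arcs to the closest token below the current head
    whose label matches the recorded head label."""
    new_heads, new_labels = list(heads), list(labels)
    for dep, label in enumerate(labels):
        if DELIM not in label:
            continue
        own_label, head_label = label.split(DELIM)
        queue = deque([heads[dep]])
        while queue:
            node = queue.popleft()
            for child in range(len(heads)):
                if heads[child] != node or child == dep:
                    continue
                if labels[child].split(DELIM)[0] == head_label:
                    new_heads[dep] = child
                    queue.clear()
                    break
                queue.append(child)
        new_labels[dep] = own_label
    return new_heads, new_labels


# Illustrative analysis of "A hearing is scheduled on the issue today"
heads = [1, 2, -1, 2, 1, 6, 4, 3]
labels = ["det", "nsubj", "root", "xcomp", "prep", "det", "pobj", "advmod"]
proj_heads, proj_labels = projectivize(heads, labels)
print(proj_heads, proj_labels)
print(deprojectivize(proj_heads, proj_labels))
```

In this toy example both "on" and "today" are lifted to the root verb and receive decorated labels, and the reverse step restores the original heads and labels.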
Using pseudo-projective parsing allows spaCy to produce non-projective structures without having to sacrifice the efficient parsing algorithm, which is restricted to projective structures. And because non-projective arcs are rare, the post-processing step only ever has to reattach one or two arcs in every other sentence, which makes its impact on the overall parsing speed negligible even though its worst-case complexity is higher than the parser’s. In fact, we didn’t notice any difference in speed when parsing German with this approach. And when we know that our training data is projective, we just switch it off.
Pseudo-projective parsing makes a big difference in German because the parser can recover arcs that a purely projective model cannot. The numbers in the table show the percentage of arcs that were correctly attached by the parser: the unlabeled attachment score (UAS) ignores the arc label, while the labeled attachment score (LAS) takes it into account. We train and evaluate the German model on the TiGer treebank (see below).
| System | UAS | LAS |
| --- | --- | --- |
| German, forcing projective structures | 90.86% | 88.60% |
| German, allowing non-projective structures | 92.22% | 90.14% |
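For readers unfamiliar with the two metrics, here is a minimal sketch of how attachment scores can be computed. The function name and the `(head, label)` pair encoding are assumptions for illustration. Real evaluation scripts may differ in details such as how punctuation is treated, so this is not necessarily how the numbers above were produced.

```python
def attachment_scores(gold, predicted):
    """Compute (UAS, LAS) for one or more sentences.

    gold and predicted are equally long lists of (head_index, label) pairs,
    one pair per token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    n = len(gold)
    # UAS: the predicted head is correct, whatever the label
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both the head and the label are correct
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return uas_hits / n, las_hits / n


# Toy example with made-up labels: four tokens, three correct heads,
# of which two also have the correct label.
gold = [(1, "sb"), (-1, "root"), (3, "nk"), (1, "oa")]
pred = [(1, "sb"), (-1, "root"), (3, "mo"), (2, "oa")]
uas, las = attachment_scores(gold, pred)
print(uas, las)
```

Because every arc that counts towards LAS also counts towards UAS, LAS can never exceed UAS, which matches the pattern in the table.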
One other important difference between English and German is the richer morphology of German words. German words can change their form depending on their grammatical function in a sentence. English words do this too, for example by appending an s to a noun to mark plural (ticket → tickets). However, in most languages word forms of the same word show much more variety than in English, and this is also the case in German. German is also famous for its capacity to form really long words, another process that is driven by the morphological system.
While German is clearly a language with rich morphology, morphology isn’t the most crucial aspect for natural language processing of German (depending on the task, of course 😉). Processing languages like Hungarian, Turkish, or Czech is hopeless without a proper treatment of morphological processes, but German can be processed reasonably well without one. We therefore released the German model without a morphological component, for now. We’re working on adding such a component to spaCy, not just to improve the German model but also to take the next step towards learning more languages.
As for English, spaCy now provides a pretrained model for processing German. This model currently provides functionality for tokenization, part-of-speech tagging, syntactic parsing, and named entity recognition. In addition, spacy.de also comes with pretrained word representations, in the form of word vectors and hierarchical cluster IDs.
Installing the German model on your machine is as easy as for English:
Once installed, you can use it from Python like the English model. If you’ve been loading spaCy using the English() class directly, now’s a good time to switch over to the newer
As with English, the German model provides named entities and a noun chunk iterator to extract basic information from the data. The NER model can currently distinguish persons, locations, and organizations. We are looking into ways of extending this to more classes.
The noun chunk iterator provides easy access to base noun phrases in the form of an iterator. The iterator requires the dependency structure to be present and returns all noun phrases that the parser recognized.
The German model comes with word vectors trained on a mix of text from Wikipedia and the Open Subtitles corpus. The vectors were produced using the skip-gram with negative sampling word2vec algorithm using Gensim, with a context window of 2.
You can use the vector representation with the .vector attribute and the .similarity() method on spaCy’s
While we always try to provide good defaults in spaCy, the word2vec family of algorithms gives you a lot of knobs to twiddle, so you might benefit from custom-trained vectors.
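Whichever vectors you use, similarity between two of them is typically measured as the cosine of the angle between them, which is also what spaCy's `.similarity()` computes by default. The sketch below spells out that computation in plain Python. The three-dimensional vectors are made-up values for illustration only; real word vectors have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, not taken from the German model
hund = [0.8, 0.1, 0.3]
katze = [0.7, 0.2, 0.4]
auto = [-0.2, 0.9, 0.1]

print(cosine_similarity(hund, katze))  # close to 1: similar direction
print(cosine_similarity(hund, auto))   # much lower: different direction
```

Because the cosine only depends on direction, scaling a vector does not change its similarity to other vectors.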
With the German parser potentially returning non-projective structures, some assumptions about syntactic structures that would hold for the English parser don’t hold for the German one. For example, the subtree of a particular token doesn’t necessarily span a consecutive substring of the input sentence anymore. Furthermore, a token may have no direct left dependents but can still have a left edge (the left-most descendant of the token) that is further left of the token.
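Both effects can be reproduced on a small hand-built tree. The sketch below uses a plain head list (`-1` marks the root) rather than spaCy objects, and the five-token tree is an abstract example invented for illustration, not a real parse. The helper names loosely mirror spaCy's `Token.subtree`, `Token.lefts`, and `Token.left_edge`, but the code is an independent re-implementation.

```python
def children(tok, heads):
    """Direct dependents of tok, in sentence order."""
    return [dep for dep, head in enumerate(heads) if head == tok]


def subtree(tok, heads):
    """tok plus all of its descendants, sorted by position."""
    result = [tok]
    for child in children(tok, heads):
        result.extend(subtree(child, heads))
    return sorted(result)


def is_contiguous(indices):
    """True if the sorted indices form an unbroken span."""
    return indices == list(range(indices[0], indices[-1] + 1))


def left_children(tok, heads):
    """Direct dependents that stand to the left of tok."""
    return [dep for dep in children(tok, heads) if dep < tok]


def left_edge(tok, heads):
    """Left-most token in tok's subtree."""
    return subtree(tok, heads)[0]


# Abstract non-projective tree: token 1 is the root, token 0 depends on
# token 4, which in turn depends on token 2.
heads = [4, -1, 1, 1, 2]

print(subtree(2, heads))                  # [0, 2, 4]: tokens 1 and 3 are missing
print(is_contiguous(subtree(2, heads)))   # False
print(left_children(2, heads))            # []: no direct left dependents
print(left_edge(2, heads))                # 0: yet the subtree reaches further left
```

Code that slices the input from a token's left edge to its right edge therefore may include tokens that are not part of the subtree, so it is safer to iterate over the subtree itself.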
The German model is trained on the German TiGer treebank converted to dependencies. The language-specific part-of-speech tags use the Stuttgart-Tübingen Tag Set (STTS) (document in German). The model for named entity recognition is trained on the German Named Entity Recognition Data from the TU Darmstadt. For estimating word probabilities we rely on data provided by the COW project. Word vectors and Brown clusters are computed on a combination of the German Wikipedia and the German part of OpenSubtitles2016 which is based on data from opensubtitles.org. Buy these people a beer and a cookie when you meet them 😊
This post was originally published on spacy.io.