A syntactic parser describes a sentence’s grammatical structure, to help another application reason about it. Natural languages introduce many unexpected ambiguities, which our world-knowledge immediately filters out. A favourite example:
A correct parse links “with” to “pizza”, while an incorrect parse links “with” to “eat”:
The Natural Language Processing (NLP) community has made big progress in syntactic parsing over the last few years. It’s now possible for a tiny Python implementation to perform better than the widely-used Stanford PCFG parser.
Note: I wasn’t really sure how to count the lines of code in the Stanford parser. Its jar file ships over 200k, but there are a lot of different models in it. It’s not important, but it’s certainly over 4k.
The rest of the post sets up the problem, and then takes you through a concise implementation, prepared for this post. The first 200 lines of parser.py, the part-of-speech tagger and learner, are described here. You should probably at least skim that post before reading this one, unless you’re very familiar with NLP research.
The Cython system, Redshift, was written for my current research. I plan to improve it for general use in June, after my contract ends at Macquarie University. The current version is hosted on GitHub.
It’d be nice to type an instruction like this into your phone:
And have it set the appropriate policy. On Android you can do this sort of thing with Tasker, but an NL interface would be much better. It’d be especially nice to receive a meaning representation you could edit, so you could see what it thinks you said, and correct it.
There are lots of problems to solve to make that work, but some sort of syntactic representation is definitely necessary. We need to know that:
is another way of phrasing the first instruction, while:
means something completely different.
A dependency parser returns a graph of word-word relationships, intended to make such reasoning easier. Our graphs will be trees — edges will be directed, and every node (word) will have exactly one incoming arc (one dependency, with its head), except one.
The idea is that it should be slightly easier to reason from the parse than it was from the string. The parse-to-meaning mapping is hopefully simpler than the string-to-meaning mapping.
The most confusing thing about this problem area is that “correctness” is defined by convention — by annotation guidelines. If you haven’t read the guidelines and you’re not a linguist, you can’t tell whether the parse is “wrong” or “right”, which makes the whole task feel weird and artificial.
For instance, there’s a mistake in the parse above: “John’s school calls” is structured wrongly, according to the Stanford annotation guidelines. The structure of that part of the sentence is how the annotators were instructed to parse an example like “John’s school clothes”.
It’s worth dwelling on this point a bit. We could, in theory, have written our guidelines so that the “correct” parses were reversed. There’s good reason to believe the parsing task would be harder if we reversed our convention, as it’d be less consistent with the rest of the grammar. But we could test that empirically, and we’d be pleased to gain an advantage by reversing the policy.
We definitely do want that distinction in the guidelines — we don’t want both to receive the same structure, or our output will be less useful. The annotation guidelines strike a balance between what distinctions downstream applications will find useful, and what parsers will be able to predict easily.
There’s a particularly useful simplification that we can make, when deciding what we want the graph to look like: we can restrict the graph structures we’ll be dealing with. This doesn’t just give us a likely advantage in learnability; it can have deep algorithmic implications. We follow most work on English in constraining the dependency graphs to be projective trees:
- Tree. Every word has exactly one head, except for the dummy ROOT symbol.
- Projective. For every pair of dependencies spanning word positions (a1, a2) and (b1, b2), with a1 < a2 and b1 < b2: if a1 < b1 < a2, then b2 <= a2. In other words, dependencies cannot “cross”. You can’t have a pair of dependencies that goes a1 b1 a2 b2, or b1 a1 b2 a2.
There’s a rich literature on parsing non-projective trees, and a smaller literature on parsing DAGs. But the parsing algorithm I’ll be explaining deals with projective trees.
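To make the projectivity constraint concrete, here is a minimal sketch of a check over a heads array. The function name is_projective, and the convention that None marks a word without a head, are assumptions made for illustration; they are not taken from parser.py.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the index of the head of word i; None marks a word
    with no head (for example a padding slot). Hypothetical helper.
    """
    # Represent each arc by its (left, right) endpoints in sentence order.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads) if h is not None]
    for a1, a2 in arcs:
        for b1, b2 in arcs:
            # Arc b starts strictly inside arc a and ends strictly outside it.
            if a1 < b1 < a2 < b2:
                return False
    return True
```

The check is quadratic in the number of arcs, which is fine for a sanity check on training data but is not how a parser would enforce the constraint.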
Our parser takes as input a list of string tokens, and outputs a list of head indices, representing edges in the graph. If the ith member of heads is j, the dependency parse contains an edge (j, i). A transition-based parser is a finite-state transducer; it maps an array of N words onto an output array of N head indices:
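As a concrete illustration, assume index 0 is a padding slot so that real words start at index 1, and assume a head index of 0 marks the root of the sentence. The sentence and the particular head indices are illustrative assumptions.

```python
# Index 0 is a padding slot (an assumption), so real words start at index 1.
words = [None, 'MSNBC', 'reported', 'that', 'Facebook', 'bought',
         'WhatsApp', 'for', '$16bn', '.']
# heads[i] is the index of the head of words[i]; 0 marks the root (assumed).
heads = [None, 2, 0, 5, 5, 2, 5, 5, 7, 2]

# Recover the (head, child) edges of the tree from the heads array.
edges = [(h, i) for i, h in enumerate(heads) if h is not None]
```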
The heads array denotes that the head of MSNBC is reported: MSNBC is word 1, and reported is word 2, and heads[1] == 2. You can already see why parsing a tree is handy: this data structure wouldn’t work if we had to output a DAG, where words may have multiple heads.
Although heads can be represented as an array, we’d actually like to maintain some alternate ways to access the parse, to make it easy and efficient to extract features. Our Parse class looks like this:
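A minimal sketch of such a class, assuming we want the heads array plus per-word lists of left and right children. The attribute names lefts and rights and the method name add_arc are illustrative choices here, not necessarily those used in parser.py.

```python
class Parse:
    """Heads array plus per-word lists of left and right children."""

    def __init__(self, n):
        self.n = n
        self.heads = [None] * n
        # lefts[h] and rights[h] hold the children of word h on each side,
        # in the order they were attached.
        self.lefts = [[] for _ in range(n)]
        self.rights = [[] for _ in range(n)]

    def add_arc(self, head, child):
        """Record that `head` is the head of `child`."""
        self.heads[child] = head
        if child < head:
            self.lefts[head].append(child)
        else:
            self.rights[head].append(child)
```

Keeping the child lists alongside the heads array means features such as "number of left children of s0" can be read off without scanning the whole sentence.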
As well as the parse, we also have to keep track of where we’re up to in the sentence. We’ll do this with an index into the words array, and a stack, to which we’ll push words, before popping them once their head is set. So our state data structure is fundamentally:
- An index, i, into the list of tokens;
- The dependencies added so far, in Parse;
- A stack, containing words that occurred before i, for which we’re yet to assign a head.
Each step of the parsing process applies one of three actions to the state:
The LEFT and RIGHT actions add dependencies and pop the stack, while SHIFT pushes the stack and advances i into the buffer.
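The three actions can be sketched as a single function that mutates the stack and a heads list and returns the new buffer index. This is a sketch under assumptions: it stores dependencies in a plain list rather than a Parse object, and the name transition and the integer codes for the moves are illustrative.

```python
SHIFT, RIGHT, LEFT = 0, 1, 2

def transition(move, i, stack, heads):
    """Apply one move in place and return the new buffer index."""
    if move == SHIFT:
        stack.append(i)            # push the next buffer word
        return i + 1
    if move == RIGHT:
        child = stack.pop()
        heads[child] = stack[-1]   # head is the word now on top of the stack
        return i
    if move == LEFT:
        child = stack.pop()
        heads[child] = i           # head is the first word of the buffer
        return i
    raise ValueError("Unknown move: %r" % move)
```

Note that RIGHT needs at least two words on the stack and LEFT needs at least one, so a caller has to filter out invalid moves before applying them.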
So, the parser starts with an empty stack, and a buffer index at 0, with no dependencies recorded. It chooses one of the (valid) actions, and applies it to the state. It continues choosing actions and applying them until the stack is empty and the buffer index is at the end of the input. (It’s hard to understand this sort of algorithm without stepping through it. Try coming up with a sentence, drawing a projective parse tree over it, and then try to reach the parse tree by choosing the right sequence of transitions.)
Here’s what the parsing loop looks like in code:
We start by tagging the sentence, and initializing the state. We then map the state to a set of features, which we score using a linear model. We then find the best-scoring valid move, and apply it to the state.
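The loop just described can be sketched as follows. Several things here are assumptions for the sake of a self-contained example: tagging is left out, score_moves is a hypothetical callable standing in for feature extraction plus the linear model, and index n (one past the last word) is treated as a dummy ROOT that always sits at the end of the buffer.

```python
SHIFT, RIGHT, LEFT = 0, 1, 2

def valid_moves(i, n, stack_depth):
    moves = []
    if i < n:
        moves.append(SHIFT)   # a real word is still in the buffer
    if stack_depth >= 2:
        moves.append(RIGHT)   # needs a word beneath the top of the stack
    if stack_depth >= 1:
        moves.append(LEFT)    # the buffer front (or ROOT) becomes the head
    return moves

def apply_move(move, i, stack, heads):
    if move == SHIFT:
        stack.append(i)
        return i + 1
    child = stack.pop()
    heads[child] = stack[-1] if move == RIGHT else i
    return i

def greedy_parse(words, score_moves):
    """score_moves(words, i, stack, heads) -> {move: score} (hypothetical)."""
    n = len(words)            # index n stands for the dummy ROOT (assumption)
    heads = [None] * n
    stack, i = [], 0
    while stack or i < n:
        scores = score_moves(words, i, stack, heads)
        best = max(valid_moves(i, n, len(stack)), key=lambda m: scores[m])
        i = apply_move(best, i, stack, heads)
    return heads
```

Because every non-final state has at least one valid move, and each move either consumes a buffer word or pops the stack, the loop takes exactly 2n steps for a sentence of n words.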
The model scoring works the same as it did in the POS tagger. If you’re confused about the idea of extracting features and scoring them with a linear model, you should review that post. Here’s a reminder of how the model scoring works:
It’s just summing the class-weights for each feature. This is often expressed as a dot-product, but when you’re dealing with multiple classes, that gets awkward, I find.
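A minimal sketch of that summation, assuming the weights are stored as a dict mapping each feature to a dict of per-class weights. The function signature is an illustrative choice.

```python
def score(weights, classes, features):
    """Sum, for each class, the weights of the active features.

    weights maps feature -> {class: weight};
    features maps feature -> value (usually 1).
    """
    scores = {clas: 0.0 for clas in classes}
    for feat, value in features.items():
        # Skip inactive features and features never seen in training.
        if value == 0 or feat not in weights:
            continue
        for clas, weight in weights[feat].items():
            scores[clas] += value * weight
    return scores
```

Storing weights feature-first means we only touch the classes that a feature actually has weights for, which is what makes the dot-product view awkward and the nested-dict view natural.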
The beam parser (Redshift) tracks multiple candidates, and only decides on the best one at the very end. We’re going to trade away accuracy in favour of efficiency and simplicity. We’ll only follow a single analysis. Our search strategy will be entirely greedy, as it was with the POS tagger. We’ll lock in our choices at every step.
If you read the POS tagger post carefully, you might see the underlying similarity. What we’ve done is mapped the parsing problem onto a sequence-labelling problem, which we address using a “flat”, or unstructured, learning algorithm (by doing greedy search).
Feature extraction code is always pretty ugly. The features for the parser refer to a few tokens from the context:
- The first three words of the buffer (n0, n1, n2);
- The top three words of the stack (s0, s1, s2);
- The two leftmost children of s0 (s0b1, s0b2);
- The two rightmost children of s0 (s0f1, s0f2);
- The two leftmost children of n0 (n0b1, n0b2).
For these 12 tokens, we refer to the word-form, the part-of-speech tag, and the number of left and right children attached to the token.
Because we’re using a linear model, we have our features refer to pairs and triples of these atomic properties.
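To give a flavour of what this looks like, here is a deliberately small sketch that uses only three of the context tokens (s0, n0, n1) and a handful of feature templates. The function name, the feature-name strings and the argument layout are all assumptions for illustration; a real feature set would cover all 12 tokens and many more combinations.

```python
def extract_features(words, tags, i, stack, lefts, rights):
    """Map a parser state to {feature name: 1} (small illustrative subset)."""

    def word_and_tag(idx):
        # Out-of-range positions (empty stack, end of buffer) get a dummy value.
        if idx is None or not 0 <= idx < len(words):
            return '<none>', '<none>'
        return words[idx], tags[idx]

    s0 = stack[-1] if stack else None
    s0_w, s0_t = word_and_tag(s0)
    n0_w, n0_t = word_and_tag(i)
    n1_w, n1_t = word_and_tag(i + 1)
    # Valency: how many left and right children s0 has collected so far.
    s0_lv = len(lefts[s0]) if s0 is not None else 0
    s0_rv = len(rights[s0]) if s0 is not None else 0

    features = {'bias': 1}

    def add(name, *values):
        features['%s=%s' % (name, ' '.join(str(v) for v in values))] = 1

    # Atomic properties.
    add('s0w', s0_w)
    add('s0t', s0_t)
    add('n0w', n0_w)
    add('n0t', n0_t)
    # Pairs and triples, so the linear model can see conjunctions.
    add('s0w+n0w', s0_w, n0_w)
    add('s0t+n0t', s0_t, n0_t)
    add('s0t+n0t+n1t', s0_t, n0_t, n1_t)
    add('s0t+s0lv+s0rv', s0_t, s0_lv, s0_rv)
    return features
```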
Weights are learned using the same algorithm, averaged perceptron, that we used for part-of-speech tagging. Its key strength is that it’s an online learning algorithm: examples stream in one-by-one, we make our prediction, check the actual answer, and adjust our beliefs (weights) if we were wrong.
The training loop looks like this:
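A minimal sketch of one training pass over a single sentence. To keep it self-contained, it makes several assumptions: the perceptron update is the plain, unaveraged one (the averaging step is omitted for brevity), and the state handling is passed in as four hypothetical callables (extract, valid_moves, gold_moves, apply_move) rather than being tied to any particular implementation.

```python
from collections import defaultdict

class Perceptron:
    """Plain multi-class perceptron (averaging omitted for brevity)."""

    def __init__(self, classes):
        self.classes = classes
        self.weights = defaultdict(lambda: defaultdict(float))

    def score(self, features):
        scores = {c: 0.0 for c in self.classes}
        for feat, value in features.items():
            for c, w in self.weights.get(feat, {}).items():
                scores[c] += value * w
        return scores

    def update(self, truth, guess, features):
        if truth == guess:
            return
        for feat, value in features.items():
            self.weights[feat][truth] += value
            self.weights[feat][guess] -= value

def train_one(model, n, extract, valid_moves, gold_moves, apply_move):
    """Train on one sentence of n words; returns the number of wrong moves.

    All four callables are hypothetical hooks supplied by the caller.
    """
    heads = [None] * n
    stack, i = [], 0
    errors = 0
    while stack or i < n:
        features = extract(i, stack, heads)
        scores = model.score(features)
        valid = valid_moves(i, n, len(stack))
        gold = gold_moves(i, n, stack, heads)
        guess = max(valid, key=lambda m: scores[m])
        best = max(gold, key=lambda m: scores[m])
        model.update(best, guess, features)
        errors += guess not in gold
        # Follow the model's own prediction, not the gold move.
        i = apply_move(guess, i, stack, heads)
    return errors
```

The important design choice is the last line of the loop: the state advances along the predicted move, so later training examples come from states the parser would actually reach at test time.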
The most interesting part of the training process is in how the gold-standard moves are chosen. The performance of our parser is made possible by an advance by Goldberg and Nivre (2012), who showed that we’d been doing this wrong for years.
In the POS-tagging post, I cautioned that during training you need to make sure you pass in the last two predicted tags as features for the current tag, not the last two gold tags. At test time you’ll only have the predicted tags, so if you base your features on the gold sequence during training, your training contexts won’t resemble your test-time contexts, so you’ll learn the wrong weights.
In parsing, the problem was that we didn’t know how to pass in the predicted sequence! Training worked by taking the gold-standard tree, and finding a transition sequence that led to it. That is, you got back a sequence of moves, with the guarantee that if you followed those moves, you’d get the gold-standard dependencies.
The problem is, we didn’t know how to define the “correct” move to teach a parser to make if it was in any state that wasn’t along that gold-standard sequence. Once the parser had made a mistake, we didn’t know how to train from that example.
That was a big problem, because it meant that once the parser started making mistakes, it would end up in states unlike any in its training data – leading to yet more mistakes. The problem was specific to greedy parsers: once you use a beam, there’s a natural way to do structured prediction.
The solution seems obvious once you know it, like all the best breakthroughs. What we do is define a function that asks “How many gold-standard dependencies can be recovered from this state?”. If you can define that function, then you can apply each move in turn, and ask, “How many gold-standard dependencies can be recovered from this state?”. If the action you applied allows fewer gold-standard dependencies to be reached, then it is sub-optimal.
That’s a lot to take in.
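So we have this function Oracle(state), which tells us how many gold-standard dependencies can still be recovered from a state.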
We also have a set of actions, each of which returns a new state. We want to know:
shift_cost = Oracle(state) – Oracle(shift(state))
right_cost = Oracle(state) – Oracle(right(state))
left_cost = Oracle(state) – Oracle(left(state))
Now, at least one of those costs has to be zero. Oracle(state) is asking, “what’s the cost of the best path forward?”, and the first action of that best path has to be shift, right, or left.
It turns out that we can derive Oracle fairly simply for many transition systems. The derivation for the transition system we’re using, Arc Hybrid, is in Goldberg and Nivre (2013).
We’re going to implement the oracle as a function that returns the zero-cost moves, rather than implementing a function Oracle(state). This prevents us from doing a bunch of costly copy operations. Hopefully the reasoning in the code isn’t too hard to follow, but you can also consult Goldberg and Nivre’s papers if you’re confused and want to get to the bottom of this.
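Here is a sketch of what such a zero-cost-moves function might look like for Arc Hybrid, following the cost definitions as I understand them from Goldberg and Nivre (2013). The conventions are assumptions: gold[w] is the gold head of word w, index n (one past the last word) stands for a dummy ROOT at the end of the buffer, and the name zero_cost_moves is illustrative. It is not the code from parser.py.

```python
SHIFT, RIGHT, LEFT = 0, 1, 2

def zero_cost_moves(i, n, stack, gold):
    """Return the valid moves that lose no still-reachable gold arc.

    gold[w] is the gold head of word w; index n stands for ROOT,
    which sits at the end of the buffer (assumed convention).
    """
    moves = []
    s0 = stack[-1] if stack else None
    s1 = stack[-2] if len(stack) >= 2 else None
    if i < n:
        # Shifting n0 loses its head if that head lies below s0 on the
        # stack, and loses every dependent of n0 that is on the stack.
        cost = int(gold[i] in stack[:-1]) + sum(1 for s in stack if gold[s] == i)
        if cost == 0:
            moves.append(SHIFT)
    if s0 is not None:
        # Popping s0 loses any dependents of s0 still in the buffer.
        lost_deps = sum(1 for k in range(i, n) if gold[k] == s0)
        # RIGHT also loses s0's head if that head is anywhere in the buffer.
        if s1 is not None and lost_deps == 0 and gold[s0] < i:
            moves.append(RIGHT)
        # LEFT also loses s0's head if it is s1 or lies beyond n0 in the buffer.
        if lost_deps == 0 and gold[s0] != s1 and gold[s0] <= i:
            moves.append(LEFT)
    return moves
```

Each branch only inspects the gold heads of words near the stack top and buffer front, so no state is ever copied, which is the point of returning the zero-cost set directly.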
Doing this “dynamic oracle” training procedure makes a big difference to accuracy — typically 1-2%, with no difference to the way the run-time works. The old “static oracle” greedy training procedure is fully obsolete; there’s no reason to do it that way any more.
I have the sense that language technologies, particularly those relating to grammar, are particularly mysterious. I can imagine having no idea what the program might even do.
I think it therefore seems natural to people that the best solutions would be overwhelmingly complicated. A 200,000 line Java package feels appropriate.
But algorithmic code is usually short, when only a single algorithm is implemented. And when you only implement one algorithm, and you know exactly what you want to write before you write a line, you also don’t pay for any unnecessary abstractions, which can have a big performance impact.
For a long time, incremental language processing algorithms were primarily of scientific interest. If you want to write a parser to test a theory about how the human sentence processor might work, well, that parser needs to build partial interpretations. There’s a wealth of evidence, including commonsense introspection, that establishes that we don’t buffer input and analyse it once the speaker has finished.
But now algorithms with that neat scientific feature are winning! As best as I can tell, the secret to that success is to be:
- Incremental. Earlier words constrain the search.
- Error-driven. Training involves a working hypothesis, which is updated as it makes mistakes.
The links to human sentence processing seem tantalising. I look forward to seeing whether these engineering breakthroughs lead to any psycholinguistic advances.
The results at the start of the post refer to Section 22 of the Wall Street Journal corpus. The Stanford parser was run as follows:
A small post-process was applied, to undo the fancy tokenisation Stanford adds for numbers, to make them match the PTB tokenisation:
The resulting PTB-format files were then converted into dependencies using the Stanford converter:
I can’t easily read that anymore, but it should just convert every .mrg file in a folder to a CoNLL-format Stanford basic dependencies file, using the settings common in the dependency literature.
I then converted the gold-standard trees from WSJ 22, for the evaluation. Accuracy scores refer to unlabelled attachment score (i.e. the head index) of all non-punctuation tokens.
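For reference, the metric can be sketched as below. How punctuation tokens are identified is left to the caller here (the is_punct flags are an assumption), since the exact rule used in the evaluation isn’t spelled out.

```python
def unlabelled_attachment_score(gold_heads, pred_heads, is_punct):
    """Fraction of non-punctuation tokens whose predicted head index
    matches the gold-standard head index."""
    correct = total = 0
    for gold, pred, punct in zip(gold_heads, pred_heads, is_punct):
        if punct:
            continue
        total += 1
        correct += gold == pred
    return correct / total if total else 0.0
```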
To train parser.py, I fed the gold-standard PTB trees for WSJ 02-21 into the same conversion script.
In a nutshell: The Stanford model and parser.py are trained on the same set of sentences, and they each make their predictions on a held-out test set, for which we know the answers. Accuracy refers to how many of the words’ heads we got correct.
Speeds were measured on a 2.4GHz Xeon. I ran the experiments on a server, to give the Stanford parser more memory. The parser.py system runs fine on my MacBook Air. I used PyPy for the parser.py experiments; CPython was about half as fast on an early benchmark.
One of the reasons parser.py is so fast is that it does unlabelled parsing. Based on previous experiments, a labelled parser would likely be about 40x slower, and about 1% more accurate. Adapting the program to labelled parsing would be a good exercise for the reader, if you have access to the data.
The result from the Redshift parser was produced from commit
which was run as follows: