Statistical NLP in the Ten Hundred Most Common English Words

by Matthew Honnibal on

Because we've been doing the same thing for a long time, sometimes we get very used to talking about our work in words that most people don't use. So, here's another take on it, in the style of Thing Explainer.

When I was little, my favorite TV shows all had talking computers. Now I'm big and there are still no talking computers. At least, not really talking. We can make them, like, say things — but I want them to tell us things. And I want them to listen, and to read. Why is this so hard?

It turns out that almost anything we say could mean many many different things, but we don't notice because almost all of those meanings would be weird or stupid or just not possible. If I say:

I saw a movie in a dress

Would you ever ask me,

“Were you in the dress, or was the movie in the dress?”

It's weird to even think of that. But a computer just might, because there are other cases like:

The TV showed a girl in a dress

Where the words hang together in the other way. People used to think that the answer was to tell the computer lots and lots of facts. But then you wake up one day and you're writing facts like “movies do not wear dresses”, and you wonder where it all went wrong. Actually it's even worse than that. Not only are there too many facts, most of them are not even really facts! People really tried this. We've found that the world is made up of ifs and buts.

Unconstrained Vocabulary: If you have a fixed constraint like “People wear dresses”, and “Movies are not people”, how does the system cope when someone talks about dressing a script? Even if nobody has ever said this before, someone might in future. Language is creative, and exceptions are the rule.

These days we just show the computer lots and lots and lots of words. We gave up trying to get it to understand what a “dress” is. We let “dress” be just some letters. But if the computer sees it around “girl” enough times (which is just some other letters, which are seen around some other other letters), it can make good guesses.
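Unconstrained Vocabulary: Here is a small sketch of that idea in Python. It is not how any real system works. The sentences are made up, and the names `count_pairs` and `guess_owner` and the `window` size are only assumptions for showing the idea. It counts how often two words turn up near each other, then guesses that the dress goes with whichever word it has been seen near more often.

```python
from collections import Counter

# A tiny made-up set of sentences. A real system would read far, far more.
corpus = [
    "the girl wore a dress",
    "a girl in a dress waved",
    "the girl in the red dress smiled",
    "we saw a movie in the park",
    "the movie in the park was long",
    "i saw a girl at the movie",
]


def count_pairs(sentences, window=3):
    """Count how often two words appear within `window` words of each other."""
    pairs = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, left in enumerate(words):
            # Look only at the next `window` words after `left`.
            for right in words[i + 1 : i + 1 + window]:
                # frozenset ignores order: (girl, dress) == (dress, girl).
                pairs[frozenset((left, right))] += 1
    return pairs


def guess_owner(pairs, thing, choices):
    """Pick the choice that has been seen near `thing` most often."""
    return max(choices, key=lambda choice: pairs[frozenset((choice, thing))])


pairs = count_pairs(corpus)
print(guess_owner(pairs, "dress", ["movie", "girl"]))
```

To this code, “dress” and “girl” are still just letters. It never learns that movies do not wear dresses. It only sees that “movie” and “dress” were never found close together in the sentences it was shown.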

It doesn't always guess right, but we can tell how often it does, and we can think of ways to help it learn better. We have a number, and we can slowly make it bigger, a little bit by a little bit.
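Unconstrained Vocabulary: The simplest such number is accuracy, the share of guesses that match answers people have written down. A minimal sketch follows. The function name `accuracy` and the example guesses and answers are made up for illustration, and real benchmarks often use other measures.

```python
def accuracy(guesses, answers):
    """Return the share of guesses that match the known right answers."""
    if len(guesses) != len(answers):
        raise ValueError("need one guess for each answer")
    right = sum(1 for guess, answer in zip(guesses, answers) if guess == answer)
    return right / len(answers)


# Made-up example: four guesses checked against four known answers.
guesses = ["girl", "movie", "girl", "girl"]
answers = ["girl", "girl", "girl", "boy"]
score = accuracy(guesses, answers)
print(score)
```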

(One thing I've learned is, people are great at making a number bigger, if you pay a lot of them to try. The key is to pick numbers where, if they make the number bigger, they can't help but have done something actually good. This is harder than it sounds. Some say no numbers are like this. I ask them to show me much good being done another way, but they never can.)

Unconstrained Vocabulary: The potential problem with focusing on a benchmark task is Goodhart's Law. The AI community is conscious of the problem and has done well at averting it.

Instead of telling the computer facts, what we needed to do was tell it how to learn.

The ideas we come up with for getting the computer to talk, listen or read a little better can be used to get it to see or plan a little better, and the other way around. Once we stopped telling it things like “movies do not wear dresses”, things really took off.

Each bit of work still only makes our numbers a little bit bigger, and the bigger the numbers go, the harder they are to raise. But that is a good problem to have. Now that computers can read quite well, I think we should be able to do pretty great things. What should we get them to read?

About the Author

Matthew Honnibal

Matthew is a leading expert in AI technology, known for his research, software and writings. He completed his PhD in 2009, and spent a further 5 years publishing research on state-of-the-art natural language understanding systems. Anticipating the AI boom, he left academia in 2014 to develop spaCy, an open-source library for industrial-strength NLP.
