We’ve been running Explosion for about five years now, which has given us a lot of insights into what Natural Language Processing looks like in industry contexts. In this blog post, I’m going to discuss some of the biggest challenges for applied NLP and translating business problems into machine learning solutions.
When you’re starting out in the field and are facing real problems to solve, it’s easy to feel a bit lost. Even though you understand the fundamentals of machine learning, know your way around the industry-standard libraries and have experience in programming and training your own models, you might still feel like something is missing. The right intuition, the right mindset, a different way to reason about what to do. I call this “Applied NLP Thinking”.
It’s a fairly abstract idea, but while I was writing this, I think I came up with a pretty fitting analogy. Like most of the world, I spent even more time indoors in 2020 than I usually do. Maybe I just missed restaurants, but for a while, I got really into watching cooking shows. I was particularly interested in the business side of running a restaurant, and how it ties in with the actual craft of cooking itself.
There’s a lot to learn when it comes to cooking: you need to know many skills and techniques, and be able to reproduce recipes and use all kinds of different equipment. You need to understand the ingredients and how to present your dish, and on top of that, there are fundamentals of nutrition, chemistry and food science. You don’t need to be a food scientist to be a good cook, but understanding how egg, mustard and oil turn into mayonnaise certainly helps.
But there’s more to a restaurant than just the cooking. A restaurant kitchen is basically a little factory. In order to give your factory any chance of success, the menu needs to be designed carefully. The dishes and flavors need to work together and you need to be able to source the ingredients reliably. Every dish needs to be prepared consistently, every single time, without mistakes. The menu also needs to serve the market well and allow a sufficient profit margin so the restaurant can be successful.
Now, I’m not a chef, but I’d be surprised if someone could design a good restaurant menu without also knowing how to cook. You can’t abstract away the details of how everything is done and what adaptions you could make. All of those details matter. But the activity of designing the menu is also quite different from implementing it. The activity of deciding what to do is different from just carrying out the plan.
Applied NLP is more about deciding what to cook, and less about how to cook it. Actually, a big part is even deciding whether to cook – finding the right projects where NLP might be feasible and productive. The process of understanding the project requirements and translating them into the system design is harder to learn because you can’t really get to the “what” before you have a good grasp of the “how”.
The “how” is everything that helps you execute the plan. This includes knowing how to implement models and how they work and various machine learning fundamentals that help you understand what’s going on under the hood. It also includes knowing how to train and evaluate your models, and what to do to improve your results. And of course, you should be familiar with the standard libraries and proficient at programming and software engineering more generally. All of this is the “how”, and it’s what you can learn from books and courses.
The “what” is translating the application goals into your machine learning requirements, to design what the system should do and how you’re going to evaluate it. It includes deciding when to use machine learning in the first place, and whether to use other approaches like rule-based systems instead. It also includes choosing the types of components and models to train that are most likely to get the job done. This requires a deep understanding of what the outputs will be used for in the larger application context. You also need to be able to find the right trade-offs, for instance between speed and accuracy or convenience and flexibility. This includes knowing what resources and libraries are available, and what to use when. The “what” is what matters most for applied NLP – and you can’t solve it without the “how”. But just the “how“ itself is not enough.
To give you an example, here’s an application context you may encounter. Imagine you’re working at a large company, and the IT desk gets a lot of support tickets. Your organization wants the tickets to be answered more quickly and efficiently, and they want to understand what the most common problems are to better prevent them in the future. There are lots of different ways you could address these requirements with NLP. So where do you start?
Maybe you could sort the support tickets into categories, by type of problem, and try to predict that? Or cluster them first, and see if the clustering ends up being useful to determine who to assign a ticket to? If you have data about how long it takes to resolve tickets, maybe you can do regression on that — having cost estimation on tickets can be really helpful in balancing work queues, staffing, or maybe just setting expectations. You could also try and extract key phrases that are likely indicators of a problem. If you can predict those, it could help with pre-sorting the tickets, and you’d be able to point out specific references. However, the boundaries are very unclear, and the key phrases are possibly disjoint. The less consistent your data, the harder it will be to train a model. So you might need a different approach.
Maybe there’s a way to jumpstart the process with custom embeddings that are more specific to the domain and terminology, to squeeze out a higher accuracy. But the system also needs to be useful, so how are you going to measure that? Is your organization tracking the response times, so you can test your NLP system against that? At the end of your brainstorming session, you may end up with a preliminary to-do list: start with the label scheme and get the IT support team involved so you can explain what’s possible to automate and what isn’t. And maybe train a text classifier baseline, bootstrapped with some keywords.
It’s really the answer to the “what” question that will determine whether the project succeeds or fails. It doesn’t matter whether you build a model that achieves high accuracy on your test data if you’ve framed the problem in a way that doesn’t actually address the relevant problems for the application. So how can you get better at answering the “what”?
First, you can’t do applied NLP without thinking about the application. You need to really engage with the purpose of the system you’re trying to build. You can’t just say, “Product decisions are the product people’s job” – unless the “product people” know more about NLP than you do.
You also shouldn’t expect the “what” to be trivial. A problem we see sometimes is that people assume that the “what” is trivial, simply because there’s not much discussion of it, and all you ever hear about is the “how”. If you assume that translating application requirements into a machine learning design is really easy and that the problem is really easy, you may also assume that the first idea that comes to mind is probably the correct one. That can cause some expensive mistakes. Instead, it’s better to assume that the first idea you have is probably not ideal, or might not work at all. It can take some iteration to come up with a successful design.
So how do you know your design is successful? In applied NLP, it’s important to pay attention to the difference between utility and accuracy. “Accuracy” here stands for any objective score you can calculate on a test set — even if the calculation involves some manual effort, like it does for human quality assessments. This is a figure you can track, and if all goes well, that figure should go up. In contrast, the “utility” of the model is its impact in the application or project. This is a lot harder to measure, and always depends on the application context.
Hopefully, your evaluation metric should be at least correlated with utility — if it’s not, you’re really in trouble. But the correlation doesn’t have to be perfect, nor does the relationship have to be linear. I’m using “utility” here in the same sense it’s used in economics or ethics. In economics it’s important to introduce this idea of “utility” to remind people that money isn’t everything. In applied NLP, or applied machine learning more generally, we need to point out that the evaluation measure isn’t everything.
Evaluation measures are important for both research and applications, but their role isn’t exactly the same. It comes back to the different purposes that you have. Research is basically about building a commons of knowledge, and the idea is that this commons overall should be useful. For that to happen, research needs to generalize at least a little bit. It can’t be specific to just one situation. Application is kind of the inverse: you take the general, and adapt it to your specific context.
Another difference is that in research, you’re mostly concerned with figuring out whether your conclusions are true, and maybe quantifying uncertainty about that. In an application, the bottom-line is whether the project is useful. So when you conduct an evaluation in research, you’re trying to isolate your new idea, and you usually want to evaluate exactly the same way as prior work. That way you can test whether your change did what you were hoping it would do. In an application, you’re mostly using the evaluation to choose which systems to try out in production. You’re hoping that a system that scores better on your evaluation should be better in your application. In other words, you’re using the evaluation as a proxy for utility — you’re hoping that the two are well correlated. Often they really are! But you also get to choose the evaluation — that’s a totally legitimate and useful thing to do. In research, changing the evaluation is really painful, because it makes it much harder to compare to previous work.
What “utility” means is different for every application. In our example of a system for processing IT support tickets, a successful application will result in faster processes, as it allows the right person to be assigned quicker. In turn, this can lead to a higher satisfaction: users with problems get help quicker, and IT support staff has to spend less time waiting around, or dealing with tickets outside their core expertise. Categorizing tickets automatically also helps with discovering common problems and knowing what to fix. All of this can lead to higher revenue, or save the company a bunch of money.
People often discuss machine learning as a cost-cutting measure – it can “replace human labour”. But it’s pretty difficult to make this actually work well, because reliability is often much more important than price. If you’ve got a plan that consists of lots of separate steps, the total cost is going to be additive, while the risk multiplies. Something going wrong at any of the steps is going to affect the whole plan. So for applied NLP within business processes, the utility is mostly about reducing variance. Humans and models both make mistakes, but they’re very different mistakes. Humans can easily catch mistakes made by a model, and a model can be great at correcting human errors caused by inattention.
Without the idea of “utility”, it’s hard to talk about why you would prefer one evaluation over another. Let’s say you have two evaluation metrics and they result in different orderings over systems you’ve trained. Which evaluation metric should you prefer? The answer is, “whichever one is better correlated with utility”.
We’ve seen that for applied NLP, it’s really important to think about what to do, not just how to do it. And we’ve seen that we can’t get too focussed on just optimizing an evaluation figure — we need to think about utility. So how can we give people a better toolkit to do this? What should we teach people to make their NLP applications more likely to succeed? I think linguistics is an important part of the answer here, that’s often neglected.
There are lots of different areas that someone doing NLP can focus on, either to work on research or applications. If you’ve ever played a role-playing game, you can think of it like investing points in different skill trees. Let’s say you want to build your character up for “applied NLP” and you know you don’t want to do research. How should you allocate your points?
A common suggestion would be to rank linguistics as the lowest priority. And if you’re going to be allocating your first points, it’s definitely more important to know a bit about programming and machine learning. As for libraries and APIs, this one is a bit special: it turns out the game just gives you points here as you play. You’ll find yourself accidentally fluent in whatever tools are required for your work, so no need to spend any of your points on this explicitly.
For applied NLP, a little bit of linguistics knowledge can go a long way and prevent some expensive mistakes. I’m not saying that you should sink all of your points into maxing out on linguistics – there are diminishing returns. But knowing nothing about linguistics really holds people back. Here are a few examples of linguistic concepts that I think anyone working on applied NLP should be aware of.
Let’s go back to the IT support tickets. You’re working on the task, and someone suggests extracting keywords from the tickets, which can then act like topic tags. The support team will be able to click the topic, and see other tickets with it. You know all about LDA and topic modeling, so you go ahead and create the clusters easily. The unigram topics aren’t very satisfying, though, so you include bigrams and trigrams. Now you have topics like “n’t turn”, which is very common in the data, but not very useful. How do you get topics like “turn on” and “resets”, but not “, but it”?
If you’ve never thought much about language before, it’s normal to expect “a word” to be a simple thing to define and work with. This is especially true if your native language is a language like English where most lexical items are whitespace-delimited and the morphology is relatively simple.
However, even in English there are lots of multi-word expressions: lexical types that have spaces in them and might not even be contiguous, for example, phrasal verbs. The morphology often matters, too. It doesn’t take a lot of learning to understand the difference between a lexical type and a token, and to think about which of them your application should be working with. It’s especially important to understand Zipf’s law, as the numeric relationship between the number of types and the number of tokens in a sample of text is often very relevant. So knowing linguistics can help you constrain your system’s output and make your application more useful.
Another topic that deserves more attention is compositional syntax and semantics. This is especially relevant when you’re designing annotation schemes and figuring out how you want to frame your task.
For instance, let’s take a look at the IT support tickets again. A really common instinct we see people have is that they want their model to identify subsequences of text. For instance, someone might suggest: Let’s label the phrases that indicate the problem and train an entity recognizer on that! The issue here is that “the problem” isn’t reliably going to be expressed as any particular constituent. It might not even be within a single predicate-argument structure. This means that the spans of text you’ll be annotating are really arbitrary. Your annotators will never agree closely on what to annotate, and at the end of it all, what are you even going to do with the output? If you’re not returning phrases of a specific type, it will be very difficult to do anything with the extracted spans at the end.
In this example, one solution is to model the problem as a text classification task. This will be a lot more intuitive to annotate consistently, and you’ll only need to collect one decision per label per text. This also makes it easier to get subject matter experts involved – like your IT support team. The model you train will only have to predict labels over the whole text, and the output it produces will be more useful for the downstream application.
Sometimes, a flat label really isn’t enough, though. For instance, in most chat bot contexts, you want to take the text and resolve it to a function call, including the arguments. It’s really important to have some understanding of syntax and semantics if you’re doing that. Syntax will help you define the argument boundaries properly, because you really want your arguments to be syntactic constituents – it’s the only way to make them consistent. And semantics will help you understand why the actual texts will be much more complicated than the subject-verb-object examples your team might be thinking up.
There’s much less written about applied NLP than about NLP research, which can make it hard for people to guess what applied NLP will be like. In a lot of research contexts, you’ll implement a baseline and then implement a new model that beats it. So it’s kind of natural to guess that applied NLP will be like that, except without the “new model” part. That’s not entirely untrue, but it leaves out some really important things. Research has a very different goal, which requires a different mindset. If you imagine doing applied NLP without changing that mindset, you’ll come away with a pretty incorrect impression.
The most important thing for applied NLP is to come in thinking about the product or application goals. You can’t abstract those away and just grind on an optimization problem. You’ll never ship anything valuable that way, and you might even ship something harmful. Instead, you need to try out different ideas for the data, model implementation and even evaluation. You shouldn’t expect deciding what to do to be trivial or obvious, and you especially shouldn’t assume your first idea will be the best one.
You should also keep in mind that evaluation will have a different role within your project. However you’re evaluating your models, the held-out score is only evidence that the model is better than another. You need to go and verify that. Accuracy isn’t the same thing as utility. You should expect to check the utility of multiple models, which means you’ll need to have a smooth path from prototype to production. You shouldn’t expect to just work in Jupyter notebooks on your local machine.
Applied NLP gives you a lot of decisions to make, and these decisions are often hard. It’s important to iterate, but it’s also important to build a better intuition about what might work and what might not. There are some topics in linguistics that can really help here. Obviously, there’s a lot of other things to learn as well, but it’s worth putting some points into the linguistics skill tree, once you’re up to speed with solid programming skills and a good conceptual overview of machine learning.
Applied NLP isn’t just like research NLP but with less novelty and less thinking. It requires a different thinking: applied NLP thinking.