Against LLM maximalism

May 18, 2023
26 minute read
Blog
Evaluation
LLMs, NLP Strategy
Matthew Honnibal

A lot of people are building truly new things with Large Language Models (LLMs), like wild interactive fiction experiences that weren’t possible before. But if you’re working on the same sort of Natural Language Processing (NLP) problems that businesses have been trying to solve for a long time, what’s the best way to use them?

Companies have been using language technologies for many years now, often with mixed success. The need to work with text or speech data somewhat intelligently is pretty fundamental. For instance, in most popular websites text is usually either a big part of the product (e.g. the site publishes news or commentary), the usage pattern (e.g. users write text to each other) or the input (e.g. news aggregators). Transacting in language makes up a large percentage of economic activity in general. We all spend a big part of our working lives writing, reading, speaking and listening. So, it makes sense that do-things-with-language has long been a desirable feature request for all sorts of programs. In 2014 I started working on spaCy, and here’s an excerpt of how I explained the motivation for the library:

Computers don’t understand text. This is unfortunate, because that’s what the web almost entirely consists of. We want to recommend people text based on other text they liked. We want to shorten text to display it on a mobile screen. We want to aggregate it, link it, filter it, categorize it, generate it and correct it. spaCy provides a library of utility functions that help programmers build such products.

Today it’s hard to make the claim computers don’t understand text without at least adding huge asterisks or qualifications. Even if you have a particular philosophical view of what constitutes “understanding”, there’s no doubt that LLMs can construct and manipulate meaning representations sufficient for a wide range of practical purposes. They still make all sorts of mistakes, but it’s relatively rare to feel like they’ve simply failed to connect the words of your input together to form the intended meaning. The text that they generate is also extremely fluent. Sometimes it can be confidently wrong or irrelevant to your question, but it’s almost always made up of real, coherent sentences. That’s definitely new.

However, LLMs are not a direct solution to most of the NLP use-cases companies have been working on. They are extremely useful, but if you want to deliver reliable software you can improve over time, you can’t just write a prompt and call it a day. Once you’re past prototyping and want to deliver the best system you can, supervised learning will often give you better efficiency, accuracy and reliability than in-context learning for non-generative tasks — tasks where there is a specific right answer that you want the model to find. Applying rules and logic around your models to do data transformations or handle cases that can be fully enumerated is also extremely important.

For instance, let’s say you want to build an online reputation management system. You want to ingest a feed of posts from Twitter, Reddit or some other source, identify mentions of your company or products, and understand common themes between them. Perhaps you also want to monitor mentions of key competitors in a similar way. You might want to view the data in a variety of ways. For instance, you could extract a few noisy metrics, such as a general “positivity” sentiment score that you track in a dashboard, while you also produce more nuanced clustering of the posts which are reviewed periodically in more detail.

I don’t want to undersell how impactful LLMs are for this sort of use-case. You can give an LLM a group of comments and ask it to summarize the texts or identify key themes. And the specifics of this can change at runtime: you don’t have to target a very particular sort of summary or question that will be posed. This generative output could be a complete game-changer, finally delivering the “insights” that data science projects have generally over-promised and under-delivered. In addition to these generative components, you could also use LLMs to help with various other parts of the system. But should you?

LLMs are new enough, and changing quickly enough, that there’s little consensus on how best to use them. Eventually things will stabilize (or perhaps the AGI will get us all after all), and we’ll have some scars and accreted wisdom about what works and what doesn’t. In the meantime, I want to suggest some common sense.

One vision for how LLMs can be used is what I’ll term LLM maximalist. If you have some task, you try to ask the LLM to do it as directly as possible. Need the data in some format? Ask for it in the prompt. Avoid breaking down your task into several steps, as that will prevent the LLM from working through your problem end-to-end. It also introduces extra calls, and can introduce errors in the intermediate processing steps. Of course, LLMs do have limitations, for instance in the recency of their knowledge, or the size of the context you can pass in. So you do have to work around things, and use things like vector databases or other tricks. But fundamentally the LLM maximalist position is that you want to trust the LLM to solve the problem. You’re preparing for the technologies to continue to improve, and the current pain-points to keep reducing over time.

Left: "Noo you can't just mix up all the steps of your task and ask an LLM to do it all. How will you ever make a reliable and extensible system that way?" Right: "Haha LLM go brrr" — It's never great to realize you're the guy on the left. But, here I am.

There are two big problems with this approach. One is that “working around the system’s limitations” is often going to be outright impossible. Most systems need to be much faster than LLMs are today, and on current trends of efficiency and hardware improvements, will be for the next several years. Users are pretty tolerant of latency in chat applications, but in almost any other type of user interface, you can’t wait multiple seconds for a single prediction. It’s just too slow. In our online reputation management example, you want to connect to some sort of firehose of data, like Reddit or Twitter. You can’t pass that straight into an LLM — it’s much too expensive.

The second problem is that the LLM maximalist approach is fundamentally not modular. Let’s say you’ve made the obvious small compromise, and you’re using a separate classifier to pre-filter texts that maybe mention your company. You’ve developed a prompt that works well with the LLM model you’re using, and you’re getting pretty good output summaries. Now you get a new request. Users want to be able to view some of the raw data. To address this need, your team decides to develop a separate view, where you display the list of mentions along with one sentence of context.

How should you go about this? You could craft a separate LLM prompt, where you ask it to extract the data with this second format you’ve been asked for. However, the new prompt is not guaranteed to recognize the same set of mentions as the first prompt you’ve used — inevitably, there will be some differences. This is really not great. You want to be able to link the summaries to the groups of comments they were generated from. If the sentence view is different, you can’t do that. Instead of a separate prompt, you could try to add the information to the first prompt. But now you’re outputting whole sentences, which really increases the number of tokens you generate, which is both slower and more expensive. And you’re struggling to get the same accuracy you had before with this new more complicated output format.

What makes a good program? It’s not only how efficiently and accurately it solves a single set of requirements, but also how reliably it can be understood, changed and improved. Programs written with the LLM maximalist approach are not good under these criteria.

Instead of throwing away everything we’ve learned about software design and asking the LLM to do the whole thing all at once, we can break up our problem into pieces, and treat the LLM as just another module in the system. When our requirements change or get extended, we don’t have to go back and change the whole thing. We can add new modules, or rearrange the ones we’re already happy with.

Breaking up the task into separate modules also helps you to see which parts really need an LLM, and what could be done more simply and reliably with another approach. Recognizing sentence boundaries in English isn’t entirely trivial (you don’t want to just use regular expressions), but it’s definitely not something you need an LLM to do. You can just call into spaCy or another library. It will be vastly faster, and you won’t have to worry that the LLM will trip over some strange input and return some entirely unexpected output.

The task of detecting the company mentions is also something you probably don’t need to use an LLM to do. It certainly makes sense to use an LLM for the initial prototyping — this is another huge advantage of LLMs that should not be underestimated. Rapid prototyping is enormously important. You can explore the design space efficiently, and discard ideas that aren’t worth further development. But you also need to be able to go past prototyping. Once you’ve found an idea that’s worth improving, you need a way to actually improve it.

Before you can improve any statistical component, you need to be able to evaluate it. It’s important to have some evaluation over your whole pipeline, and if you have nothing else, you can use that to judge whether some change to a component is making things better or worse (this is called “extrinsic evaluation”). But you should also evaluate your components in isolation (“intrinsic evaluation”). For components like the mention detector, this means annotating some texts with the correct labels, setting them aside, and testing your component against them after each change. For generative components, you can’t evaluate against a single set of annotations, but intrinsic evaluation is still possible, for instance using a Likert scale or A/B testing.

Intrinsic evaluation is like a unit test, while extrinsic evaluation is like an integration test. You do need both. It’s very common to start building an evaluation set, and find that your ideas about how you expect the component to behave are much vaguer than you realized. You need a clear specification of the component to improve it, and to improve the system as a whole. Otherwise, you’ll end up in a local maximum: changes to one component will seem to make sense in themselves, but you’ll see worse results overall, because the previous behavior was compensating for problems elsewhere. Systems like that are very difficult to improve.

A good rule of thumb is that you’ll want ten data points per significant digit of your evaluation metric. So if you want to distinguish 91% accuracy from 90% accuracy, you’ll want to have at least 1000 data points annotated. You don’t want to be running experiments where your accuracy figure says a 1% improvement, but actually you went from 94/103 to 96/103. You’ll end up forming superstitions, based on little but luck. That’s not a path to improvement. You need to be systematic.

Once you’ve annotated evaluation data, it’s usually better to just go on and annotate some training data for non-generative components. Supervised learning is very strong for tasks such as text classification, entity recognition and relation extraction. If you have a clear idea of what the component should do and can annotate data accordingly, you can usually expect to get better accuracy than an LLM with a few hundred labelled examples, using a transformer architecture sized for a single GPU, with pretrained representations. This is really just the same model architecture as an LLM, but in a more convenient size, and configured to perform exactly one task.

Here’s how I think LLMs should be used in NLP projects today — an approach I would call LLM pragmatism.

Break down what you want your application to do with language into a series of predictive and generative steps.
Keep steps simple, and don’t ask for transformations or formatting you could easily do deterministically.
Put together a prototype pipeline, using LLM prompts or off-the-shelf solutions for all the predictive or generative steps.
Try out the pipeline in as realistic a context as you can.
Design some sort of extrinsic evaluation. What does success look like here? Net labour saved? Engagement? Conversions? If you can’t measure the utility of the system directly, you can use some other sort of metric, but you should try to make it as meaningful as possible. If false negatives matter more than false positives, account for that in your extrinsic evaluation metric.
Experiment with alternative pipeline designs. Try to create tasks where the correct answer makes sense independent of your use-case. Prefer text classification to entity recognition, and entity recognition to relation extraction (faster to annotate, accuracy is better).
Pick a predictive (as opposed to generative) component and spend two to five hours labelling annotation data for it.
Measure the LLM-powered component’s accuracy using your evaluation data.
Use the LLM-powered component to help you create training data, to train your own model. One approach is to simply save out the LLM-powered component’s predictions, and trust they’re good enough. This is a good thing to try if the LLM-powered component’s accuracy seems more than sufficient for your needs. If you need better accuracy than what the LLM is giving you, you need example data that’s more correct. A good approach is to load the LLM predictions into an annotation tool and fix them up.
Train a supervised model on your new training data, and evaluate it on the same evaluation data you used previously.
To decide whether you should annotate more training data, run additional experiments where you hold part of your training data back. For instance, compare how your accuracy changed when you use 100%, 80% and 50% of the data you have. This should help you determine how your accuracy might look if you had 120% or 150%. However, be aware that if your training set is small, there can be a lot of variance in your accuracy. Try a few different random seeds to figure out how much your accuracy changes simply due to chance, to help put your results into perspective.
Repeat this process with any other predictive components.

At the end of all this, you’ll have a pipeline suitable for production. It will run much more quickly and accurately than a chain of LLM calls, and you’ll know that no matter what text is passed in, your predictive components will always give you valid output. You’ll have evaluations for the different steps, and you’ll be able to attribute errors to different components. If you need to change the system’s behavior, you’ll be able to put in new rules or transformations at different points of the pipeline, without having to go back and re-engineer your prompts or rewrite your response parsing logic.

Some of these steps are still a bit harder than they should be, especially if you haven’t been working with machine learning before. This is where I have high hopes for LLMs. LLMs are indeed very powerful, and they can make a lot of things much easier. They can help us build better systems, and that’s how we should use them. The approach that I’ve called LLM maximalist is actually unambitious. It uses LLMs to easily get to a system that is worse — worse in cost, worse in runtime, worse in reliability, worse in maintainability. Instead, we should use LLMs to help us get to systems which are better. This means leaning on LLMs much more during development, to break down knowledge barriers, create data, and otherwise improve our workflow. But the goal should be to call out to LLMs as little as possible during runtime. Let LLMs train their own cheaper and more reliable replacements.

Appendix 1: Putting it into practice

There are lots of tools and libraries you can use to label your own data and train your own NLP models. HF Transformers and spaCy (our library) are the two most popular libraries for this. Transformers makes it easy to use a wide variety of models from recent research, and it’s closer to the underlying ML library (PyTorch). spaCy has better data structures for working with annotations, support for pipelines that mix statistical and rule-based operations, and more frameworky features for configuration, extension and project workflows. We recently released spacy-llm, an extension that lets you add LLM-powered components to your spaCy pipelines. You can mix LLM with other components, and make use of spaCy’s Doc, Span, Token and other classes to make use of the annotations.

Let’s say you want to create a pipeline that detects some entities, and then you want to get the sentences that the entities occur in. Here’s how that looks, building and using the pipeline in Python:

Usage example
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe(
    "llm",
    config={
        "task": {
            "@llm_tasks": "spacy.NER.v1",
            "labels": "SAAS_PLATFORM,PROGRAMMING_LANGUAGE,OPEN_SOURCE_LIBRARY"
        },
        "model": {
            "@llm_models": "spacy.Davinci.v2",
        },
    },
)

doc = nlp("There's no PyTorch bindings for Go. We just use Microsoft Cognitive Services.")
for ent in doc.ents:
    print(ent.text, ent.label_, ent.sent)

The sentencizer component in spaCy uses rules to detect sentence boundaries, and the llm component is here configured to perform named entity recognition, with the given labels. Task handlers are provided for some other common NLP tasks, but you can also define your own functions to perform arbitrary tasks — you just need to add a decorator to your function and obey the correct signature.

The sentence and entity annotations (accessed via the doc.sents and doc.ents in this example) are both accessible as sequences of Span objects, which is like a labelled slice of the Doc object. You can iterate over the tokens in the span, get its start and end character offset, and (depending on the components you in your pipeline) access embeddings or compute similarities. Components can assign multiple overlapping layers of Span annotations, and you can design and assign extension attributes to conveniently access additional properties.

Obviously, I’m pretty proud of these parts of spaCy, but what do they have to do with LLMs? You can prompt an LLM with a command like, “How many paragraphs in this review say something bad about the acting? Which actors do they frequently mention?”. For once-off personal tasks, this is absolutely magical. But if you’re building a system and you want to calculate and display this information for every review, it’s very nice to just approach this as separate prediction tasks (tagging names, linking them to a knowledge base, and paragraph-level actor sentiment), with data structures that let you flexibly access the information. The new LLM support in spaCy now lets you plug in LLM-powered components for these prediction tasks, which is especially great for prototyping. This functionality is still quite new and experimental, but it’s already very fun to explore.

Another piece of tooling you’ll need is some sort of solution for data annotation. You can just load things up in a spreadsheet or text editor, but if you’re doing this repeatedly, it’s worth using something better. We make a commercial annotation tool, Prodigy, which emphasizes customizability and model-assisted workflows. Prodigy isn’t free, but it’s a one-time purchase per license, and it fits well into local workflows.

A key design idea in Prodigy is model assistance: calling into a model to get initial annotations, and letting you review and fix them. This works especially well with LLMs, and we’ve been building out support for it over the last six months. Prodigy v1.12 will feature integrated support for LLM annotation assistance, with support for a choice of backends, including open-source solutions you can host yourself. Prodigy also supports A/B evaluation for generative outputs, which applies especially well for LLMs. This functionality is also extended in v1.12. For instance, you can design a number of different prompts, and run a tournament between them, by answering a series of A/B evaluation questions where you pick which of two outputs is better without knowing which prompt produced them. This lets you perform prompt engineering systematically, based on decisions you can record and later review.

Appendix 2: Accuracy of supervised and in-context learning

Large Language Models (LLMs) can be used for arbitrary prediction tasks, by constructing a prompt describing the task, giving the labels to predict, and optionally including a relatively small number of examples in the prompt. This approach doesn’t involve any direct updates to the model for the new task. However, LLMs seem to learn a general ability to continue patterns, including abstract ones, from their language model objective. The mechanics are still under investigation, but see for example Anthropic’s work on induction heads (Olsson et al., 2022).

In supervised learning (often referred to as fine-tuning, in the context of language models), the model is provided a set of labelled example pairs, and the weights are adjusted such that some objective function is minimized. In modern NLP, supervised learning and language model pretraining are closely linked. Knowledge about the language generalizes between tasks, so it’s desirable to somehow initialize the model with that knowledge. Language model pretraining has proven to be a very strong general answer to this requirement. In my opinion the best things to read on this are articles from when the developments were relatively fresh, such as Sebastian Ruder’s 2018 blog post NLP’s ImageNet moment has finally arrived.

OpenAI evaluated GPT-3’s in-context learning capabilities against supervised learning in a variety of configurations (Brown et al., 2020). The results in Section 3.7, on the SuperGLUE benchmark, are the most directly relevant to general NLP prediction tasks such as entity recognition or text classification. In their experiments, OpenAI prompted GPT3 with 32 examples of each task, and found that they were able to achieve similar accuracy to the BERT baselines. These results were the first big introduction to in-context learning as a competitive approach, and they’re indeed impressive. However, they were well below the state-of-the-art accuracy when they were published, and the current state-of-the-art results on the SuperGLUE leaderboard all involve supervised learning, not just in-context learning. Some subtasks of the SuperGLUE benchmark suite have very small training sets, and on these tasks in-context learning is competitive. I’m not aware of any current NLP benchmarks where more than a few hundred training samples are available, and the leading systems rely solely on in-context learning.

The point of in-context learning has never been to be the absolute highest accuracy way to have a model perform some specific task. Rather, it’s an impressive compromise: it’s extremely sample efficient (you don’t need many examples of your task), and you don’t have to pay the upfront computational cost of training. In short, the advantage of in-context learning is lower overhead. But the longer your project lives, the less this should be seen as a dominant advantage. The overheads get amortized away.

Finally, it’s important to realize that SuperGLUE and other standard NLP benchmarks are specifically designed to be quite challenging. Easy tasks don’t make good benchmarks. This is exactly opposite to NLP applications, where we want tasks to be as easy as possible. Most practical tasks don’t require powerful reasoning abilities or extensive background world knowledge, which are the things that really set LLMs apart from smaller models. Rather, practical tasks usually require the model to learn a fairly specific set of policies, and then apply them consistently. Supervised learning is a good fit for this requirement.

How to advocate for modular NLP in the age of Generative AI

Against LLM maximalism

Appendix 1: Putting it into practice

Usage example

Appendix 2: Accuracy of supervised and in-context learning

How to advocate for modular NLP in the age of Generative AI

How Love Without Sound helps the music industry recover millions in revenue for artists with NLP, spaCy and Prodigy

What the history of the web can teach us about the future of AI

From PDFs to AI-ready structured data: a deep dive