Supervised learning is great — it's data collection that's broken

· by Ines Montani & Matthew Honnibal · ~11 min. read
Image by Agata Sasiuk

Update (January 2018)

We wrote this post while working on Prodigy, our new annotation tool for radically efficient machine teaching. Prodigy features many of the ideas and solutions for data collection and supervised learning outlined in this blog post. It’s a cloud-free, downloadable tool and comes with powerful active learning models. For more details, see the website or try the live demo.

Most AI systems today rely on supervised learning: you provide labelled input and output pairs, and get a program that can perform analogous computation for new data. Supervised learning algorithms have been improving quickly, leading many people to anticipate a new wave of entirely unsupervised algorithms: algorithms so “advanced” they can compute whatever you want, without you specifying what that might be. This is like hoping for a programming language so advanced you don’t even need to write a program.

Supervised learning is often seen as inconvenient and expensive: you don’t only need a lot of examples — they also need to be labelled. This means that at some point in the process, a human has to assign those labels, and they should match the labels the system should predict. In order to achieve meaningful results, the examples and labels need to be as specific to your application as possible. This is why many have placed their bets on unsupervised learning. If we can enable the computer to detect hidden structures in the training examples and come up with its own rules to label them, we can finally train our system on the knowledge of the world and stop relying on a human’s input, right?

The problem is that there’s any number of “structures” that an unsupervised algorithm might recover. Sometimes the unsupervised algorithm will happen to produce the output you want, but other times it won’t. If it doesn’t, there’s not much you can do — by definition, you’ve chosen an approach you have little control over. Supervised learning is not the problem. The problem is how we’re currently creating these annotations — a part of the AI process that has received surprisingly little innovation.

How machines “learn”

To understand what supervised learning actually means, take a look at this example for training a simple part-of-speech tagger — a program that can tell you whether each word in a sentence is a noun, verb, adjective, etc. The function takes as input a sequence of examples. Each example consists of a context and its correct tag, provided by a human annotator. The output of the function is the weights table, W, which can be used to predict a tag given the context. To keep the example simple, the context consists of only three pieces of evidence: the word being tagged, and its two immediate neighbours.

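Part-of-speech tagger

from collections import defaultdict
import numpy

def train_tagger(examples):
    # Tags are integer IDs starting at 0, so we need max + 1 score slots.
    # `examples` is read twice, so it must be a list, not a generator.
    n_tags = max(tag for features, tag in examples) + 1
    # One weight vector per feature, created as all zeros on first access
    W = defaultdict(lambda: numpy.zeros(n_tags))
    for (word, prev, next_), human_tag in examples:
        scores = W[word] + W[prev] + W[next_]
        guess = scores.argmax()
        if guess != human_tag:
            # Wrong guess: penalise the predicted tag, reward the human tag
            for feat in (word, prev, next_):
                W[feat][guess] -= 1
                W[feat][human_tag] += 1
    return W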

When we start our training process, the weights in W are all 0, so no matter what context we see, we’ll judge all the tags to be equally likely. In other words, we don’t know anything yet: we start off with no assumptions. To learn how we should weigh the evidence in the context, we take an error-driven approach. We iterate over the examples, get the current scores for each tag given its context and select the best-scoring one, i.e. the one that our theory thinks is most likely correct. If it matches the tag a human has assigned to it, great. If not, we decrease the score for the “bad” tag in this context, and increase the score for the “good” one, i.e. the human-assigned tag. This is as simple as adding and subtracting a point.
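To make the update rule concrete, here is a minimal sketch that runs it on two toy sentences and then tags a context it has not seen. The tag IDs, the `<s>` and `</s>` boundary markers, the `n_iter` parameter (several passes over the data instead of one) and the `predict` helper are assumptions made for this illustration, not part of the example above.

```python
from collections import defaultdict

import numpy

# Hypothetical tag IDs for this sketch; the post does not define a tag set.
NOUN, VERB, DET = 0, 1, 2


def train_tagger(examples, n_iter=5):
    """Error-driven training as described above, repeated for a few passes."""
    n_tags = max(tag for _, tag in examples) + 1
    W = defaultdict(lambda: numpy.zeros(n_tags))
    for _ in range(n_iter):
        for (word, prev, nxt), human_tag in examples:
            scores = W[word] + W[prev] + W[nxt]
            guess = int(scores.argmax())
            if guess != human_tag:
                # Subtract a point from the wrong tag, add one to the human tag
                for feat in (word, prev, nxt):
                    W[feat][guess] -= 1
                    W[feat][human_tag] += 1
    return W


def predict(W, context):
    """Sum the weights of the three pieces of evidence and pick the best tag."""
    word, prev, nxt = context
    return int((W[word] + W[prev] + W[nxt]).argmax())


# Each example is ((word, previous word, next word), human-assigned tag)
examples = [
    (("the", "<s>", "dog"), DET),
    (("dog", "the", "barks"), NOUN),
    (("barks", "dog", "</s>"), VERB),
    (("the", "<s>", "cat"), DET),
    (("cat", "the", "sleeps"), NOUN),
    (("sleeps", "cat", "</s>"), VERB),
]

W = train_tagger(examples)
predictions = [predict(W, context) for context, _ in examples]
# "bird" never appeared in training, so only "the" and "<s>" carry evidence
unseen = predict(W, ("the", "<s>", "bird"))
```

On this tiny data set the weights stop changing after the third pass, because every example is then tagged correctly and no further corrections are triggered.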

Looking at this example, it’s clear that the human_tag is the most crucial part here. If our human data is good, we’ll quickly be able to achieve a pretty decent accuracy on the task. But if our human data is bad and contains mistakes and inconsistencies, we’ll end up increasing scores on the wrong tags, resulting in a much worse model.

Where human knowledge in AI really comes from

Knowledge can be extracted from all kinds of freely available sources – for example, you can use Wikipedia’s disambiguation data, or predict sentiment from emoji on Reddit comments. But no matter what application you’re building, you usually need at least some data specific to your problem, and this data will need to be annotated by humans.

The most popular place to source large volumes of annotated data is Amazon Mechanical Turk, the Amazon Cloud of human labour. You can use their platform to publish survey-style “Human Intelligence Tasks” (HIT), which will be completed by workers from all over the world. While this sounds great in theory, it’s often disastrous in practice. The workers make around $5 an hour on average, with no connection to the task, and interfaces reminiscent of early-2000s-style surveys. Incentives are also completely misaligned, so you have to worry about being cheated by the workers — who have to worry about being cheated by you. It’s quite ironic that our oh so progressive and world-changing AI gets its knowledge from… this.

A Human Intelligence Task on Amazon Mechanical Turk.

So no wonder your data is bad. Don’t expect great data if you’re boring the shit out of underpaid people. The thing is, none of this is news. Our so-called start-up culture is based on the realisation that in order to achieve the best results, we need an engaged team that’s passionate about their work, a motivating work environment, high incentives and fair pay. We know all of this. Yet, when it comes to the absolute core of the application, the training data, all of this knowledge seems to go straight out of the window.

The problems with Mechanical Turk are not a secret and there have been many attempts at designing around them. But instead of “designing around” underpaid people doing boring work with bad incentives, data collection should receive the same treatment as all other human-facing interactions. Imagine all the knowledge you’d be able to collect if you spent as much time on your data collection process as you did on, say, the user onboarding flow of your app.

Solution #1: UX-driven data collection with active learning

When humans interact with machines, their experience is what determines the success of the interaction. I’ve already talked about some of this in my post on the importance of front-end development for AI:

If your tools are bad, the task will be boring and frustrating, and as a result, the workers’ input quality will be low. Believing that you can make your annotation tools bad because labour should be cheap is like believing that your offer is so valuable that your users shouldn’t care if your website is confusing and hard to use.

Ines Montani · How front-end development can improve Artificial Intelligence

The more time, clicks and effort it requires to complete a task, the less efficient the result. Even the most subtle changes to a user interface can have a noticeable impact, for instance on converting visitors to users or users to paying customers. There’s a reason why Tinder doesn’t ask you to type in a comma-separated list of the full names of everyone you’d like to talk to. Their card-based UI reduces a complex interaction to one intuitive motion: swipe left and swipe right. It’s so effective because it reduces friction between the user and the interface.

A user interface that requires 3 different actions (highlight, select, click) vs. one binary decision using the model's predictions

Human time and attention is precious. Instead of presenting the annotators with a span of text that contains an entity, asking them to highlight it, select one of many labels from a dropdown and confirm, you can break the whole interaction down into a simple binary decision. You’ll have to ask more questions to get the same information, but each question will be simple and focused. You’ll collect more user actions, giving you lots of smaller pieces to learn from, and a much tighter feedback loop between the human and the model. You don’t need to ask the questions using a fixed queue. As the human clicks through the examples, you can prioritise the questions, using the current state of the model. This puts the computer in charge of what it’s good at – memory and consistency.
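One way to picture the binary interaction is as a list of yes/no questions generated from the model's suggestions. The sketch below is only an illustration under assumed names: `Question`, `binary_questions`, `apply_answers`, the `(start, end, label)` candidate format and the `RESTAURANT` label are all hypothetical and do not describe any particular tool's API.

```python
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    start: int
    end: int
    label: str

    def render(self) -> str:
        """The single decision shown to the annotator."""
        span = self.text[self.start:self.end]
        return f'Is "{span}" a {self.label}? (accept/reject)'


def binary_questions(text, candidates):
    """Turn each suggested (start, end, label) into its own yes/no question."""
    return [Question(text, start, end, label) for start, end, label in candidates]


def apply_answers(questions, answers):
    """Keep only the accepted suggestions as (start, end, label) annotations."""
    return [(q.start, q.end, q.label) for q, ok in zip(questions, answers) if ok]


text = "whats the best way to catalinas"
start = text.index("catalinas")
end = start + len("catalinas")

# Two competing suggestions for the same span, e.g. from the current model
questions = binary_questions(text, [(start, end, "PERSON"), (start, end, "RESTAURANT")])
# The annotator rejects the first suggestion and accepts the second
annotations = apply_answers(questions, [False, True])
```

The annotator answers two small questions instead of performing one highlight, select and confirm sequence, and each answer is a separate piece of feedback the model can learn from.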

An annotation workflow using active learning vs. conventional "batch learning"

The above graphic shows an annotation project using this active learning approach. Instead of asking the human to annotate a fixed batch of tasks, the model selects a task and presents it to the human for annotation. The annotated single task can then immediately adjust the model’s internal weights and influence its choice of what to ask next. A simple policy is to ask what the model is least sure about, although a range of other strategies have been explored.
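The "ask what the model is least sure about" policy is often called uncertainty sampling. The following is a rough sketch of such a loop, assuming a binary accept/reject task and a small logistic regression model updated one annotation at a time. The function names, the learning rate and the toy feature vectors are assumptions for illustration, and `ask_human` stands in for whatever interface collects the annotator's answer.

```python
import numpy


def sigmoid(z):
    return 1.0 / (1.0 + numpy.exp(-z))


def most_uncertain(weights, pool, asked):
    """Index of the unasked example whose predicted probability is closest to 0.5."""
    probs = sigmoid(pool @ weights)
    uncertainty = -numpy.abs(probs - 0.5)
    for i in asked:
        uncertainty[i] = -numpy.inf  # never ask the same question twice
    return int(uncertainty.argmax())


def active_learning_loop(pool, ask_human, n_questions, lr=0.5):
    weights = numpy.zeros(pool.shape[1])
    asked = set()
    for _ in range(n_questions):
        i = most_uncertain(weights, pool, asked)
        answer = ask_human(i)  # 1 = accept, 0 = reject
        asked.add(i)
        # One online gradient step on the single new annotation, so the
        # answer immediately influences which question is asked next
        error = answer - sigmoid(pool[i] @ weights)
        weights += lr * error * pool[i]
    return weights, asked


# Toy pool: first column is a bias term, second column is one feature
pool = numpy.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -3.0], [1.0, 0.1]])
hidden_labels = [1, 0, 1, 0, 1]  # what the (simulated) annotator would answer

weights, asked = active_learning_loop(pool, lambda i: hidden_labels[i], n_questions=3)
```

With all weights at zero every example is equally uncertain, so the first question is simply the first item in the pool. After that, the order depends on the answers given so far.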

Solution #2: Transfer learning with general-purpose models

Regardless of the application you’re building, it will always require some general knowledge about language and the world, from basic grammar to a variety of phrases, expressions and entities. Think of it like hiring a new junior employee – you know you’ll have to spend time teaching them about the specifics of the work and you don’t expect them to know everything right away. But you do want them to speak a language and know how to use a computer. You’re not going to raise them from birth.

Pre-trained models let you jump-start your application with general information, which you can then fine-tune and improve to fit your custom needs. With deep learning, you can even chain together multiple models, and adjust the whole pipeline based on the eventual, task-specific error. Normally, the reusable components will be the lowest, least abstract layers of the network. For instance, consider the following sketch of an intent recognition model:

This model takes a user input and returns an intent, i.e. a prediction of what the user might want. The example input, “whats the best way to catalinas”, could mean a lot of things. In a social app, the user might want to get directions to their friend Catalina’s house. A restaurant app might want to show the local fast food restaurant, Catalinas. Or maybe you’re building a Latin American travel app and your user wants to book a trip to the beach town Las Catalinas in Costa Rica.

To figure this out, the model computes a series of internal representations, which encode generalisable information about the language and the world. For example, the phrase meanings (likely computed using a CNN or RNN layer) will let the model determine that “best way” is an expression that means something much more specific than the sum of its parts. The entity labels can take care of assigning the correct label to “catalinas” to figure out whether it’s a person, a place, a geopolitical entity or something entirely different.

But what if the intent recogniser assigns the label PERSON to “catalinas”, when the user was actually interested in a local fast food restaurant? If you own the weights of your model, you’re not only able to fix the mistaken output, but also correct all the wrong assumptions that led to it. Transfer learning is especially advantageous if you have full write access to your model. If you don’t — for instance, if you’re consuming a Cloud API — you have to settle for less principled and much less effective approaches to fine-tuning.
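The difference between owning the weights and consuming a read-only model can be sketched with a tiny two-layer network. This is an illustration under stated assumptions, not the intent model described above: `E` plays the role of a pre-trained embedding table, `w` is a task-specific output layer, and `train_step`, `update_pretrained` and all numbers are invented for the example.

```python
import numpy


def sigmoid(z):
    return 1.0 / (1.0 + numpy.exp(-z))


def train_step(E, w, token_ids, label, lr=0.1, update_pretrained=True):
    """One gradient step on a tiny two-layer model; updates E and w in place.

    E: "pre-trained" embedding table (vocab_size x dim), the reusable layer
    w: task-specific output weights (dim,)
    """
    h = E[token_ids].mean(axis=0)  # internal representation of the input
    p = sigmoid(h @ w)             # task prediction
    d_score = p - label            # gradient of the log loss w.r.t. the score
    d_h = d_score * w              # error backpropagated into the representation
    w -= lr * d_score * h          # the task layer is always updated
    if update_pretrained:
        # With write access, the same error also corrects the lower layer.
        # subtract.at handles token IDs that occur more than once.
        numpy.subtract.at(E, token_ids, lr * d_h / len(token_ids))
    return p


E0 = numpy.arange(12, dtype=float).reshape(4, 3) / 10
w0 = numpy.array([0.5, -0.5, 0.25])
token_ids, label = [1, 2], 1

# Read-only lower layer, e.g. representations you cannot modify
E_frozen, w_frozen = E0.copy(), w0.copy()
train_step(E_frozen, w_frozen, token_ids, label, update_pretrained=False)

# Full write access: the task-specific error reaches the embeddings too
E_tuned, w_tuned = E0.copy(), w0.copy()
train_step(E_tuned, w_tuned, token_ids, label, update_pretrained=True)
```

In the frozen case only `w` moves, so a wrong internal representation can only be compensated for further up. In the tuned case the rows of `E` for the tokens in the input are adjusted as well.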


If you’re using a SaaS provider and only get access via an API, your model will essentially stay a black box. You’re not easily able to backpropagate to correct the model’s internal representations. All you can do is add workarounds for your read-only model or retrain it with more examples. This challenges the current trends towards thin clients and centralised cloud computing, and is one of many reasons to bet against Machine Learning as a Service.

Machine Learning as a Service is an idea we’ve been seeing for nearly 10 years and it’s been failing the whole time. The bottom line on why it doesn’t work: the people that know what they’re doing just use open source, and the people that don’t will not get anything to work, ever, even with APIs.

Bradford Cross · Five AI Startup Predictions for 2017

Supervised learning is only just getting good, and with transfer learning and fully differentiable models it’s constantly getting better. It doesn’t make sense to give up on controlling the model’s output just because our annotation tooling and processes suck. There’s definitely a problem here, but it’s not the concept of supervised learning. At Explosion, we’re making this one of our priorities, and we’re looking forward to sharing the results.

About the authors

  • Ines Montani CEO, Founder

    • Berlin, Germany
  • Matthew Honnibal CTO, Founder

    • Berlin, Germany