Most AI systems today rely on supervised learning: you provide labelled input and output pairs, and get a program that can perform analogous computation for new data. Supervised learning algorithms have been improving quickly, leading many people to anticipate a new wave of entirely unsupervised algorithms: algorithms so “advanced” they can compute whatever you want, without you specifying what that might be. This is like hoping for a programming language so advanced you don’t even need to write a program.
Supervised learning is often seen as inconvenient and expensive: you not only need a lot of examples — they also need to be labelled. This means that at some point in the process, a human has to assign those labels, and they have to be the same kind of labels you later want the system to predict. In order to achieve meaningful results, the examples and labels need to be as specific to your application as possible. This is why many have placed their bets on unsupervised learning. If we can enable the computer to detect hidden structures in the training examples and come up with its own rules to label them, we can finally train our system on the knowledge of the world and stop relying on a human’s input, right?
The problem is that there’s any number of “structures” that an unsupervised algorithm might recover. Sometimes the unsupervised algorithm will happen to produce the output you want, but other times it won’t. If it doesn’t, there’s not much you can do — by definition, you’ve chosen an approach you have little control over. Supervised learning is not the problem. The problem is how we’re currently creating these annotations — a part of the AI process that has received surprisingly little innovation.
How machines “learn”
To understand what supervised learning actually means, take a look at this example for training a simple part-of-speech tagger — a program that can tell you whether each word in a sentence is a noun, verb, adjective, etc. The function takes as input a sequence of examples. Each example consists of a context and its correct tag, provided by a human annotator. The output of the function is the weights table, W, which can be used to predict a tag given the context. To keep the example simple, the context consists of only three pieces of evidence: the word being tagged, and its two immediate neighbours.
Part-of-speech tagger
from collections import defaultdict
import numpy

def train_tagger(examples):
    # Tags are assumed to be zero-indexed integer IDs, so the number of
    # classes is the highest ID plus one
    n_tags = max(tag for features, tag in examples) + 1
    # One weight vector per feature: the word itself and its two neighbours
    W = defaultdict(lambda: numpy.zeros(n_tags))
    for (word, prev, next), human_tag in examples:
        # Sum the evidence for each tag and pick the current best guess
        scores = W[word] + W[prev] + W[next]
        guess = scores.argmax()
        if guess != human_tag:
            # Penalise the wrong guess and reward the human-assigned tag
            for feat in (word, prev, next):
                W[feat][guess] -= 1
                W[feat][human_tag] += 1
    return W
When we start our training process, the weights in W are all 0, so no matter what context we see, we’ll judge all the tags to be equally likely. In other words, we don’t know anything yet — we start off with no assumptions. To learn how we should weigh the evidence in the context, we take an error-driven approach. We iterate over the examples, get the current scores for each tag given its context and select the best-scoring one, i.e. the one that our theory thinks is most likely correct. If it matches the tag a human has assigned to it — great. If not, we decrease the score for the “bad” tag in this context, and increase the score for the “good” one, i.e. the human-assigned tag. This is as simple as adding and subtracting a point.
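To make this concrete, here is a hypothetical way to use the trained weights table for prediction. The tag IDs, the toy training data and the <s> padding token are made up for illustration:

NOUN, VERB, DET = 0, 1, 2  # made-up integer tag IDs

examples = [
    (("the", "<s>", "dog"), DET),
    (("dog", "the", "barks"), NOUN),
    (("barks", "dog", "<s>"), VERB),
]

# Repeat the tiny data set so the perceptron gets several passes over it
W = train_tagger(examples * 10)

def predict(word, prev, next):
    # Same scoring rule as in training: sum the evidence, take the best tag
    scores = W[word] + W[prev] + W[next]
    return int(scores.argmax())

print(predict("dog", "the", "barks"))  # the tag ID the model now prefers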
Looking at this example, it’s clear that the human_tag is the most crucial part here. If our human data is good, we’ll quickly be able to achieve a pretty decent accuracy on the task. But if our human data is bad and contains mistakes and inconsistencies, we’ll end up increasing scores on the wrong tags, resulting in a much worse model.
Where the human knowledge in AI comes from
Knowledge can be extracted from all kinds of freely available sources – for example, you can use Wikipedia’s disambiguation data, or predict sentiment from emoji on Reddit comments. But no matter what application you’re building, you usually need at least some data specific to your problem, and this data will need to be annotated by humans.
The most popular place to source large volumes of annotated data is Amazon Mechanical Turk, the Amazon Cloud of human labour. You can use their platform to publish survey-style “Human Intelligence Tasks” (HITs), which will be completed by workers from all over the world. While this sounds great in theory, it’s often disastrous in practice. The workers make around $5 an hour on average, have no connection to the task, and work with interfaces reminiscent of early-2000s-style surveys. Incentives are also completely misaligned, so you have to worry about being cheated by the workers — who have to worry about being cheated by you. It’s quite ironic that our oh-so-progressive and world-changing AI gets its knowledge from… this.
So no wonder your data is bad. Don’t expect great data if you’re boring the shit out of underpaid people. The thing is, none of this is news. Our so-called start-up culture is based on the realisation that in order to achieve the best results, we need an engaged team that’s passionate about their work, a motivating work environment, high incentives and fair pay. We know all of this. Yet, when it comes to the absolute core of the application, the training data, all of this knowledge seems to go straight out of the window.
The problems with Mechanical Turk are not a secret and there have been many attempts at designing around them. But instead of “designing around” underpaid people doing boring work with bad incentives, data collection should receive the same treatment as all other human-facing interactions. Imagine all the knowledge you’d be able to collect if you spent as much time on your data collection process as you did on, say, the user onboarding flow of your app.
Solution #1: UX-driven data collection with active learning
When humans interact with machines, their experience decides whether the interaction succeeds. I’ve already talked about some of this in my post on the importance of front-end development for AI:
If your tools are bad, the task will be boring and frustrating, and as a result, the workers’ input quality will be low. Believing that you can make your annotation tools bad because labour should be cheap is like believing that your offer is so valuable that your users shouldn’t care if your website is confusing and hard to use.
— How front-end development can improve Artificial Intelligence
The more time, clicks and effort it takes to complete a task, the worse the results. Even the most subtle changes to a user interface can have a noticeable impact, for instance on converting visitors to users or users to paying customers. There’s a reason why Tinder doesn’t ask you to type in a comma-separated list of the full names of everyone you’d like to talk to. Their card-based UI reduces a complex interaction to one intuitive motion: swipe left and swipe right. It’s so effective because it reduces friction between the user and the interface.
Human time and attention is precious. Instead of presenting the annotators with a span of text that contains an entity, asking them to highlight it, select one of many labels from a dropdown and confirm, you can break the whole interaction down into a simple binary decision. You’ll have to ask more questions to get the same information, but each question will be simple and focused. You’ll collect more user actions, giving you lots of smaller pieces to learn from, and a much tighter feedback loop between the human and the model. You don’t need to ask the questions using a fixed queue. As the human clicks through the examples, you can prioritise the questions, using the current state of the model. This puts the computer in charge of what it’s good at – memory and consistency.
The above graphic shows an annotation project using this active learning approach. Instead of asking the human to annotate a fixed batch of tasks, the model selects a task and presents it to the human for annotation. The newly annotated task can then immediately adjust the model’s internal weights and influence its choice of what to ask next. A simple policy is to ask what the model is least sure about, although a range of other strategies have been explored.
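As a minimal sketch, here is how such a loop might look for the perceptron-style tagger from earlier. The score-margin uncertainty measure and the confirm/correct callbacks standing in for the annotation interface are illustrative assumptions:

import numpy

def scores_for(W, word, prev, next):
    # Same scoring rule the tagger uses at training time
    return W[word] + W[prev] + W[next]

def most_uncertain(W, unlabelled):
    # Smallest margin between the top two tags = least confident guess
    def margin(context):
        ranked = numpy.sort(scores_for(W, *context))
        return ranked[-1] - ranked[-2]
    return min(unlabelled, key=margin)

def active_learning_loop(W, unlabelled, confirm, correct, n_questions=100):
    # confirm(context, guess) -> bool and correct(context) -> tag ID are
    # hypothetical callbacks wrapping the annotation interface
    for _ in range(n_questions):
        if not unlabelled:
            break
        context = most_uncertain(W, unlabelled)
        unlabelled.remove(context)
        guess = int(scores_for(W, *context).argmax())
        if not confirm(context, guess):      # one binary decision per question
            human_tag = correct(context)     # only asked if the guess was wrong
            for feat in context:
                W[feat][guess] -= 1
                W[feat][human_tag] += 1
    return W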
Solution #2: Transfer learning with general-purpose models
Regardless of the application you’re building, it will always require some general knowledge about language and the world, from basic grammar to a variety of phrases, expressions and entities. Think of it like hiring a new junior employee – you know you’ll have to spend time teaching them about the specifics of the work and you don’t expect them to know everything right away. But you do want them to speak a language and know how to use a computer. You’re not going to raise them from birth.
Pre-trained models let you jump-start your application with general information, which you can then fine-tune and improve to fit your custom needs. With deep learning, you can even chain together multiple models and adjust the whole pipeline based on the eventual, task-specific error. Normally, the reusable components will be the lowest, least abstract layers of the network. For instance, consider the following sketch of an intent recognition model:
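Here is a rough, hypothetical rendering of that pipeline as a PyTorch module; the layer names and sizes, and the choice of an LSTM encoder, are illustrative assumptions rather than a specific production architecture:

import torch
import torch.nn as nn

class IntentModel(nn.Module):
    def __init__(self, vocab_size, n_entity_labels, n_intents, dim=64):
        super().__init__()
        # Word embeddings: general-purpose and typically pre-trained
        self.embed = nn.Embedding(vocab_size, dim)
        # Context encoder that computes phrase representations
        self.encode = nn.LSTM(dim, dim, batch_first=True)
        # Per-token entity labels (person, place, etc.)
        self.entity = nn.Linear(dim, n_entity_labels)
        # Task-specific intent prediction over a pooled sentence summary
        self.intent = nn.Linear(dim, n_intents)

    def forward(self, token_ids):
        vectors = self.embed(token_ids)        # (batch, tokens, dim)
        phrases, _ = self.encode(vectors)      # contextual phrase meanings
        entities = self.entity(phrases)        # per-token entity scores
        pooled = phrases.mean(dim=1)           # crude sentence representation
        return self.intent(pooled), entities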
This model takes a user input and returns an intent, i.e. a prediction of what the user might want. The example input, “whats the best way to catalinas”, could mean a lot of things. In a social app, the user might want to get directions to their friend Catalina’s house. A restaurant app might want to show the local fast food restaurant, Catalinas. Or maybe you’re building a Latin American travel app and your user wants to book a trip to the beach town Las Catalinas in Puerto Rico.
To figure this out, the model computes a series of internal representations, which encode generalisable information about the language and the world. For example, the phrase meanings (likely computed using a CNN or RNN layer) will let the model determine that “best way” is an expression that means something much more specific than the sum of its parts. The entity labels can take care of assigning the correct label to “catalinas” to figure out whether it’s a person, a place, a geopolitical entity or something entirely different.
But what if the intent recogniser assigns the label PERSON to “catalinas”, when the user was actually interested in a local fast food restaurant? If you own the weights of your model, you’re not only able to fix the mistaken output, but also correct all the wrong assumptions that led to it. Transfer learning is especially advantageous if you have full write access to your model. If you don’t — for instance, if you’re consuming a Cloud API — you have to settle for less principled and much less effective approaches to fine-tuning.
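If you do own the weights, a correction can be as direct as a gradient step through the whole pipeline. Continuing the hypothetical PyTorch sketch from above (the token IDs and the intent label are made up):

import torch
import torch.nn.functional as F

model = IntentModel(vocab_size=10_000, n_entity_labels=8, n_intents=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

token_ids = torch.tensor([[31, 4, 87, 290, 11, 955]])  # "whats the best way to catalinas"
true_intent = torch.tensor([2])                        # e.g. the "local business" intent

optimizer.zero_grad()
intent_scores, entity_scores = model(token_ids)
loss = F.cross_entropy(intent_scores, true_intent)
loss.backward()    # gradients also flow into the entity and phrase layers
optimizer.step()   # all the weights that led to the mistake get adjusted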
Conclusion
If you’re using a SaaS provider and only get access via an API, your model will essentially stay a black box. You’re not easily able to backpropagate to correct the model’s internal representations. All you can do is add workarounds for your read-only model or retrain it with more examples. This challenges the current trends towards thin clients and centralised cloud computing, and is one of many reasons to bet against Machine Learning as a Service.
Machine Learning as a Service is an idea we’ve been seeing for nearly 10 years and it’s been failing the whole time. The bottom line on why it doesn’t work: the people that know what they’re doing just use open source, and the people that don’t will not get anything to work, ever, even with APIs.
— Bradford Cross, Five AI Startup Predictions for 2017
Supervised learning is only just getting good, and with transfer learning and fully differentiable models it’s constantly getting better. It doesn’t make sense to give up on controlling the model’s output just because our annotation tooling and processes suck. There’s definitely a problem here, but it’s not the concept of supervised learning. At Explosion, we’re making this one of our priorities, and we’re looking forward to sharing the results.