
Prodigy: A new tool for radically efficient machine teaching

by Matthew Honnibal & Ines Montani

Machine learning systems are built from both code and data. It's easy to reuse the code but hard to reuse the data, so building AI mostly means doing annotation. This is good, because the examples are how you program the behaviour – the learner itself is really just a compiler. What's not good is the current technology for creating the examples. That's why we're pleased to introduce Prodigy, a downloadable tool for radically efficient machine teaching.

We've been working on Prodigy since we first launched Explosion AI last year, alongside our open-source NLP library spaCy and our consulting projects (it's been a busy year!). During that time, spaCy has grown into the most popular library of its type, giving us a lot of insight into what's driving success and failure for language understanding technologies. Most of those insights have been used to make spaCy better: AI DevOps was hard, so we made sure models could be installed via pip. Large models made CI tricky, so the new models are less than 1/10th the size.

Prodigy addresses the big remaining problem: annotation and training. The typical approach to annotation forces projects into an uncomfortable waterfall process. The experiments can't begin until the first batch of annotations is complete, but the annotation team can't start until they receive the annotation manuals. To produce the annotation manuals, you need to know what statistical models will be required for the features you're trying to build. Machine learning is an inherently uncertain technology, but the waterfall annotation process relies on accurate upfront planning. The net result is a lot of wasted effort.

Prodigy solves this problem by letting data scientists conduct their own annotations, for rapid prototyping. Ideas can be tested faster than the first planning meeting could even be scheduled. We also expect Prodigy to reduce costs for larger projects, but it's the increased agility we're most excited about. Data science projects are said to have uneven returns, like start-ups: a minority of projects are very successful, recouping costs for a larger number of failures. If so, the most important problem is to find more winners. Prodigy helps you do that, because you get to try things much faster.

Most annotation tools avoid making any suggestions to the user, to avoid biasing the annotations. Prodigy takes the opposite approach: ask the user as little as possible, and try to guess the rest. Prodigy puts the model in the loop, so that it can actively participate in the training process and learn as you go. The model uses what it already knows to figure out what to ask you next. As you answer the questions, the model is updated, influencing which examples it asks you about next. To take full advantage of this strategy, Prodigy is provided as a Python library and command line utility, with a flexible web application. There's a thin, optional hosted component to make it easy to share annotation queues, but the tool itself is entirely under your control.
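The model-in-the-loop strategy can be sketched in a few lines. This is an illustration of the idea only, not Prodigy's actual implementation: `predict`, `update` and `ask_user` are hypothetical callbacks standing in for the real model and interface.

```python
# Sketch of a model-in-the-loop annotation pass (hypothetical names,
# not Prodigy's internal API). The model scores incoming examples,
# the most uncertain ones are shown to the annotator, and the model
# is updated from each answer before scoring the rest of the stream.

def prefer_uncertain(stream, predict):
    """Yield (score, example) pairs whose score is close to 0.5."""
    for example in stream:
        score = predict(example)
        if abs(score - 0.5) < 0.3:   # skip confident predictions
            yield score, example

def annotate(stream, predict, update, ask_user):
    """Ask about uncertain examples, updating the model as we go."""
    annotations = []
    for score, example in prefer_uncertain(stream, predict):
        answer = ask_user(example)   # 'accept', 'reject' or 'ignore'
        if answer != 'ignore':
            update(example, answer)  # the model learns as you go
            annotations.append((example, answer))
    return annotations
```

In the real tool the scores change after every update, so the questions adapt as you answer; in this static sketch they don't, which is the main simplification.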

prodigy dataset news_headlines "Annotate entities in news headlines"
✨ Created dataset 'news_headlines'.
prodigy ner.teach news_headlines en_core_web_sm "Silicon Valley" --api nyt
✨ Starting the web server on port 8080...
Open the app in your browser and start annotating!
If a Bubble Bursts in Palo Alto GPE, Does It Make a Sound?
source: The New York Times
Why not cloud?

Active learning works best when you have a lot of raw input to stream through the model, so that more informative examples can be chosen for annotation. The model must be updated during the annotation session, and the updates must be specific to each user. Solutions to these problems could surely be developed – but… why? As attractive as SaaS is to investors, it only makes sense if the hosted component is adding value, instead of removing it.

Prodigy comes with built-in recipes for training and evaluating text classification, named entity recognition, image classification and word vector models. There's also a neat built-in component for doing A/B evaluations, which we expect to be particularly useful for developing generative models and translation systems. To keep the system requirements to a minimum, data is stored in an SQLite database by default. It's easy to use a different SQL backend, or to specify a custom storage solution.

Prodigy's components: REST API, web app, command-line interface, data stream, database, model state and recipe.

The components are wired together into a recipe, by adding the @recipe decorator to any Python function. The decorator lets you invoke your function from the command line, as a prodigy subcommand. Recipes can start the web service by returning a dictionary of components. The recipe system provides a good balance of declarative and procedural approaches. If you just need to wire together built-in components, returning a Python dictionary is no more typing than the equivalent JSON representation. But a Python function also lets you implement more complicated behaviours, and reuse logic across your recipes.

recipe.py

import prodigy
import your_arbitrary_ETL_logic

@prodigy.recipe('custom_stream', dataset=("Dataset"), db=("Database"), label=("Label", "option"))
def custom_stream(dataset, db=None, label=''):
    DB = your_arbitrary_ETL_logic.load(db)
    return {
        'dataset': dataset,
        'stream': ({'text': row.text, 'label': label} for row in DB),
        'view_id': 'classification'
    }

When humans interact with machines, their experience decides the success of the interaction. Most annotation tools avoid making suggestions to the user, to avoid biasing the annotations. Prodigy takes the opposite approach: ask the user as little as possible. The more complicated the structure your model has to produce, the more benefit you can get from Prodigy's binary interface. The web app lets you annotate text, entities, classification, images and custom HTML tasks straight from your browser – even on mobile devices.

The Prodigy web application

Try the live demo

To see the web application and different annotation interfaces in action, check out the Prodigy live demo.

Human time and attention are precious. Instead of presenting the annotators with a span of text that contains an entity, asking them to highlight it, select one of many labels from a dropdown and confirm, you can break the whole interaction down into a simple binary decision. You'll have to ask more questions to get the same information, but each question will be simple and focused. You'll collect more user actions, giving you lots of smaller pieces to learn from, and a much tighter feedback loop between the human and the model.
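To make this concrete, here's a toy sketch of the idea (the labels and helper names are invented, not Prodigy's API): rather than one N-way choice, the annotator answers a sequence of yes/no questions, starting from the model's best guess.

```python
# Sketch: factoring a multi-label choice into binary questions.
# Instead of asking the annotator to pick one of N labels, we ask
# one accept/reject question per candidate label, most confident first.

def binary_questions(text, label_scores):
    """Yield (text, label) tasks, ordered by model confidence."""
    ranked = sorted(label_scores.items(), key=lambda kv: kv[1], reverse=True)
    for label, score in ranked:
        yield {'text': text, 'label': label}

def collect(text, label_scores, decide):
    """Return the first label the annotator accepts, or None."""
    for task in binary_questions(text, label_scores):
        if decide(task):          # a single accept/reject click
            return task['label']
    return None
```

Because the best guess comes first, a well-trained model usually needs only one click per example, while a wrong guess just costs a few extra simple questions.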

Most AI systems today rely on supervised learning: you provide labelled input and output pairs, and get a program that can perform analogous computation for new data. Supervised learning algorithms have been improving quickly, leading many people to anticipate a new wave of entirely unsupervised algorithms: algorithms so "advanced" they can compute whatever you want, without you specifying what that might be. This is like hoping for a programming language so advanced you don't even need to write a program.

Unsupervised algorithms return meaning representations, based on the internal structure of the data. By definition, you can't directly control what the process returns. Sometimes the meaning representation will directly address a useful question. If you're looking for suspicious activity on your platform, you might find that an outlier detection process is all you need. However, the unsupervised algorithm won't usually return clusters that map neatly to the labels you care about. With the right feature weightings, you might be able to come up with a model that sorts your data more usefully, but doing this by hand is unproductive: this is exactly the problem supervised learning is designed to solve.
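As a toy illustration of the suspicious-activity case above (the traffic numbers and threshold are invented, and a real system would use richer features), outlier detection needs no labels at all:

```python
# Toy z-score outlier detector: flag values far from the mean,
# with no labelled examples required. Illustrative only.
from statistics import mean, stdev

def outliers(values, threshold=2.0):
    """Return indices of values more than `threshold` std devs from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values)
            if abs(v - mu) > threshold * sigma]

requests_per_minute = [12, 10, 11, 9, 13, 250]   # one user is different
print(outliers(requests_per_minute))             # flags the last index
```

This works precisely because "unusual" is defined by the data's internal structure – which is also why it can't be steered towards the specific labels you care about.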

Text classification models can be trained to perform a wide variety of useful tasks, including sentiment analysis, chatbot intent detection, and flagging abusive or fraudulent content. One of the problems with text classification is that it's usually hard to guess how accurately the system will perform. Some problems turn out to be unexpectedly easy, while others are so difficult the intended functionality needs to be redesigned. Prodigy lets you perform very rapid prototyping, so that you can quickly find out which ideas are worth further exploration.

Workflow and data set

For more info on how to do text classification with Prodigy, see the detailed text classification workflow. You can also download the annotated data set we've created with Prodigy for this example.

Text classification really shines when the task would otherwise be performed by hand. For instance, we regularly categorise GitHub issues for our library, spaCy. Keeping the issue tracker tidy is something many open source projects struggle with – so automated tools could definitely be helpful. How easy would it be to create a bot to tag the issues automatically?

Prodigy is a Python library, so it's easy to stream in data from any source — all you have to do is create a generator that yields out your examples. Prodigy also includes several built-in API loaders, including one for the GitHub API. To get started, we'll want to search for a query that returns a decent number of documentation issues. The model can't know what we're looking for until we've said "yes" to some examples. To find a good query, it's useful to pipe the stream into less, so we can look at the results:

prodigy textcat.print-stream "docs" --api github --label DOCS | less -r
0.00 DOCS Describe streaming arguments in docs
0.00 DOCS Salesforce connector
0.00 DOCS Node.js: Produce docs files
0.00 DOCS Can't understand what is this page saying....
0.00 DOCS Unable to run rook with dataDirHostPath
0.00 DOCS Replace tutorials in the user guide with links to the guides
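As noted above, a stream is just a generator that yields dictionaries with a 'text' key, so plugging in your own data source takes a few lines. A minimal sketch (the file name and field name here are invented):

```python
# Sketch of a custom Prodigy stream: any generator of dicts with a
# 'text' key will do. Reads one JSON object per line (JSONL format).
import json

def stream_from_jsonl(path):
    """Yield annotation tasks from a JSONL file of issue records."""
    with open(path, encoding='utf8') as f:
        for line in f:
            record = json.loads(line)
            yield {'text': record['title']}
```

Because it's a generator, the stream is lazy: Prodigy can pull examples one batch at a time without loading the whole corpus into memory.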

Now it's time to start annotating. We first initialise a new dataset, adding a quick description for future reference. The next command starts the annotation server. The textcat.teach subcommand tells Prodigy to run the built-in recipe function teach(), using the rest of the arguments supplied on the command line.

Custom recipes

The subcommand system is fully extensible. All you have to do is add the @recipe decorator to your function, and you'll be able to call it from the command line. To start the annotation server, your recipe just has to return a dictionary of components, like the stream of examples, the annotation interface, and optional callbacks to update and save your model.

prodigy dataset gh_issues "Classify issues on GitHub"
✨ Created dataset 'gh_issues'.
prodigy textcat.teach gh_issues en_core_web_sm "docs" --api github --label DOCS
✨ Starting the web server on port 8080...

Opening localhost:8080, we get a sequence of recent GitHub issue titles, displayed with our category as the title. If the category is correct, click accept, press a, or swipe left on a touch interface. If the category does not apply, click reject, press x, or swipe right. Some examples are unclear or exceptions that you don't want the model to learn from. In these cases, you can click ignore or press space.

DOCS
Improve documentation for service and token urls
DOCS
Native library fails to load in docker container

Prodigy trains a model during annotation, on the answers you're providing. This lets Prodigy rank the examples in the stream, to ask less redundant questions. Learning from streaming data is a tricky problem, so we can usually get better results by training a new model from scratch, once all the annotations are collected. This also lets us study the model in more detail, and try different hyper-parameters.

After around 40 minutes of annotating the stream of issue titles for the search queries "docs", "documentation", "readme" and "instructions", we end up with a total of 830 annotations that break down as follows:

Decision  Count
accept      261
reject      525
ignore       44
Total       830
prodigy textcat.print-dataset gh_issues | less -r
0.63 DOCS Describe streaming arguments in docs
0.56 DOCS Salesforce connector
0.39 DOCS Node.js: Produce docs files
0.12 DOCS Can't understand what is this page saying....
0.17 DOCS Unable to run rook with dataDirHostPath

By default, Prodigy uses spaCy v2.0's new text classification system (currently in alpha). The model is a convolutional neural network stacked with a unigram bag-of-words. The bag-of-words model learns quickly, while the convolutional network lets the model pick up cues from longer phrases, once a few hundred examples are available.

Using a different text classification strategy with Prodigy is very easy. If you want to keep using spaCy, you can simply pass a new model instance to the TextClassifier component. For an entirely custom NLP solution, you only need to provide two functions: one which assigns scores to the text, and another which updates the model on a new batch of examples. If your text classification solution only supports batch training, you can use the built-in model during annotation, and then export the annotations to train your solution separately.
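As a toy illustration of that two-function contract (the class and method names here are hypothetical, not Prodigy's actual interface), even a trivial keyword model fits the shape:

```python
# Sketch of plugging in a custom classifier via two callbacks
# (hypothetical names; the real signatures live in the Prodigy docs).
# `predict` scores a stream of tasks, `update` learns from answers.

class KeywordModel:
    """Toy classifier: score by keyword overlap, grow the keyword set."""

    def __init__(self, keywords):
        self.keywords = set(keywords)

    def predict(self, stream):
        """Yield (score, task) pairs for a stream of annotation tasks."""
        for task in stream:
            words = set(task['text'].lower().split())
            score = len(words & self.keywords) / max(len(words), 1)
            yield score, task

    def update(self, answers):
        """Learn from (task, answer) pairs collected in the app."""
        for task, answer in answers:
            if answer == 'accept':
                self.keywords.update(task['text'].lower().split())
```

Anything that can score examples and learn from answers – a scikit-learn pipeline, a PyTorch network, a rule engine – can slot into the same loop.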

Within the first hour of annotation, the system classified 140 out of the 156 evaluation issues correctly. To put this into some context, we have to look at the class balance of the data. In the evaluation data, 65% of the examples were labelled reject, i.e. they were tagged as not documentation issues. This gives a baseline accuracy of 65%, which the classifier easily exceeded. We can get some sense of how the system will improve as more data is annotated by retraining the system with fewer examples.

prodigy textcat.train-curve gh_issues --label DOCS --eval-split 0.2
%     ACCURACY
25%   0.73   +0.73
50%   0.82   +0.09
75%   0.84   +0.02
100%  0.87   +0.03
Interpreting the curve

Each row of the table shows an experiment where the model was evaluated on 20% of the data, and trained with a subset of the remaining examples. This lets you see the relationship between the data set size and accuracy, so you can predict how much accuracy might improve if you collect more annotations.

The table shows the accuracy achieved with 25%, 50%, 75% and 100% of the training data. The last 25% of the training data brought a 3% improvement in accuracy, indicating that further training will improve the system. Similar logic is used to estimate the progress indicator during training.
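The train-curve logic itself is simple to sketch. Here `train` and `evaluate` are placeholders for your actual model code, and the shuffling assumes the examples are independent:

```python
# Sketch of the train-curve logic: hold out an evaluation split, then
# retrain on growing fractions of the remaining data and compare the
# accuracies. `train` and `evaluate` are user-supplied callbacks.
import random

def train_curve(examples, train, evaluate, eval_split=0.2,
                fractions=(0.25, 0.5, 0.75, 1.0)):
    """Return a list of (fraction, accuracy) pairs."""
    random.shuffle(examples)
    n_eval = int(len(examples) * eval_split)
    eval_set, train_set = examples[:n_eval], examples[n_eval:]
    curve = []
    for frac in fractions:
        subset = train_set[:int(len(train_set) * frac)]
        model = train(subset)                    # retrain from scratch
        curve.append((frac, evaluate(model, eval_set)))
    return curve
```

If the accuracy is still climbing between the last two fractions, collecting more annotations is likely to keep paying off.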

After training the model, Prodigy outputs a ready-to-use spaCy model, making it easy to put into production. spaCy comes with a handy package command that converts a model directory into a Python package, allowing the data dependency to be specified in your requirements.txt. This gives a smooth path from prototype to production, making it easy to really test the model, in the context of a larger system.

prodigy textcat.batch-train gh_issues /tmp/gh_docs --label DOCS
spacy package /tmp/gh_docs /tmp
cd /tmp/gh_docs
python setup.py sdist
pip install dist/gh_docs-1.0.0.tar.gz
Installed package gh_docs.
Usage in spaCy v2.0 alpha+

import gh_docs
nlp = gh_docs.load()
texts = ['missing documentation',
         'docker container not loading',
         'installation not working on windows']
for doc in nlp.pipe(texts):
    print(doc.cats)

# {'DOCS': 0.9812000393867493}
# {'DOCS': 0.005252907983958721}
# {'DOCS': 0.0033084796741604805}

If annotation projects are expensive to start, you have to guess which ideas look promising. These guesses will often be wrong, because it's difficult to predict the performance of a statistical model before the data has been collected. Prodigy helps you break through this bottleneck by dramatically reducing the cost of investigating new ideas. The whole annotation process is cheaper with Prodigy, but it's the time-to-first-evidence that's most important. There's no shortage of ideas that would be incredibly valuable if they could be made to work. The shortage is in time to investigate those opportunities – which is exactly what Prodigy gives you more of.

Apply for the Prodigy beta

The beta program is free – you'll only have to answer a few questions upfront, and give us some feedback afterwards. We're looking for a good mix of beta testers from different backgrounds and with different levels of annotation experience. Please understand that we only have limited spots available — but everyone who applies will receive an exclusive discount of 20% when Prodigy v1.0 becomes available. For more details, see the Prodigy website.
About the Author

Matthew Honnibal

Matthew is a leading expert in AI technology, known for his research, software and writings. He completed his PhD in 2009, and spent a further 5 years publishing research on state-of-the-art natural language understanding systems. Anticipating the AI boom, he left academia in 2014 to develop spaCy, an open-source library for industrial-strength NLP.
About the Author

Ines Montani

Ines is a developer specialising in web applications for AI technology. She's a core developer of the spaCy Natural Language Processing library and Prodigy, an annotation tool for radically efficient machine teaching. Before founding Explosion AI, she was a freelance front-end developer and strategist, drawing on four years of executive experience in ad sales and digital marketing.
