
Serverless custom NLP with LLMs, Modal and Prodigy

Human-in-the-loop distillation with Large Language Models (LLMs) provides a scalable way to leverage use-case-specific data to build custom information extraction systems. But compared to an entirely prompt-based workflow, there’s still a major bottleneck: you actually need to create high-quality data, and you need to train a model, ideally on a GPU. So how can we make this easier?

There’s no push-button one-size-fits-all solution to data creation. It’s a development activity, so what you need are interactive tools. Our annotation tool Prodigy gives you a great suite of interfaces and utilities for this, and it brings the functionality to your local development environment. However, you don’t always want to run everything locally. This is where Modal comes in.

Modal is a serverless cloud platform that provides high-performance and on-demand computing for AI models, large batch jobs and anything else you want to run in the cloud, without having to worry about infrastructure. It’s fully scriptable in Python and comes with a first-class developer API and CLI, so you can easily deploy any code you’re working on.

In this blog post, we’ll show you how you can go from an idea and little data to a fully custom information extraction model using Prodigy and Modal, no infrastructure or GPU setup required. We’ll use a Large Language Model (LLM) to automatically create annotations for us, correct mistakes to improve the predictions, and use transfer learning to fine-tune our own task-specific component.

To celebrate the release of the Modal plugin, we’re offering a limited discount for Prodigy 🎉 Get 20% off the Company License (includes the plugin) with the code PRODIGYMODAL20 at checkout. Valid until the end of November 2024.

Getting started

After installing Prodigy, the plugin and the Modal client, you can run python -m modal setup to authenticate. To persist any data we create and access it both locally and in the cloud, we also need to set up a database. Modal itself doesn’t provide database services, but Neon is similarly quick to set up and gives you an instantly provisioned serverless PostgreSQL instance. If you already have a remote database set up elsewhere, you can of course use that instead.

In your prodigy.json, you can then configure Prodigy to use it everywhere. If you don’t yet have a Postgres driver installed locally, you can do so by running pip install psycopg2-binary.

prodigy.json

{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "neondb",
      "user": "neondb_owner",
      "password": "XXXXXX",
      "host": "your-neon-host.aws.neon.tech"
    }
  }
}
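
To confirm that Prodigy can talk to the new database, you can run a quick sanity check from Python. This is just a sketch using Prodigy’s database API; connect() reads the settings from your prodigy.json.

Check the database connection (sketch)

from prodigy.components.db import connect

# connect() uses the "db" and "db_settings" values from prodigy.json
db = connect()
print(db.datasets)  # names of existing datasets (likely empty at this point)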

Example use case

For the examples in this post, we’ll be using a dataset of GitHub issues scraped from the GitHub API to build a classifier to predict whether the issue is about DOCUMENTATION or a BUG. However, if you’re following along and want to try it out for yourself, feel free to adapt it to use your own labels and data.

gh_issues_raw.jsonl (excerpt)

{"text":"# Add Documentation\n\n","html":"<h1>Add Documentation</h1>\n","title":"Add Documentation","body":"","meta":{"source":"GitHub","url":"https://github.com/unicornsden/pixie/issues/9"}}
{"text":"Samples erroring when loaded locally","html":"","title":"","body":"","meta":{"source":"GitHub","url":"https://github.com/beakable/isometric/issues/20"}}
{"text":"# UPDATE CRONTAB\n\nI am running a process via crontab once a day , \r\nnow I want to run another process once a day , \r\n\r\nhow do i update my crontab on dokku ?\r\n\r\nroot@AmzBotD:~# dokku run test1 crontab -l\r\nno matching process entry found\r\nno crontab for herokuishuser\r\n\r\n","html":"<h1>UPDATE CRONTAB</h1>\n\n<p>I am running a process via crontab once a day , \nnow I want to run another process once a day , </p>\n\n<p>how do i update my crontab on dokku ?</p>\n\n<p>root@AmzBotD:~# dokku run test1 crontab -l\nno matching process entry found\nno crontab for herokuishuser</p>\n","title":"UPDATE CRONTAB","body":"I am running a process via crontab once a day , \r\nnow I want to run another process once a day , \r\n\r\nhow do i update my crontab on dokku ?\r\n\r\nroot@AmzBotD:~# dokku run test1 crontab -l\r\nno matching process entry found\r\nno crontab for herokuishuser\r\n\r\n","meta":{"source":"GitHub","url":"https://github.com/dokku/dokku/issues/3638"}}
Download gh_issues_raw.jsonl
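
If you’d like to build a similar dataset from your own repositories, a minimal sketch along these lines works with the public GitHub REST API and the requests library. The repository name and output path are placeholders, and the optional html field from the excerpt above is omitted for brevity.

Fetch issues from the GitHub API (sketch)

import json
import requests

repo = "your-org/your-repo"  # placeholder: the repository to fetch issues from
resp = requests.get(
    f"https://api.github.com/repos/{repo}/issues",
    params={"state": "open", "per_page": 100},
    headers={"Accept": "application/vnd.github+json"},
)
resp.raise_for_status()

with open("gh_issues_raw.jsonl", "w", encoding="utf8") as f:
    for issue in resp.json():
        if "pull_request" in issue:  # the issues endpoint also returns pull requests
            continue
        title = issue["title"]
        body = issue.get("body") or ""
        record = {
            "text": f"# {title}\n\n{body}",
            "title": title,
            "body": body,
            "meta": {"source": "GitHub", "url": issue["html_url"]},
        }
        f.write(json.dumps(record) + "\n")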

Creating the dataset

Prodigy’s design is centered around “recipes”, Python functions that orchestrate the annotation and data workflows. This scriptability enables powerful automation: with better and better general-purpose models available, we don’t need humans to perform boring click work. We also don’t necessarily need “big data” – a small, high-quality dataset and BERT-sized embeddings can achieve great performance at low operational overhead.
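
The built-in recipes used in this post cover the whole workflow, but to make the idea concrete, here’s a rough sketch of what a custom classification recipe could look like. The recipe name, labels and file paths are just for illustration; it uses Prodigy’s JSONL loader and the choice interface.

Minimal custom recipe (sketch)

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "issue-triage",  # illustrative recipe name
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to a JSONL file of raw examples", "positional", None, str),
)
def issue_triage(dataset, source):
    # Stream in the raw examples and add the label options to each task
    options = [{"id": label, "text": label} for label in ("DOCUMENTATION", "BUG")]
    stream = ({**eg, "options": options} for eg in JSONL(source))
    return {
        "dataset": dataset,                      # where annotations are stored
        "stream": stream,                        # the examples to annotate
        "view_id": "choice",                     # multiple-choice interface
        "config": {"choice_style": "multiple"},  # allow selecting both labels
    }

You could then start the server with prodigy issue-triage my_dataset ./data.jsonl -F recipe.py, where the dataset name and file paths are placeholders for your own.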

Instead of using a big general-purpose LLM at runtime for a specific, self-contained task, you can move the dependency to the development process and use it to create data for a smaller, faster and more accurate model you can run in-house. The more you can let the model do for you, the better. As shown in our recent case study with S&P Global, their team was able to achieve F-scores of up to 99% at 16k words per second with only 15 person hours of data development work including training and evaluation. At PyData NYC, we only needed 8 person hours to beat the few-shot GPT baseline. And with better models and smarter workflows, we’ll likely see these numbers go even lower in the future.

To set up the automated labelling, we can create a .env file for the required keys (in this case, Prodigy and OpenAI) and an llm.cfg config file for the LLM and information extraction task. Here we can provide both the label names, as well as optional label definitions to give the prompt more context.

.env

PRODIGY_LICENSE_KEY="XXXX-XXXX-XXXX-XXXX"
PRODIGY_LOGGING="basic"
OPENAI_API_ORG="org-..."
OPENAI_API_KEY="sk-..."

assets/llm.cfg (excerpt)

[components.llm.task]
@llm_tasks = "spacy.TextCat.v3"
labels = ["DOCUMENTATION", "BUG"]
exclusive_classes = false
[components.llm.task.label_definitions]
DOCUMENTATION = "Issue about technical documentation"
BUG = "Issue about a fault in the software"
[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.3}
Download llm.cfg
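
Before running the full pre-annotation job, it can be useful to try the config on a single example. Below is a minimal sketch using spacy-llm’s assemble helper; it assumes the complete llm.cfg (not just the excerpt above) with its nlp and components blocks, and that your OpenAI credentials are available in the environment, e.g. via dotenv run.

Try out the LLM config locally (sketch)

from spacy_llm.util import assemble

# Build the pipeline described in llm.cfg and classify one example
nlp = assemble("./data/llm.cfg")
doc = nlp("The login button crashes the app on Android 14")
print(doc.cats)  # one score per label, e.g. for DOCUMENTATION and BUG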

Auto-generating the annotations

Prodigy’s textcat.llm.fetch recipe now pre-annotates the raw input data and saves the structured results to the dataset gh_issues_docs. Datasets can be loaded back in for review and corrections, and also used out-of-the-box to train, fine-tune and evaluate models.

Pre-compute annotations from an LLM locally

dotenv run -- prodigy textcat.llm.fetch ./data/llm.cfg ./data/gh_issues_raw.jsonl dataset:gh_issues_docs

This is a perfect use case for Modal, since it’s a long-running process that’s inconvenient to keep on your local machine, but it doesn’t require expensive compute resources. With modal.run, you can outsource it to the cloud!

Pre-compute annotations from an LLM in the cloud

dotenv run -- prodigy modal.run "textcat.llm.fetch ./data/llm.cfg ./data/gh_issues_raw.jsonl dataset:gh_issues_docs" --assets ./data --detach

The --assets argument lets you provide a directory of files required for the workflow, e.g. the raw input data, models or code for custom recipes, and in this case, the LLM config file. Setting --detach ensures that the process keeps running, even if you close your local terminal. The annotations created by the model are saved to the dataset gh_issues_docs and stored in the remote PostgreSQL database, which you can also access locally.

The running process in the Modal dashboard

By running db-out, we can inspect the created structured data. Each example contains the available options, as well as the accepted labels.

Pre-annotated example (excerpt)

{
  "text": "# Add Documentation\n\n",
  "html": "<h1>Add Documentation</h1>\n",
  "title": "Add Documentation",
  "options": [
    {"id": "BUG", "text": "BUG"},
    {"id": "DOCUMENTATION", "text": "DOCUMENTATION"}
  ],
  "accept": ["DOCUMENTATION"]
}
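
If you prefer working in Python over the command line, the same data is available via Prodigy’s database API, which is handy for quick checks like the label distribution before reviewing. A small sketch, assuming the database settings from prodigy.json above:

Inspect the dataset in Python (sketch)

from collections import Counter
from prodigy.components.db import connect

# Load the pre-annotated examples and count the accepted labels
db = connect()
examples = db.get_dataset("gh_issues_docs")
label_counts = Counter(label for eg in examples for label in eg.get("accept", []))
print(f"{len(examples)} examples: {label_counts}")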

Improving data and model quality

We’ve now used a model to automatically create data for us, and depending on the LLM, use case and labels, we may get decent-quality data out of the box. But chances are the model has made mistakes, which we don’t want to replicate in our custom distilled component. We all know that we should spend more time with our data, and Prodigy actually makes this achievable by giving you efficient workflows to try out ideas before you scale up the process.

Since we have local access to the database, we can load the labelled data back in with Prodigy and view it in the browser. The textcat.manual recipe lets you stream in the data with the model’s predictions pre-selected.

Correcting pre-computed annotations locally

prodigy textcat.manual gh_issues_docs_reviewed dataset:gh_issues_docs --label DOCUMENTATION,BUG

You can now make corrections if needed by clicking the labels or using the keyboard shortcuts 1 and 2, and hit the accept button or A key to move to the next example. The examples are then saved to a new and improved dataset gh_issues_docs_reviewed.

Example with pre-selected options in the Prodigy annotation interface

Optional: Deploying the annotation server

Under the hood, Prodigy starts a web server, which you can run locally on your machine, serve on an internal network (great for privacy-sensitive use cases) or deploy to the cloud. Modal makes it easy to host the annotation app for others on your team with a single command, and you can optionally configure basic authentication or Single Sign-On (SSO) for different users if needed.

Deploy app to correct pre-computed annotations in the cloud

dotenv run -- prodigy modal.deploy "textcat.manual gh_issues_docs_reviewed dataset:gh_issues_docs --label DOCUMENTATION,BUG"

Training the component on GPU

For this example, we’ll use RoBERTa-base to initialize the model, which gives us a good basis for higher accuracy when fine-tuning the task-specific component on top of it. You can of course customize the transformer embeddings and hyperparameters via the spaCy training config. The train command will then train the component using the provided dataset.

Using modal.run, you can send this process off to the cloud with --require-gpu, optionally specifying the GPU type with --modal-gpu to automatically provision a GPU machine with the required dependencies for you. Setting --detach ensures that the process keeps running, even if you close your local terminal.

Fine-tune a model on the data

dotenv run -- prodigy modal.run "train --textcat-multilabel gh_issues_docs_reviewed --config ./data/roberta.cfg --gpu-id 0" --assets ./data --require-gpu --modal-gpu a100 --detach
Download roberta.cfg

In the Modal dashboard, you can view all running apps and logs, including the training progress.

Logs (excerpt)

Training: 883 | Evaluation: 220
Labels: textcat (2)
ℹ Pipeline: ['transformer', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS TEXTC...  CATS_SCORE  SCORE
---  ------  -------------  -------------  ----------  ------
  0       0           0.00           0.49       48.65    0.49
  5     200         192.03         124.52       90.06    0.90
 11     400          55.19          34.03       89.80    0.90
 16     600          37.94          30.11       91.09    0.91

By default, a Modal volume named prodigy-vol is created at the remote path /vol/prodigy_data/, and any models you train are stored in a models folder. The Modal CLI lets you interact with your volumes and download any created files locally:

Download model from Modal volume

modal volume get prodigy-vol models/model-best ./
⠙ Downloading file(s) to local...
Downloading file(s) to local... 0:00:11 ━━━━━━━━━━━━━━━ (12 out of 13 files completed)
model-best/transformer/model ━━━━━━━━━━━━━━━ 0.0% • 0.0/502.0 MB • ? • -:--:--

The result is a 513 MB task-specific model you can easily run and deploy yourself:

Use the model

import spacy
nlp = spacy.load("./model-best")
doc = nlp("Add an Instructions block to the top of all new projects")
print(doc.cats)
# {'BUG': 0.00015601300401613116, 'DOCUMENTATION': 0.9999405145645142}
doc = nlp("Update the Java/Android library to parallel the .NET version")
print(doc.cats)
# {'BUG': 0.008889737538993359, 'DOCUMENTATION': 1.9018450984731317e-05}
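
Because the component was trained as a multi-label classifier, the scores in doc.cats are independent per label. To turn them into final labels, you can apply a threshold; the 0.5 below is just a starting point you’d tune on your evaluation data.

Turn scores into labels

# Example threshold; tune this on held-out evaluation data
threshold = 0.5
texts = [
    "Add an Instructions block to the top of all new projects",
    "Update the Java/Android library to parallel the .NET version",
]
for doc in nlp.pipe(texts):
    labels = [label for label, score in doc.cats.items() if score >= threshold]
    print(doc.text, "->", labels)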

Using LLMs to create data for task-specific models is currently a pretty underrated usage pattern that holds a lot of potential for building more resilient and transparent NLP applications, without requiring expensive cloud resources or labour-intensive annotation.

We’ll continue working on better developer tooling around these workflows, and with new models and more convenient infrastructure tools like Modal, we’ll likely see applied NLP becoming even more efficient in the future. If you end up trying out Prodigy and Modal for your use case, let us know how it goes on the forum!

Resources