Ines Montani on Natural Language ProcessingSoftware Engineering RadioInes speaks with host Jeremy Jung about solving problems using natural language processing. They cover generative vs. predictive tasks, creating a pipeline and breaking down problems, labeling examples for training, fine-tuning models, using LLMs to label data and build prototypes, and the spaCy NLP library.
Zero-Shot NER with GliNER and spaCy Python Tutorials for Digital HumanitiesTutorial by WJB Mattingly on how to integrate the generalist GLiNER model for Named Entity Recognition with spaCy's versatile NLP environment.
Describing Images Fast and Slow: Quantifying and Predicting the Variation in Human Signals during Visuo-Linguistic ProcessesTakmaz, Pezzelle, Fernández (2024)We use the spaCy library for tokenization, part-of-speech tagging, and lemmatization of the words in the descriptions.
Prodigy-Segment for Pixel SegmentationUse Meta’s “Segment Anything” model in Prodigy to help you select the right pixels in images.
Neuradicon: operational representation learning of neuroimaging reportsWatkins, Gray, Julius, Mah, Pinaya, Wright, Jha, Engleitner, Cardoso, Ourselin, Rees, Jaeger, Nachev (2023)Labelled data for each task was produced using the Prodigy labelling tool. Each report was labelled in a paired-annotation manner. [...] We used the grammatical dependency parse produced by the spaCy parser as input and implemented the patterns using the spaCy dependency matcher.
calamanCy: A Tagalog Natural Language Processing ToolkitMiranda (2023), EMNLP 2023We introduce calamanCy, an open-source toolkit for constructing NLP pipelines for Tagalog. It is built on top of spaCy, enabling easy experimentation and integration with other frameworks.
Half hour of labeling power: Can we beat GPT?PyData NYCLarge Language Models (LLMs) offer a lot of value for modern NLP and can typically achieve surprisingly good accuracy on predictive NLP tasks. But can we do even better than that? In this workshop we show how to use LLMs at development time to create high-quality datasets and train specific, smaller, private and more accurate models for your business problems.
How many Labelled Examples do you need for a BERT-sized Model to Beat GPT-4 on Predictive Tasks?Generative AI SummitHow does in-context learning compare to supervised approaches on predictive tasks? How many labelled examples do you need on different problems before a BERT-sized model can beat GPT-4 in accuracy? The answer might surprise you: models with fewer than 1b parameters are actually very good at classic predictive NLP, while in-context learning struggles on many problem shapes.
✨ prodigy v1.14.5Oct 24, 2023Toggle for character vs. token highlighting, CSS and JS from local and remote paths
🤖 curated-transformers v1.3.0Oct 2, 2023Custom model repositories, NVTX Ranges, store config in models
MP Interests Tracker: Utilising GenAI to uncover insights in the UK Register of Financial InterestJournalismAI BlogProject from teams at The Times and BBC using spacy-llm to make complex financial interests data more accessible.
Models as annotators in ProdigyHow to use models and LLMs as annotators to find disagreements and prioritize examples to annotate first.
The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMsQCon London
Constructing a knowledge base with spaCy and spacy-llmMantisNLP BlogThis blog post shows how to use spaCy and LLMs to extract entities and relationships from text and quickly tackle the complex problem of constructing a knowledge base graph from a corpus.
KAZU v1.5A biomedical NLP framework designed to handle production workloads, built by AstraZeneca and Korea University and using spaCy under the hood.
Herding LLMs Towards Structured NLPGlobal AI ConferenceThis talk shows how we integrate LLMs into spaCy, leveraging its modular and customizable framework. This allows for cheaper, faster and more robust NLP - driven by cutting-edge LLMs, without compromising on having structured, validated data.
Who said what: using machine learning to correctly attribute quotesThe Guardian Engineering BlogHow the Guardian uses spaCy and Prodigy to train a custom coreference resolution model.
Developing a Named Entity Recognition Dataset for TagalogMiranda (2023), IJCNLP-AACL 2023We used Prodigy as our annotation tool. We set up a web server on the Google Cloud Platform and routed the examples through Prodigy’s built-in task router.
GERNERMED++: Semantic annotation in German medical NLP through transfer-learning, translation and word alignmentFrei, Frei-Stuber, Kramer (2023), Journal of Biomedical InformaticsThe training of our entity recognition model employs the entity recognition parser from the spaCy library which follows a transducer-based parsing approach with a BILOU scheme instead of a state-agnostic token tagging approach.
Prodigy-PDF for PDF annotation and OCRWant to annotate PDF files? Our new Prodigy plugin can help with that! To explain how to use PDF segmentation and OCR, Vincent made a small demo video.
Panel: Large Language ModelsBig PyData BBQwith Ines, Alejandro Saucedo (Zalando, Institute for Ethical AI & ML), Alina Lehnhard (Cerence), Michael Gerz (Heidelberg University), Alexander CS Hendorf (Königsweg)
The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMsPyCon Lithuania KeynoteWith the latest advancements in NLP and LLMs, and big companies like OpenAI dominating the space, many people wonder: Are we heading further into a black box era with larger and larger models, obscured behind APIs controlled by big tech monopolies?
T-RAG: Lessons from the LLM TrenchesFatehkia, Lucas, Chawla (2024)An important application area is question answering over private enterprise documents where the main considerations are data security, which necessitates applications that can be deployed on-prem, [and] limited computational resources. [...] In addition to retrieving contextual documents, we use the spaCy library with custom rules to detect named entities from the organization.
spacy-llm: From quick prototyping with LLMs to more reliable and efficient NLP solutionsAstraZeneca NLP Community of PracticeLLMs are paving the way for fast prototyping of NLP applications. Here, Sofie showcases how to build a structured NLP pipeline to mine clinical trials, using spaCy and spacy-llm. Moving beyond a fast prototype, she offers pragmatic solutions to make the pipeline more reliable and cost efficient.
On the Creation of Classifiers to Support Assessment of E-PortfoliosGantikow, Isking, Libbrecht, Müller, Rebholz (2023)In this workflow, Prodigy selects and presents text examples that were classified with a very low degree of certainty. The annotator reviews the proposed classifications and corrects them, if necessary.
Launching the Explosion Merch StoreSpread the love and support us and our open-source work with some of our unique, custom-designed swag. All orders come with free shipping and stickers!
Introducing Prodigy-HFHugging Face BlogLast week, Explosion introduced Prodigy-HF, a new Prodigy plugin offering code recipes that directly integrate with the Hugging Face stack.
Prodigy-ANN for Image Retrieval via CLIPDealing with a huge bucket of images that you want to annotate? The new image retrieval features in Prodigy-ANN (approximate nearest neighbors) might help!
Toward a Critical Toponymy Framework for Named Entity Recognition: A Case Study of Airbnb in New York CityBrunila, LaViolette, CH-Wang, Verma, Féré, McKenzie (2023), EMNLP 2023All annotation was performed using Prodigy following an initial training session where annotators collaboratively annotated a randomly chosen set of samples.
✨ prodigy v1.14.3Oct 6, 2023Inter-annotator agreement for document-level and token-level annotations, new plugins
scispacy v0.5.3A Python package containing spaCy models for processing biomedical, scientific or clinical text, developed by AI2.
Designing for tomorrow’s programming workflowsPyCon LithuaniaModern editors and AI-powered tools like GitHub Copilot and ChatGPT are changing how people program and are transforming our workflows and developer productivity. But what does this mean for how we should be writing and designing our APIs and libraries?
How Nesta uses NLP to process 7m job ads and shed light on the UK’s labor marketA case study on Nesta’s workflow for extracting 7 million job ads to better understand UK skill demand, using a custom mapping step to match skills to any government taxonomy.
Microsoft Presidio v2.2.352Context aware, pluggable and customizable PII de-identification and anonymization service for text and images, featuring a spaCy back-end.
Prodigy in 2023: LLMs, task routers, QA and pluginsWe have made a ton of new updates in Prodigy this year with v1.12, v1.13, and v1.14 releases. So we decided to write a post about them.
Impoliteness and morality as instruments of destructive informal social control in online harassment targeting Swedish journalistsBjörkenfeldt, Gustafsson (2023)In the annotation tool Prodigy used for this process, the tweets directed towards journalists were displayed alongside the initial tweet that initiated the conversation thread and the subsequent reply from the journalist.
State-of-the-Art Transformer Pipelines in spaCyaiGrunnIn this talk, we will show you how you can use transformer models (from pretrained models such as XLM-RoBERTa to large language models like Llama2) to create state-of-the-art annotation pipelines for text annotation tasks such as named entity recognition.
Identifying Signs and Symptoms of Urinary Tract Infection from Emergency Department Clinical Notes Using Large Language ModelsIscoe, Socrates, Gilson, Chi, Li, Huang, Kearns, Perkins, Khandjian, Taylor (2023)For annotation we employed Prodigy, a scriptable annotation tool designed to maximize efficiency, enabling data scientists to perform the annotation tasks themselves and facilitating rapid iterative development in natural language processing (NLP) projects.
Newsletter September 2023The latest edition of our newsletter, featuring our plans for premium models, LLMs, chain-of-thought prompting, upcoming events and talks, and exciting new Prodigy features. Plus exclusive discounts!
🦙 spacy-llm v0.5.0Sep 8, 2023Improved user API and novel Chain-of-Thought prompting for more accurate NER
Into the Single Cell Multiverse: an End-to-End Dataset for Procedural Knowledge Extraction in Biomedical TextsDannenfelser, Zhong, Zhang, Yao (2023), NeurIPSTissue, cell type, tool, and method were annotated using the Prodigy software tool developed by Explosion AI for easy tracking of token-level tags.