Explosion · Developer tools and consulting for AI, Machine Learning and NLP

Explosion builds developer tools for AI, Machine Learning and Natural Language Processing.

How AI is reshaping IT skills

connect professional

German article featuring Ines’ take on the impact of AI on future-proof skills for IT professionals.

E^2GraphRAG: Streamlining Graph-based RAG for High Efficiency and Effectiveness

Zhao, Zhu, Guo, He, Li (2025)

Instead of using LLMs for entity extraction, we employ the traditional NLP tool spaCy to extract entities, and use their co-occurrence in a chunk as relations.
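
The co-occurrence idea in the quote above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the paper's implementation: the function name `cooccurrence_relations` and the sample data are hypothetical, and the per-chunk entity lists are assumed to have been extracted already (for example from spaCy's `doc.ents`).

```python
from collections import Counter
from itertools import combinations


def cooccurrence_relations(chunk_entities):
    """Count how often each unordered entity pair appears in the same chunk.

    chunk_entities is a list with one list of entity strings per chunk.
    This input format is an assumption made for this sketch; with spaCy it
    could be built as [ent.text for ent in doc.ents] for each chunk.
    """
    relations = Counter()
    for entities in chunk_entities:
        # Deduplicate and sort so every pair has a single canonical order.
        unique = sorted(set(entities))
        for left, right in combinations(unique, 2):
            relations[(left, right)] += 1
    return relations


# Hypothetical sample data: three chunks with pre-extracted entities.
chunks = [
    ["Alice", "Bob", "Paris"],
    ["Alice", "Paris"],
    ["Bob"],
]
relations = cooccurrence_relations(chunks)
print(relations[("Alice", "Paris")])  # 2
print(relations[("Alice", "Bob")])    # 1
```

Each pair count can then serve as an edge weight in the graph; a chunk with a single entity contributes no relations.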

Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content

Rizvi, Navojith, Adhikari, Senevirathna, Kasthurirathna, Abeywardhana (2025)

Keyword extraction in English is performed with a hybrid approach comprising a fine-tuned spaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank, which results in a combined accuracy of 91.2%.

How Love Without Sound helps the music industry recover millions in revenue for artists with NLP, spaCy and Prodigy

A case study on Love Without Sound’s innovative AI-powered tools for the music industry and law firms specializing in royalty negotiations.

📚 spacy-layout v0.0.12 (Mar 8, 2025)

Support processing PDFs with context, add document index tables and more docs

What the history of the web can teach us about the future of AI

How will AI development look in the future? There is a lot we can learn from another groundbreaking technology: the web. This blog post takes a look at what the history of the web can teach us, and what this means for developers, models, open source and regulation.

Streaming spaCy

Join spaCy author and core developer Matt as he works on the library, develops features and fixes bugs, while chatting about all things NLP and open source. Every Thursday at 2pm CET and Friday at 11am CET.

Recognising non-named spatial entities in literary texts: a novel spatial entities classifier

Kababgi, Grisot, Pennino, Herrmann (2024)

In this paper, we present a case study on the prediction of what we call ‘non-named spatial entities’ (NNSE) in a historical corpus of Swiss-German novels using a deep learning model in conjunction with BERT and Prodigy.

🔌 prodigy-pdf v0.4.0 (Nov 25, 2024)

Add text-based span annotation for PDFs

Serverless custom NLP with LLMs, Modal and Prodigy

In this blog post, we’ll show you how you can go from an idea and little data to a fully custom information extraction model using Prodigy and Modal, no infrastructure or GPU setup required.

✨ prodigy v1.16.0 (Oct 22, 2024)

Modal plugin for on-demand deployment, cross-platform wheels and UI fixes

The 100 who are shaping AI in Europe

Ines is featured among the top 100 individuals who are shaping Artificial Intelligence in Europe, compiled by French newspaper l’Opinion.

How GitLab uses spaCy to analyze support tickets and empower their community

A case study on GitLab’s large-scale NLP pipelines for extracting actionable insights from support tickets and usage questions.

The NLP and AI Revolution with the spaCy Creators

Vanishing Gradients

In this interview with Hugo Bowne-Anderson, we delve into the forefront of NLP and the future of AI development, covering topics like human-in-the-loop distillation, open-source AI and Explosion’s journey.

Back to our roots: Company update and future plans

We’re back to running Explosion as a smaller, independent-minded and self-sufficient company. spaCy and Prodigy will stay stable and sustainable, maintained by their original authors. We’ll keep updating our stack with the latest technologies, without changing its core identity or purpose.

Once a Maintainer: Sofie Van Landeghem

Interview with Sofie about her work as a core maintainer of spaCy, the evolution of NLP, and why dependency management in Python is so terrible.

How to uncover and avoid structural biases in evaluating your Machine Learning/NLP projects

PyData London

This talk highlights common pitfalls that occur when evaluating ML and NLP approaches. It provides comprehensive advice on how to set up a solid evaluation procedure in general, and dives into a few specific use cases to demonstrate artificial bias that can creep in unnoticed.

Sovereign AI systems instead of black box solutions

it-daily

German article featuring Ines’ take on AI in industry, the role of open source, and using Generative AI to create systems.

Developer Trends in 2025

TalkPython Podcast

Discussion with Michael Kennedy, Calvin Hendryx-Parker, Gina Häußge, Richard Campbell and Ines.

KI ohne Ketten: Warum Open Source gegen Big Tech gewinnen kann (AI Without Chains: Why Open Source Can Win Against Big Tech)

UNMUTE IT Podcast (German)

Interview with Ines on open source, LLMs, ethics and sustainable AI development.

Künstliche Intelligenz: Technologie der Zukunft – und warum Open Source die Karten neu mischt (Artificial Intelligence: Technology of the Future, and Why Open Source Is Reshuffling the Cards)

Heise KI-Woche 2025 (German)

German talk on the future of Artificial Intelligence and the impact of open-source software and models.

✨ prodigy v1.18.0 (Feb 24, 2025)

Text editing during NER and span annotation, custom translations and more JavaScript features

What the history of the web can teach us about the future of AI

PyCon+Web Keynote

In this talk, Ines takes a look at what the history of the web can teach us about the future of AI, and what this means for developers, models, open source and regulation.

Cracking the Code: How to Start a Career in AI

Welcome to the Jungle

Short video interview with Ines about the 4 skills job hunters can cultivate for a career in artificial intelligence.

From PDFs to AI-ready structured data: a deep dive

This blog post presents a new modular workflow for converting PDFs and similar documents to structured data and shows you how to build end-to-end document understanding and information extraction pipelines for industry use cases.

✨ prodigy v1.17.0 (Nov 18, 2024)

Pages UI for multi-page tasks like longer documents, PDFs or collections of images

Accelerate your Career with Open-Source AI

dotAI

Panel discussion about making a career out of open-source software, featuring Gael Varoquaux (scikit-learn), Steeve Morin (ZML) and Ines.

💫 spacy v3.8.0 (Oct 1, 2024)

Memory management for persistent services, numpy 2.0 support

Taking LLMs out of the black box: A practical guide to human-in-the-loop distillation

InfoQ Dev Summit

LLMs have enormous potential, but also challenge existing workflows in industry that require modularity, transparency and data privacy. In this talk, Ines shows some practical solutions for using the latest models in real-world applications and distilling their knowledge into smaller and faster components that you can run and maintain in-house.

Szczecin stolicą programowania (Szczecin, Capital of Programming)

TVP3 Szczecin

News segment about EuroSciPy 2024 on local Polish television, featuring Ines’ talk and interviews with the organizers.

spaCy Chunks v0.0.2

spaCy extension and pipeline component for generating overlapping chunks of sentences or tokens from a document.
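
To illustrate what overlapping chunks look like, here is a minimal sketch in plain Python. It does not use the spaCy Chunks API: the function `overlapping_chunks` and its `chunk_size` and `overlap` parameters are hypothetical names chosen for this example, and the input is assumed to be a list of sentences or tokens that has already been split.

```python
def overlapping_chunks(items, chunk_size, overlap):
    """Split a sequence into chunks that share `overlap` items with the previous chunk.

    Hypothetical helper for illustration only; not part of any library.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(items), step):
        chunks.append(items[start:start + chunk_size])
        # Stop once the current window has reached the end of the input.
        if start + chunk_size >= len(items):
            break
    return chunks


sentences = ["s1", "s2", "s3", "s4", "s5"]
chunks = overlapping_chunks(sentences, chunk_size=3, overlap=1)
print(chunks)  # [['s1', 's2', 's3'], ['s3', 's4', 's5']]
```

The overlap keeps some shared context between neighbouring chunks, which can help when each chunk is processed or retrieved independently.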

Building the Future of NLP: Insights on spaCy, Prodigy and Generative AI

Leading With Data Podcast

A practical guide to human-in-the-loop distillation

This blog post presents practical solutions for using the latest state-of-the-art models in real-world applications and distilling their knowledge into smaller and faster components that you can run and maintain in-house.

Conquering PDFs: document understanding beyond plain text

PyData London

In this talk, Ines presents a new and modular approach for building robust document understanding systems, using state-of-the-art models and the awesome Python ecosystem.

Feminist AI LAN Party

PyCon DE & PyData

Three days of workshops, hacking, creating, publishing and connecting locally, featuring a data development workshop with Prodigy and a session on hacking LLMs.

KI zwischen Freiheit und Kontrolle (AI Between Freedom and Control): The AI Revolution Will Not Be Monopolized

data:unplugged (German)

How should we envision the use of AI in practice? And are we heading further into a black box era with larger and larger models, obscured behind APIs controlled by big tech monopolies?

Mastering spaCy

Déborah Mesquita, Duygu Altinok (Packt Publishing, 2025)

Build structured NLP solutions with custom components and models powered by LLMs. By the end of the book, you will be empowered to build robust NLP pipelines and integrate them with web applications to build end-to-end solutions.

Using natural language processing to identify emergency department patients with incidental lung nodules requiring follow-up

Moore, Socrates, Hesami, Denkewicz, Cavallo, Venkatesh, Taylor (2025)

CT reports were annotated by MD raters using Prodigy software to develop a stepwise NLP “pipeline” that first excluded prior or known malignancy, determined the presence of a lung nodule, and then categorized any recommended follow-up. NLP was developed using a RoBERTa large language model on the spaCy platform.

Prodigy Dashboard Plugin

The new dashboard plugin adds a web application for managing annotations, data analytics and annotation progress, and is now available for early beta testing.

spaCy Natural Language Processing: From Beginner to Advanced

Guan Wang, Xiaoquan Kong (2024)

The first Chinese-language book on spaCy for beginners and experienced practitioners, covering traditional NLP techniques and how to leverage LLMs for various NLP tasks.

🔌 prodigy-pdf v0.3.0 (Nov 18, 2024)

Support multi-page PDFs in a single view

uOttawa at LegalLens-2024: Transformer-based Classification Experiments

Meghdadi, Inkpen (2024)

Our training utilizes the spaCy pipeline configured with a transformer model and a transition-based parser for NER tasks. The deberta-v3-base model has been selected for the main transformer architecture.

Reality is not an End-to-End Prediction Problem: Applied NLP in the Age of Generative AI

dotAI

Applied NLP in the Age of Generative AI

PyData Amsterdam Keynote

In this talk, Ines shares the most important lessons we’ve learned from solving real-world information extraction problems in industry, and shows you a new approach and mindset for designing robust and modular NLP pipelines in the age of Generative AI.

10 Years of Open Source: Navigating the Next AI Revolution

EuroSciPy Keynote

In this talk, Ines shares the most important lessons we’ve learned in 10 years of working on open-source software, the core philosophies that helped us adapt to an ever-changing AI landscape, and why open source and interoperability still win over black-box, proprietary APIs.

Practical Tips for Bootstrapping Information Extraction Pipelines

DataHack Summit

This talk presents approaches for bootstrapping NLP pipelines and retrieval via information extraction, including tips for training, modelling and data annotation.

Happy 10th Birthday, spaCy!

10 years ago today Matt pushed the first commit to spaCy. Since then, the library has evolved as the field moved forward, but also stayed true to its core mission: industrial-strength NLP.

How S&P Global is making markets more transparent with NLP, spaCy and Prodigy

A case study on S&P Global’s efficient information extraction pipelines for real-time commodities trading insights in a high-security environment.

Applied NLP in the Age of Generative AI: Future-Proof Strategies for Banking and Finance

ECONDAT Keynote

A modern approach and mindset for building future-proof NLP pipelines in-house, focusing on use cases from banking, finance and economics.

Conquering PDFs: document understanding beyond plain text

PyCon DE & PyData

In this talk, Ines presents a new and modular approach for building robust document understanding systems, using state-of-the-art models and the awesome Python ecosystem.

How to advocate for modular NLP in the age of Generative AI

With all the hype around Generative AI, many are led to believe it’s the solution to everything. So how can you, as a developer, communicate the nuances and advocate for new and modular solutions that are better, easier and cheaper?

Prozessvisualisierung mit generativer KI im Praxistest (Process Visualization with Generative AI Put to the Test)

iX Magazin / Heise

German article by Nils Durner on visualizing technical processes with Generative AI, featuring spaCy and Presidio for PII anonymization.

Best Way to OCR a PDF in Python

Python Tutorials for Digital Humanities

Tutorial by WJB Mattingly on how to use the new spaCy Layout package and Docling to convert PDFs to text.

PyLadies entrepreneurs and career development

PyLadiesCon

Panel discussion about career challenges and starting your own business with Cheuk Ting Ho, Tereza Iofciu, Anwesha Das, Una Galyeva and Ines.

📚 spacy-layout v0.0.6 (Nov 24, 2024)

Add support for tables and convert tabular data to pandas.DataFrame

📚 spacy-layout v0.0.1 (Nov 18, 2024)

Process PDFs, Word documents and more with spaCy

Distill Your LLMs and Surpass Their Performance

InfoQ Magazine

In her presentation at InfoQ Dev Summit, Ines Montani provided the audience with practical solutions for using the latest state-of-the-art models in real-world applications and distilling their knowledge into smaller and faster components.

Applied NLP with LLMs: Beyond Black-Box Monoliths

PyBerlin

In this talk, Ines shows some practical solutions for using the latest state-of-the-art models in real-world applications and distilling their knowledge into smaller and faster components.

Combining the Best of Two Worlds: From TF-IDF to Llama LLM

Open Source Summit Europe

Talk by William Arias, Staff Developer Advocate at GitLab, on combining traditional NLP techniques and LLMs to solve hallucination issues and create robust spaCy applications.

Assessing Fine-Tuned NER Models with Limited Data in French: Automating Detection of New Technologies, Technological Domains, and Startup Names in Renewable Energy

MacLean, Cavallucci (2024)

In order to assure the uniformity of the process of fine-tuning each model, we decided to use the spaCy library. This library, one of the most widely used for NLP tasks, allows us to directly modify a simple configuration file in order to define the model.

Toward Automatic Summarization of Hospital Discharge Notes

Landes (2024)

For NLP tasks, vectorizers include spaCy token features such as part of speech (POS) tags, named entity recognition (NER) tags, dependency head relations and depth.

The AI Revolution Will Not Be Monopolized

InfoQ

Open-source initiatives are pivotal in democratizing AI technology, offering transparent, extensible tools that empower users. Daniel Dominguez summarizes the key takeaways from Ines’ recent talk for InfoQ.

Exploring the AI nexus with the mind behind spaCy

Leading With Data Podcast

In this episode, Matt takes you on a deep dive into the future of data and the challenges facing current Large Language Models (LLMs).