Category: Blog · Explosion · Developer tools and consulting for AI, Machine Learning and NLP

Filtered by category: Blog

Beta test our new product for agentic NLP development

We’re looking for beta partners for Ellf: a platform and virtual assistant that makes a coding agent such as Claude Code proficient at developing NLP solutions. Contact us or sign up for the waitlist if your team needs help with a project focused on tasks like information extraction!

Engineering a human-aligned LLM evaluation workflow with Prodigy and DSPy

This post demonstrates a human-in-the-loop workflow for developing and evaluating LLMs, using Prodigy and DSPy to create task-specific, human-aligned metrics that guide model optimization beyond generic evaluation measures.

From PDFs to AI-ready structured data: a deep dive

This blog post presents a new modular workflow for converting PDFs and similar documents to structured data and shows you how to build end-to-end document understanding and information extraction pipelines for industry use cases.

The Window-Knocking Machine Test

How will technology shape our world going forward? And what tools and products should we build? When imagining what the future could look like, it helps to look back in time and compare past visions to our reality today.

How S&P Global is making markets more transparent with NLP, spaCy and Prodigy

A case study on S&P Global’s efficient information extraction pipelines for real-time commodities trading insights in a high-security environment.

ACL LAW Workshop Poster (ACL 2023)

spaCy Plugin for VSCode

The spaCy VSCode Extension provides additional tooling and features for working with spaCy’s config files. Version 1.0.0 includes hover descriptions for registry functions, variables, and section names within the config as an installable extension.

Deploying a Prodigy cloud service for Posh’s financial chatbots

A Prodigy case study of Posh AI's production-ready annotation platform and custom chatbot annotation tasks for banking customers.

Training spaCy NER Models with Prodigy

This handy flowchart contains our most common tips, tricks, and best practices for training and updating spaCy named entity recognition models with Prodigy.

spaCy Cheat Sheet

Everything you need to know about spaCy as a handy two-page PDF.

floret: lightweight, robust word vectors

An exploration of floret vectors: lightweight vectors for noisy data, novel words, rich morphology and more.

Diary of a spaCy project: Predicting GitHub Tags

Many people assume that working on an NLP project involves a lot of machine learning. Our experience is that it's much less about flowing tensors, and more about making a tailored solution. This blog post demonstrates how a typical spaCy project could be initiated, implemented and executed towards a custom solution.

Explosion in 2021: Our Year in Review

The year 2021 is coming to an end, and like the previous year, it was shaped by unique challenges that impacted our work together. For Explosion, it was a very productive year. We found an investor that fits our strategy, the work on Prodigy Teams is in full swing, and the team has grown a lot. So here's our look back at our highlights of the year 2021.

spaCy v3's project and config systems are pretty great

The road to production has become increasingly hard. Machine Learning Engineers who turn prototypes into production-ready software face difficulties with the lack of tooling and best practices. spaCy v3, with its configuration and project system, introduced a way to solve this problem. Here's my take on how it works, and how it can ramp up your team!

Applied NLP Thinking: How to Translate Problems into Solutions

We’ve been running Explosion for about five years now, which has given us a lot of insights into what Natural Language Processing looks like in industry contexts. In this blog post, I’m going to discuss some of the biggest challenges for applied NLP and translating business problems into machine learning solutions.

Introducing spaCy v2.3

spaCy now speaks Chinese, Japanese, Danish, Polish and Romanian! Version 2.3 of the spaCy Natural Language Processing library adds models for five new languages. We've also updated all 15 model families with word vectors and improved accuracy, while also decreasing model size and loading times for models with vectors.

Introducing spaCy v2.2

Version 2.2 of the spaCy Natural Language Processing library is leaner, cleaner and even more user-friendly. In addition to new model packages and features for training, evaluation and serialization, we've made lots of bug fixes, improved debugging and error handling, and greatly reduced the size of the library on disk.

Introducing spaCy v2.1

Version 2.1 of the spaCy Natural Language Processing library includes a huge number of features, improvements and bug fixes. In this post, we highlight some of the things we're especially pleased with, and explain some of the most challenging parts of preparing this big release.

Pseudo-rehearsal: A simple solution to catastrophic forgetting for NLP

Sometimes you want to fine-tune a pre-trained model to add a new label or correct some specific errors. This can introduce the "catastrophic forgetting" problem. Pseudo-rehearsal is a good solution: use the original model to label examples, and mix them through your fine-tuning updates.

Supervised learning is great — it's data collection that's broken

Short of Artificial General Intelligence, we'll always need some way of specifying what we're trying to compute. Labelled examples are a great way to do that, but the process is often tedious. However, the dissatisfaction with supervised learning is misplaced. Instead of waiting for the unsupervised messiah to arrive, we need to fix the way we're collecting and reusing human knowledge.

spaCy v1.0: Deep Learning with custom pipelines and Keras

I'm pleased to announce the 1.0 release of spaCy, the fastest NLP library in the world. By far the best part of the 1.0 release is a new system for integrating custom models into spaCy. This post introduces you to the changes, and shows you how to use the new custom pipeline functionality to add a Keras-powered LSTM sentiment analysis model into a spaCy pipeline.

How front-end development can improve Artificial Intelligence

What's holding back Artificial Intelligence? While researchers rightly focus on better algorithms, there are a lot more things to be done. In this post I'll discuss three ways in which front-end development can improve AI technology: by improving the collection of annotated data, communicating the capabilities of the technology to key stakeholders, and exploring the system's behaviours and errors.

spaCy now speaks German

Many people have asked us to make spaCy available for their language. Being based in Berlin, German was an obvious choice for our first second language. Now spaCy can do all the cool things you use for processing English on German text too. But more importantly, teaching spaCy to speak German required us to drop some comfortable but English-specific assumptions about how language works and made spaCy fit to learn more languages in the future.

How spaCy Works

This post was pushed out in a hurry, immediately after spaCy was released. It explains some of how spaCy is designed and implemented, and provides some quick notes explaining which algorithms were used. The post pre-dates spaCy's named entity recogniser, but it provides some detail about the tokenisation algorithm, general design, and efficiency concerns.

A Good Part-of-Speech Tagger in about 200 Lines of Python

Up-to-date knowledge about natural language processing is mostly locked away in academia. And academics are mostly pretty self-conscious when we write. We’re careful. We don’t want to stick our necks out too much. But under-confident recommendations suck, so here’s how to write a good part-of-speech tagger.

Atomic NLP

An applied NLP methodology inspired by Atomic Design: building reliable language understanding systems out of small, composable components instead of one big model and a prompt.

How to advocate for modular NLP in the age of Generative AI

With all the hype around Generative AI, many are led to believe it’s the solution to everything. So how can you, as a developer, communicate the nuances and advocate for new and modular solutions that are better, easier and cheaper?

Serverless custom NLP with LLMs, Modal and Prodigy

In this blog post, we’ll show you how you can go from an idea and little data to a fully custom information extraction model using Prodigy and Modal, no infrastructure or GPU setup required.

Back to our roots: Company update and future plans

We’re back to running Explosion as a smaller, independent-minded and self-sufficient company. spaCy and Prodigy will stay stable and sustainable, maintained by their original authors. We’ll keep updating our stack with the latest technologies, without changing its core identity or purpose.

How Nesta uses NLP to process 7m job ads and shed light on the UK’s labor market

A case study on Nesta’s workflow for extracting 7 million job ads to better understand UK skill demand, using a custom mapping step to match skills to any government taxonomy.

Introducing spaCy v3.6

spaCy v3.6 introduces the span finder component and trained pipelines for Slovenian.

Implementing a custom trainable component for relation extraction

Relation extraction refers to the process of predicting and labeling semantic relationships between named entities. In this blog post, we'll go over the process of building a custom relation extraction component using spaCy and Thinc. We'll also add a Hugging Face transformer to improve performance at the end of the post. You'll see how you can utilize Thinc's flexible and customizable system to build an NLP pipeline for biomedical relation extraction.

Towards a Tagalog NLP pipeline

In this blog post, Lj talks about how he built an NER pipeline for Tagalog, the gold-standard dataset, benchmarking results, and his hopes for the future of Tagalog NLP.

Reflections on a year of spaCy consulting at Explosion

In this post, Peter shares some lessons learned from chatting with practitioners about their NLP challenges, developing production-ready NLP pipelines for clients, and working with an open-source development team.

How the Guardian approaches quote extraction with NLP

A case study of the Guardian's spaCy-Prodigy workflow to modularize quote extraction for content creation. This study includes iterative annotation guidelines and custom interface functionality.

Introducing Holmes 4.0

A few weeks ago we released version 4.0 of Holmes, which we are now able to offer under a permissive MIT license. Holmes is a library in the spaCy Universe that runs on top of spaCy and enables information extraction and intelligent search, currently for English and German. Holmes goes beyond simple matching algorithms and allows you to look for a specified idea or ideas in a corpus of documents.

Introducing spaCy v3.3

spaCy v3.3 improves the speed of core pipeline components, adds a new trainable lemmatizer, and introduces trained pipelines for Finnish, Korean and Swedish.

Healthsea: an end-to-end spaCy pipeline for exploring health supplement effects

Create better access to health with machine learning and natural language processing. Read about our journey of developing Healthsea, an end-to-end spaCy pipeline for analyzing user reviews of supplement products and extracting potential effects on health.

Introducing spaCy v3.2

spaCy v3.2 features usability improvements for custom training and scoring, improved performance and support for floret, our new fastText word vectors algorithm.

Introducing spaCy v3.0

spaCy v3.0 is a huge release! It features new transformer-based pipelines that get spaCy's accuracy right up to the current state-of-the-art, and a new workflow system to help you take projects from prototype to production. It's much easier to configure and train your pipeline, and there are lots of new and improved integrations with the rest of the NLP ecosystem.

Explosion in 2019: Our Year in Review

As 2019 draws to a close and we step into the 2020s, we thought we’d take a look back at the year and all we’ve accomplished. And we realized we had so much that we could give you a month-by-month rundown of everything that happened.

spaCy meets Transformers: Fine-tune BERT, XLNet and GPT-2

Huge transformer models like BERT, GPT-2 and XLNet have set a new standard for accuracy on almost every NLP leaderboard. You can now use these models in spaCy, via a new interface library we've developed that connects spaCy to Hugging Face's awesome implementations.

The process: Transforming spaCy’s docs (Increment Magazine)

Making your documentation work for users with vastly different needs is a challenge. Here’s how spaCy, an open-source library for natural language processing, did it.

Building Prodigy: Our new tool for efficient machine teaching (ines.io)

The philosophy behind Prodigy’s features and its cloud-free design.

Supervised similarity: Learning symmetric relations from duplicate question data

Supervised models for text-pair classification let you create software that assigns a label to two texts, based on some relationship between them. When the relationship is symmetric, it can be useful to incorporate this constraint into the model. This post shows how a siamese convolutional neural network performs on two duplicate question data sets with experimental results.

An open-source named entity visualizer for the modern web

Named Entity Recognition is a crucial technology for NLP. Whatever you're doing with text, you usually want to handle names, numbers, dates and other entities differently from regular words. To help you make use of NER, we've released displaCy-ent.js. This post explains how the library works, and how to use it.

A natural language user interface is just a user interface

Let’s say you’re writing an application, and you want to give it a conversational interface: your users will type some command, and your application will do something in response, possibly after asking for clarification.

Statistical NLP in the Ten Hundred Most Common English Words

When I was little, my favorite TV shows all had talking computers. Now I’m big and there are still no talking computers, so I’m trying to make some myself. Well, we can make computers say things. But when we say things back, they don’t really understand. Why not?

Introducing spaCy

Computers don't understand text. This is unfortunate, because that's what the web almost entirely consists of. We want to recommend people text based on other text they liked. We want to shorten text to display it on a mobile screen. We want to aggregate it, link it, filter it, categorise it, generate it and correct it. spaCy provides a library of utility functions that help programmers build such products.

The ultimate guide to optimizing annotation workflows

This blog post collects tips and advice for how to build efficient human-in-the-loop data development workflows, break down business problems into actionable annotation steps and make the most of automation and model assistance.

How Love Without Sound helps the music industry recover millions in revenue for artists with NLP, spaCy and Prodigy

A case study on Love Without Sound’s innovative AI-powered tools for the music industry and law firms specializing in royalty negotiations.

The 100 who are shaping AI in Europe

Ines is featured among the top 100 individuals who are shaping Artificial Intelligence in Europe, compiled by French newspaper l’Opinion.

Happy 10th Birthday, spaCy!

10 years ago today Matt pushed the first commit to spaCy. Since then, the library has evolved as the field moved forward, but also stayed true to its core mission: industrial-strength NLP.

Prodigy in 2023: LLMs, task routers, QA and plugins

We have made a ton of new updates in Prodigy this year with v1.12, v1.13, and v1.14 releases. So we decided to write a post about them.

Large Disagreement Modelling

“In this blogpost I’d like to talk about large language models. There’s a bunch of hype, sure, but there’s also an opportunity to revisit one of my favourite machine learning techniques: disagreement.”

The Tale of Bloom Embeddings and Unseen Entities

The default Bloom embedding layer in spaCy is unconventional, but very powerful and efficient. We wrote about it before and showed the advantages it provides in terms of memory efficiency for our floret embeddings. Now we have released the first technical report by Explosion, where we explain Bloom embeddings in more detail and rigorously compare them to traditional embeddings. In this post we'll highlight some of our results with a special focus on unseen entities.

Introducing spaCy v3.5

spaCy v3.5 introduces new CLI commands, fuzzy matching, improvements for entity linking and more.

Setting your ML project up for success

“What can you do to maximize probability of success for your Machine Learning solution? Throughout my 15 years as data scientist in academia, big pharma and through consulting, one common theme has emerged: the most reliable predictor of success for any NLP or ML-based solution is whether or not you involve the data science team early on.”

End-to-end Neural Coreference Resolution in spaCy

Coreference resolution is the problem of resolving references such as pronouns to the entities they refer to in a text. Even if you've never heard of it, it's something we all do constantly every day, and is a key to understanding natural language. We recently added an experimental implementation of an end-to-end neural coreference component to spaCy. This post explains the architecture of our model in detail.

Introducing spaCy v3.4

spaCy v3.4 brings typing and speed improvements along with new vectors for English CNN pipelines and new trained pipelines for Croatian.

Compact word vectors with Bloom embeddings

An introduction to the compact word vectors with Bloom embeddings used in Thinc, spaCy and floret.

Universal Dependencies v2.5 Benchmarks for spaCy

We present Universal Dependencies v2.5 benchmarks for spaCy v3.2 that show the competitive performance of spaCy in a direct comparison with Stanza and Trankit using the end-to-end evaluation from the CoNLL 2018 Shared Task.

We’ve sold 5% of Explosion

Since founding Explosion in 2016, we’ve run the company as a profitable business and we decided to only consider external investment if we could find a deal that wouldn’t compromise the direction or stability of the company. We’re pleased to announce that we’ve found an investment that ticks all the boxes.

Explosion in 2020: Our Year in Review

While 2020 hasn’t been easy for anyone, at Explosion we’ve considered ourselves relatively fortunate in this most interesting year. We’ve always worked remotely, so we’ve been able to take both pride and comfort in continuing to ship good software. Here’s a look back at what we’ve been up to.

sense2vec reloaded: contextually-keyed word vectors

In 2016 we trained a sense2vec model on the 2015 portion of the Reddit comments corpus, leading to a useful library and one of our most popular demos. That work is now due for an update. In this post, we present a new version and a demo NER project that we trained to usable accuracy in just a few hours.

spaCy IRL 2019: 2 days of NLP in Berlin

We were pleased to invite the spaCy community and other folks working on Natural Language Processing to Berlin this summer for a small and intimate event.

Explosion in 2017: Our Year in Review

We founded Explosion in October 2016, so this was our first full calendar year in operation. We set ourselves ambitious goals this year, and we're very happy with how we achieved them. Here's what we got done.

Prodigy: A new tool for radically efficient machine teaching

Machine learning systems are built from both code and data. It's easy to reuse the code but hard to reuse the data, so building AI mostly means doing annotation. This is good, because the examples are how you program the behaviour – the learner itself is really just a compiler. What's not good is the current technology for creating the examples. That's why we're pleased to introduce Prodigy, a downloadable tool for radically efficient machine teaching.

Deep text-pair classification with Quora's 2017 question dataset

Quora recently released the first dataset from their platform: a set of 400,000 question pairs, with annotations indicating whether the questions request the same information. This data set is large, real, and relevant — a rare combination. In this post, I'll explain how to solve text-pair tasks with deep learning, using both new and established tips and technologies.

displaCy.js: An open-source NLP visualizer for the modern web

With new offerings from Google, Microsoft and others, there is now a range of excellent cloud APIs for syntactic dependencies. A key part of these services is the interactive demo, where you enter a sentence and see the resulting annotation. We're pleased to announce the release of displaCy.js, a modern and service-independent visualization library. We hope this makes it easy to compare different services, and explore your own in-house models.

SyntaxNet in context: Understanding Google's new TensorFlow NLP model

Yesterday, Google open sourced their Tensorflow-based dependency parsing library, SyntaxNet. The library gives access to a line of neural network parsing models published by Google researchers over the last two years. I've been following this work closely since it was published, and have been looking forward to the software being published. This post tries to provide some context around the release — what's new here, and how important is it?

Sense2vec with spaCy and Gensim

If you were doing text analytics in 2015, you were probably using word2vec. Sense2vec (Trask et al., 2015) is a new twist on word2vec that lets you learn more interesting, detailed and context-sensitive word vectors. This post motivates the idea, explains our implementation, and comes with an interactive demo that we've found surprisingly addictive.

Writing C in Cython

For the last two years, I’ve done almost all of my work in Cython. And I don’t mean, I write Python, and then “Cythonize” it, with various type-declarations et cetera. I just, write Cython. I use "raw" C structs and arrays, and occasionally C++ vectors, with a thin wrapper around malloc/free that I wrote myself. The code is almost always exactly as fast as C/C++, because that's really all it is, but with Python right there, if I want it.

Style tips for less experienced developers coding with AI

Getting good performance out of LLM coding is less about prompts or agents, and more about the code you steer the model towards. Matt shares some lessons software engineers have learned over the years about building bigger things.

What the history of the web can teach us about the future of AI

How will AI development look in the future? There is a lot we can learn from another groundbreaking technology: the web. This blog post takes a look at what the history of the web can teach us, and what this means for developers, models, open source and regulation.

How GitLab uses spaCy to analyze support tickets and empower their community

A case study on GitLab’s large-scale NLP pipelines for extracting actionable insights from support tickets and usage questions.

A practical guide to human-in-the-loop distillation

This blog post presents practical solutions for using the latest state-of-the-art models in real-world applications and distilling their knowledge into smaller and faster components that you can run and maintain in-house.

Launching the Explosion Merch Store

Spread the love and support us and our open-source work with some of our unique, custom-designed swag. All orders come with free shipping and stickers!

Against LLM maximalism

LLMs are not a direct solution to most of the NLP use-cases companies have been working on. They are extremely useful, but if you want to deliver reliable software you can improve over time, you can't just write a prompt and call it a day. Once you're past prototyping and want to deliver the best system you can, supervised learning will often give you better efficiency, accuracy and reliability.

Rulers, NER, and data iteration

About the power of Rules + ML and the importance of iteration on your pipeline and your data.

Explosion in 2022: Our Year in Review

It's been another exciting year at Explosion! We've developed a new end-to-end neural coref component for spaCy, improved the speed of our CNN pipelines by up to 60%, and published new pre-trained pipelines for Finnish, Korean, Swedish and Croatian. We've also released several updates to Prodigy and introduced new recipes to kickstart annotation with zero- or few-shot learning.

Fast transformer inference with Metal Performance Shaders

We are happy to introduce support for Metal Performance Shaders in Thinc PyTorch layers. This makes it possible to run spaCy transformer-based pipelines on GPU on Apple Silicon Macs and improves inference speed by up to 4.7 times.

spaCy behind the scenes: library patterns & design concepts explained

Developer productivity has been central to our design of spaCy, both in smaller decisions and some of the bigger architectural questions. We believe in embracing the complexities of machine learning, not hiding them away under leaky abstractions, while also maintaining the developer experience. Read on to learn some of the design patterns within the library, how we've implemented them, and most importantly, why.

Spancat: a new approach for span labeling

The SpanCategorizer is a spaCy component that answers the NLP community's need to have structured annotation for a wide variety of labeled spans, including long phrases, non-named entities, or overlapping annotations. In this blog post, we're excited to talk more about spancat and showcase new features to help with your span labeling needs!

Introducing spaCy Tailored Pipelines

Explosion is pleased to announce a new development services offering, spaCy Tailored Pipelines. We’ll build you a custom natural language processing pipeline, delivered in a standardized format using spaCy’s projects system.

Neural edit-tree lemmatization for spaCy

We are happy to introduce a new, experimental, machine learning-based lemmatizer that posts accuracies above 95% for many languages. This lemmatizer learns to predict lemmatization rules from a corpus of examples and removes the need to write an exhaustive set of per-language lemmatization rules.

Introducing spaCy v3.1

It’s been great to see the adoption of spaCy v3, which introduced transformer-based pipelines, a new training system and more. Version 3.1 adds more on top of it, including the ability to use predicted annotations during training, a component for predicting arbitrary and overlapping spans and new pipelines for Catalan and Danish.

Ines becomes a Python Software Foundation Fellow

Explosion awarded META Seal of Recognition

We’re proud to accept the META Seal of Recognition at META-FORUM in Brussels, along with Mozilla. The META-FORUM is an international conference series backed by the European Union on powerful and innovative Language Technologies for a multilingual information society.

Advanced NLP with spaCy: A free online course

In this free and interactive online course, you’ll learn how to use spaCy to build advanced natural language understanding systems, using both rule-based and machine learning approaches.

Introducing custom pipelines and extensions for spaCy v2.0

As the release candidate for spaCy v2.0 gets closer, we've been excited to implement some of the last outstanding features. One of the best improvements is a new system for adding pipeline components and registering extensions to the Doc, Span and Token objects. In this post, we'll introduce you to the new functionality, and finish with an example extension package, spacymoji.

Reflections on running spaCy: commercial open-source NLP (ines.io)

As more and more people and companies are getting involved with open-source software, balancing the expectations of an open community and a traditional provider vs. consumer relationship is becoming increasingly difficult. Are maintainers becoming too authoritarian? Are users becoming too demanding? Are large companies selling out open-source?

Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models

Over the last six months, a powerful new neural network playbook has come together for Natural Language Processing. The new approach can be summarised as a simple four-step formula: embed, encode, attend, predict. This post explains the components of this new approach, and shows how they're put together in two recent systems.

Introducing Explosion AI

The problem with developing a machine learning model is that you don't know how well it'll work until you try, and trying is very expensive. Obviously, this risk is unappealing, but the existing solutions in the market, one-size-fits-all cloud services, are even worse. We're launching Explosion AI to give you a better option.

Multi-threading spaCy's parser and named entity recognizer

In v0.100.3, we quietly rolled out support for GIL-free multi-threading for spaCy's syntactic dependency parsing and named entity recognition models. Because these models take up a lot of memory, we've wanted to release the global interpreter lock (GIL) around them for a long time. When we finally did, it seemed a little too good to be true, so we delayed celebration, and then quickly moved on to other things. It's now past time for a write-up.

Dead Code Should Be Buried

Natural Language Processing moves fast, so maintaining a good library means constantly throwing things away. Most libraries are failing badly at this, as academics hate to editorialize. This post explains the problem, why it's so damaging, and why I wrote spaCy to do things differently.

Parsing English in 500 Lines of Python

This post explains how transition-based dependency parsers work, and argues that this algorithm represents a breakthrough in natural language understanding. A concise sample implementation is provided, in 500 lines of Python, with no external dependencies. This post was written in 2013. In 2015 this type of parser is now increasingly dominant.