Introducing custom pipelines and extensions for spaCy v2.0
As the release candidate for spaCy v2.0 gets closer, we've been excited to implement some of the last outstanding features. One of the best improvements is a new system for adding pipeline components and registering extensions to the Doc, Span and Token objects. In this post, we'll introduce you to the new functionality, and finish with an example extension package, spacymoji.
Pseudo-rehearsal: A simple solution to catastrophic forgetting for NLP
Sometimes you want to fine-tune a pre-trained model to add a new label or correct some specific errors. This can introduce the "catastrophic forgetting" problem. Pseudo-rehearsal is a good solution: use the original model to label examples, and mix them through your fine-tuning updates.
Prodigy: A new tool for radically efficient machine teaching
Machine learning systems are built from both code and data. It's easy to reuse the code but hard to reuse the data, so building AI mostly means doing annotation. This is good, because the examples are how you program the behaviour – the learner itself is really just a compiler. What's not good is the current technology for creating the examples. That's why we're pleased to introduce Prodigy, a downloadable tool for radically efficient machine teaching.
Supervised learning is great — it's data collection that's broken
Short of Artificial General Intelligence, we'll always need some way of specifying what we're trying to compute. Labelled examples are a great way to do that, but the process is often tedious. However, the dissatisfaction with supervised learning is misplaced. Instead of waiting for the unsupervised messiah to arrive, we need to fix the way we're collecting and reusing human knowledge.
Supervised similarity: Learning symmetric relations from duplicate question data
Supervised models for text-pair classification let you create software that assigns a label to two texts, based on some relationship between them. When the relationship is symmetric, it can be useful to incorporate this constraint into the model. This post shows how a siamese convolutional neural network performs on two duplicate question data sets, with experimental results and an interactive demo.
Deep text-pair classification with Quora's 2017 question dataset
Quora recently released the first dataset from their platform: a set of 400,000 question pairs, with annotations indicating whether the questions request the same information. This data set is large, real, and relevant — a rare combination. In this post, I'll explain how to solve text-pair tasks with deep learning, using both new and established tips and technologies.
Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models
Over the last six months, a powerful new neural network playbook has come together for Natural Language Processing. The new approach can be summarised as a simple four-step formula: embed, encode, attend, predict. This post explains the components of this new approach, and shows how they're put together in two recent systems.
The spaCy user survey: results and analysis
In the run-up to the 1.0 release, we asked the spaCy community to give us their feedback on the library. If you're one of the 224 people who took part, thank you! Here's what we've learned from your responses, how we're already using them to improve the library, and what we're planning next.
Building your bot's brain with Node.js and spaCy
Natural Language Processing and other AI technologies promise to let us build applications that offer smarter, more context-aware user experiences. However, an application that is almost smart is often very, very dumb. In this tutorial, I'll show you how to set up a better brain for your applications — a Contextual Knowledge Base Graph.
spaCy v1.0: Deep Learning with custom pipelines and Keras
I'm pleased to announce the 1.0 release of spaCy, the fastest NLP library in the world. By far the best part of the 1.0 release is a new system for integrating custom models into spaCy. This post introduces you to the changes, and shows you how to use the new custom pipeline functionality to add a Keras-powered LSTM sentiment analysis model into a spaCy pipeline.
An open-source named entity visualiser for the modern web
Named Entity Recognition is a crucial technology for NLP. Whatever you're doing with text, you usually want to handle names, numbers, dates and other entities differently from regular words. To help you make use of NER, we've released displaCy-ent.js. This post explains how the library works, and how to use it.
Introducing Explosion AI
The problem with developing a machine learning model is that you don't know how well it'll work until you try, and trying is very expensive. Obviously, this risk is unappealing, but the existing solutions on the market, one-size-fits-all cloud services, are even worse. We're launching Explosion AI to give you a better option.
displaCy.js: An open-source NLP visualiser for the modern web
With new offerings from Google, Microsoft and others, there are now a range of excellent cloud APIs for syntactic dependencies. A key part of these services is the interactive demo, where you enter a sentence and see the resulting annotation. We're pleased to announce the release of displaCy.js, a modern and service-independent visualisation library. We hope this makes it easy to compare different services, and explore your own in-house models.
How front-end development can improve Artificial Intelligence
What's holding back Artificial Intelligence? While researchers rightly focus on better algorithms, there are a lot more things to be done. In this post I'll discuss three ways in which front-end development can improve AI technology: by improving the collection of annotated data, communicating the capabilities of the technology to key stakeholders, and exploring the system's behaviours and errors.
SyntaxNet in context: Understanding Google's new TensorFlow NLP model
Yesterday, Google open sourced their TensorFlow-based dependency parsing library, SyntaxNet. The library gives access to a line of neural network parsing models published by Google researchers over the last two years. I've been following this work closely since it was published, and have been looking forward to the software being released. This post tries to provide some context around the release: what's new here, and how important is it?
spaCy now speaks German
Many people have asked us to make spaCy available for their language. Being based in Berlin, German was an obvious choice for our first second language. Now spaCy can do all the cool things you use for processing English on German text too. But more importantly, teaching spaCy to speak German required us to drop some comfortable but English-specific assumptions about how language works and made spaCy fit to learn more languages in the future.
Multi-threading spaCy's parser and named entity recogniser
In v0.100.3, we quietly rolled out support for GIL-free multi-threading for spaCy's syntactic dependency parsing and named entity recognition models. Because these models take up a lot of memory, we've wanted to release the global interpreter lock (GIL) around them for a long time. When we finally did, it seemed a little too good to be true, so we delayed celebration, and then quickly moved on to other things. It's now past time for a write-up.
Statistical NLP in the Ten Hundred Most Common English Words
When I was little, my favorite TV shows all had talking computers. Now I’m big and there are still no talking computers, so I’m trying to make some myself. Well, we can make computers say things. But when we say things back, they don’t really understand. Why not?
Rebuilding a Website with Modular Markup Components
In a small team, everyone should be able to contribute content to the website and make use of the full set of visual components, without having to worry about design or write complex HTML. To help us write docs, tutorials and blog posts about spaCy, we've developed a powerful set of modularized markup components, implemented using Jade.
Sense2vec with spaCy and Gensim
If you were doing text analytics in 2015, you were probably using word2vec. Sense2vec (Trask et al., 2015) is a new twist on word2vec that lets you learn more interesting, detailed and context-sensitive word vectors. This post motivates the idea, explains our implementation, and comes with an interactive demo that we've found surprisingly addictive.
Dead Code Should Be Buried
Natural Language Processing moves fast, so maintaining a good library means constantly throwing things away. Most libraries are failing badly at this, as academics hate to editorialize. This post explains the problem, why it's so damaging, and why I wrote spaCy to do things differently.
Displaying Linguistic Structure with CSS
One of the features of the relaunch I'm most excited about is the displaCy visualizer and annotation tool. This solves two problems I've thought about a lot: first, how can I help people understand what information spaCy gives them access to? Without a good visualization, the ideas are very abstract. Second, how can we make dependency trees easy for humans to create?
How spaCy Works
This post was pushed out in a hurry, immediately after spaCy was released. It explains some of how spaCy is designed and implemented, and provides some quick notes explaining which algorithms were used. The post pre-dates spaCy's named entity recogniser, but it provides some detail about the tokenisation algorithm, general design, and efficiency concerns.
Computers don't understand text. This is unfortunate, because that's what the web almost entirely consists of. We want to recommend people text based on other text they liked. We want to shorten text to display it on a mobile screen. We want to aggregate it, link it, filter it, categorise it, generate it and correct it. spaCy provides a library of utility functions that help programmers build such products.
Writing C in Cython
For the last two years, I’ve done almost all of my work in Cython. And I don’t mean, I write Python, and then “Cythonize” it, with various type-declarations et cetera. I just, write Cython. I use "raw" C structs and arrays, and occasionally C++ vectors, with a thin wrapper around malloc/free that I wrote myself. The code is almost always exactly as fast as C/C++, because that's really all it is, but with Python right there, if I want it.
Parsing English in 500 Lines of Python
This post explains how transition-based dependency parsers work, and argues that this algorithm represents a breakthrough in natural language understanding. A concise sample implementation is provided, in 500 lines of Python, with no external dependencies. This post was written in 2013. As of 2015, this type of parser is increasingly dominant.
A Good Part-of-Speech Tagger in about 200 Lines of Python
Up-to-date knowledge about natural language processing is mostly locked away in academia. And academics are mostly pretty self-conscious when we write. We’re careful. We don’t want to stick our necks out too much. But under-confident recommendations suck, so here’s how to write a good part-of-speech tagger.