© Kemal Şanlı & Freepik

Introducing custom pipelines and extensions for spaCy v2.0

by Ines Montani on

As the release candidate for spaCy v2.0 gets closer, we've been excited to implement some of the last outstanding features. One of the best improvements is a new system for adding pipeline components and registering extensions to the Doc, Span and Token objects. In this post, we'll introduce you to the new functionality, and finish with an example extension package, spacymoji.

spaCy v2.0 alpha

spaCy is an open-source library for advanced Natural Language Processing in Python. The new version is available on pip via spacy-nightly. To try out the examples in this post, you need the latest version, 2.0.0a17. See this page for details on the new features. For an overview of the new models, see the models directory.

Previous versions of spaCy have been fairly difficult to extend. This has been especially true of the core Doc, Token and Span objects. They're not instantiated directly, so creating a useful subclass would involve a lot of ugly abstraction (think FactoryFactoryConfigurationFactory classes). Inheritance is also unsatisfying, because it gives no way to compose different customisations. We want to let people develop extensions to spaCy, and we want to make sure those extensions can be used together. If every extension required spaCy to return a different Doc subclass, there would be no way to do that. To solve this problem, we're introducing a new dynamic field that allows new attributes, properties and methods to be added at run-time:

import spacy
from spacy.tokens import Doc

Doc.set_extension('is_greeting', default=False)

nlp = spacy.load('en')
doc = nlp(u'hello world')
doc._.is_greeting = True

We think the ._ attribute strikes a nice balance between readability and explicitness. Extensions need to be nice to use, but it should also be obvious what is and isn't built-in – otherwise there's no way to track down the documentation or implementation of the code you're reading. The ._ attribute also makes sure that updates to spaCy won't break extension code through namespace conflicts.

The other thing that's been missing for extension development was a convenient way of modifying the processing pipeline. Early versions of spaCy hard-coded the pipeline, because only English was supported. spaCy v1.0 allowed the pipeline to be changed at run-time, but this has been mostly hidden away from the user: you'd call nlp on a text and stuff happens – but what? If you needed to add a process that should run between tagging and parsing, you'd have to dig into spaCy's internals. In spaCy v2.0 there's finally an API for that, and it's as simple as:

nlp = spacy.load('en')
component = MyComponent()
nlp.add_pipe(component, after='tagger')
doc = nlp(u"This is a sentence")

Fundamentally, a pipeline is a list of functions called on a Doc in order. The pipeline can be set by a model, and modified by the user. A pipeline component can be a complex class that holds state, or a very simple Python function that adds something to a Doc and returns it. Under the hood, spaCy performs the following steps when you call nlp on a string of text:

doc = nlp.make_doc(u'This is a sentence')   # create a Doc from raw text
for name, proc in nlp.pipeline:             # iterate over components in order
    doc = proc(doc)                         # call each component on the Doc
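
To make the loop above concrete without loading a model, here is a minimal sketch in plain Python. The FakeDoc class and the two component functions are hypothetical stand-ins and not part of spaCy; only the (name, function) structure and the call order mirror what's described above.

```python
class FakeDoc(object):
    """Hypothetical stand-in for spaCy's Doc: words plus a log of annotations."""
    def __init__(self, text):
        self.text = text
        self.words = text.split()   # naive whitespace "tokenization"
        self.annotations = []       # records which components have run


def fake_tagger(doc):
    doc.annotations.append('tagged')
    return doc                      # every component returns the Doc


def fake_parser(doc):
    doc.annotations.append('parsed')
    return doc


# a pipeline is an ordered list of (name, function) tuples
pipeline = [('tagger', fake_tagger), ('parser', fake_parser)]

doc = FakeDoc(u'This is a sentence')    # plays the role of nlp.make_doc()
for name, proc in pipeline:             # call each component in order
    doc = proc(doc)

print(doc.annotations)
```

Because each component receives the Doc returned by the previous one, the order of the list fully determines what information a component can rely on.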

The nlp object is an instance of Language, which contains the data and annotation scheme of the language you're using and a pre-defined pipeline of components, like the tagger, parser and entity recognizer. If you're loading a model, the Language instance also has access to the model's binary data. All of this is specific to each model, and defined in the model's meta.json – for example, a Spanish NER model requires different weights, language data and pipeline components than an English parsing and tagging model. This is also why the pipeline state is always held by the Language class. spacy.load() puts this all together and returns an instance of Language with a pipeline set and access to the binary data.

A spaCy pipeline in v2.0 is simply a list of (name, function) tuples, describing the component name and the function to call on the Doc object:

>>> nlp.pipeline
[('tagger', <spacy.pipeline.Tagger>), ('parser', <spacy.pipeline.DependencyParser>),
 ('ner', <spacy.pipeline.EntityRecognizer>)]

To make it more convenient to modify the pipeline, there are several built-in methods to get, add, replace, rename or remove individual components. spaCy's default pipeline components, like the tagger, parser and entity recognizer, now all follow the same consistent API and are subclasses of Pipe. If you're developing your own component, using the Pipe API will make it fully trainable and serializable. At a minimum, a component needs to be a callable that takes a Doc and returns it:

def my_component(doc):
    print("The doc is {} characters long and has {} tokens."
          .format(len(doc.text), len(doc)))
    return doc

The component can then be added at any position of the pipeline using the nlp.add_pipe() method. The arguments before, after, first, and last let you specify component names to insert the new component before or after, or tell spaCy to insert it first (i.e. directly after tokenization) or last in the pipeline.

nlp = spacy.load('en')
nlp.add_pipe(my_component, name='print_length', last=True)
doc = nlp(u"This is a sentence.")
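
The positioning arguments are easy to reason about if you think of the pipeline as a plain list. The following sketch is a simplified illustration, not spaCy's implementation: the add_pipe function here is a hypothetical helper operating on a list of (name, function) tuples, and the component names are made up.

```python
def add_pipe(pipeline, component, name, before=None, after=None,
             first=False, last=False):
    """Insert (name, component) into a list of (name, function) tuples.

    Hypothetical helper that mirrors the arguments described above.
    """
    if sum([bool(before), bool(after), bool(first), bool(last)]) > 1:
        raise ValueError("Use only one of before, after, first or last")
    names = [n for n, _ in pipeline]
    if before is not None:
        index = names.index(before)        # raises ValueError if unknown
    elif after is not None:
        index = names.index(after) + 1
    elif first:
        index = 0                          # directly after tokenization
    else:
        index = len(pipeline)              # default: append at the end
    pipeline.insert(index, (name, component))
    return pipeline


def identity(doc):
    return doc


pipeline = [('tagger', identity), ('parser', identity), ('ner', identity)]
add_pipe(pipeline, identity, 'print_length', after='tagger')
add_pipe(pipeline, identity, 'cleanup', first=True)
print([name for name, _ in pipeline])
# ['cleanup', 'tagger', 'print_length', 'parser', 'ner']
```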

When you implement your own pipeline components that modify the Doc, you often want to extend the API, so that the information you're adding is conveniently accessible. spaCy v2.0 introduces a new mechanism that lets you register your own attributes, properties and methods that become available in the ._ namespace, for example, doc._.my_attr.

Why ._?

Writing to a ._ attribute instead of to the Doc directly keeps a clearer separation and makes it easier to ensure backwards compatibility. For example, if you've implemented your own .coref property and spaCy claims it one day, it'll break your code. Similarly, just by looking at the code, you'll immediately know what's built-in and what's custom – for example, doc.sentiment is spaCy, while doc._.sent_score isn't.

There are three main types of extensions that can be registered via the set_extension() method:

  1. Attribute extensions. Set a default value for an attribute, which can be overwritten.
  2. Property extensions. Define a getter and an optional setter function.
  3. Method extensions. Assign a function that becomes available as an object method.

Doc.set_extension('hello_attr', default=True)
Doc.set_extension('hello_property', getter=get_value, setter=set_value)
Doc.set_extension('hello_method', method=lambda doc, name: 'Hi {}!'.format(name))

doc._.hello_attr            # True
doc._.hello_property        # return value of get_value
doc._.hello_method('Ines')  # 'Hi Ines!'
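
If you're curious how a namespace like this can support all three extension types, here's a toy version in plain Python. It's only a sketch of the idea and not spaCy's actual implementation: the Underscore and FakeDoc classes, and the n_chars extension, are hypothetical names chosen for illustration.

```python
class Underscore(object):
    """Toy ._ namespace (illustrative only, not spaCy's implementation)."""
    # name -> (default, getter, setter, method); shared by all instances
    extensions = {}

    def __init__(self, obj):
        # bypass our own __setattr__ while setting up internal state
        object.__setattr__(self, '_obj', obj)
        object.__setattr__(self, '_values', {})

    def __getattr__(self, name):
        # only called if normal attribute lookup fails
        if name not in self.extensions:
            raise AttributeError(name)
        default, getter, setter, method = self.extensions[name]
        if getter is not None:              # property extension
            return getter(self._obj)
        if method is not None:              # method extension
            return lambda *args: method(self._obj, *args)
        return self._values.get(name, default)  # attribute extension

    def __setattr__(self, name, value):
        if name not in self.extensions:
            raise AttributeError(name)
        default, getter, setter, method = self.extensions[name]
        if setter is not None:
            setter(self._obj, value)
        elif getter is not None or method is not None:
            raise AttributeError("Can't overwrite extension '{}'".format(name))
        else:
            self._values[name] = value


class FakeDoc(object):
    """Hypothetical stand-in for Doc."""
    def __init__(self, text):
        self.text = text
        self._ = Underscore(self)

    @classmethod
    def set_extension(cls, name, default=None, getter=None,
                      setter=None, method=None):
        Underscore.extensions[name] = (default, getter, setter, method)


FakeDoc.set_extension('hello_attr', default=True)
FakeDoc.set_extension('n_chars', getter=lambda doc: len(doc.text))
FakeDoc.set_extension('hello_method',
                      method=lambda doc, name: 'Hi {}!'.format(name))

doc = FakeDoc(u'hello world')
doc._.hello_attr = False                # overwrite the default on this object
print(doc._.hello_attr)
print(doc._.n_chars)
print(doc._.hello_method('Ines'))
```

Note that the getter is evaluated on every access, which is why property extensions can refer to data that's only set later in the pipeline.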

Being able to easily write custom data to the Doc, Token and Span means that applications using spaCy can take full advantage of the built-in data structures and the benefits of Doc objects as the single source of truth containing all information:

  • No information is lost during tokenization and parsing, so you can always relate annotations to the original string.
  • The Token and Span are views of the Doc, so they're always up-to-date and consistent.
  • Efficient C-level access is available to the underlying TokenC* array via doc.c.
  • APIs can standardise on passing around Doc objects, reading and writing from them whenever necessary. Fewer signatures make functions more reusable and composable.

For example, let's say your data contains geographical information like country names, and you're using spaCy to extract those names and add more details, like the country's capital or GPS coordinates. Or maybe your application needs to find names of public figures using spaCy's named entity recognizer, and check if a page about them exists on Wikipedia.

Before, you'd usually run spaCy over your text to get the information you're interested in, save it to a database and add more data to it later. This worked well, but it also meant that you lost all references to the original document. Alternatively, you could serialize your document and store the additional data with references to their respective token indices. Again, this worked well, but it was a pretty unsatisfying solution overall. In spaCy v2.0, you can simply write all this data to custom attributes on a document, token or span, using a name of your choice. For example, token._.country_capital, span._.wikipedia_url or doc._.included_persons.

The following example shows a simple pipeline component that fetches all countries using the REST Countries API, finds the country names in the document, merges the matched spans, assigns the entity label GPE (geopolitical entity) and adds the country's capital, latitude/longitude coordinates and a boolean is_country to the token attributes. You can also find a more detailed version on GitHub.

Countries extension

import requests
from spacy.tokens import Token, Span
from spacy.matcher import PhraseMatcher

class Countries(object):
    name = 'countries'  # component name shown in pipeline

    def __init__(self, nlp, label='GPE'):
        # request all country data from the API
        r = requests.get('https://restcountries.eu/rest/v2/all')
        self.countries = {c['name']: c for c in r.json()}  # create dict for easy lookup
        # initialise the matcher and add patterns for all country names
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add('COUNTRIES', None, *[nlp(c) for c in self.countries.keys()])
        self.label = nlp.vocab.strings[label] # get label ID from vocab
        # register extensions on the Token
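        Token.set_extension('is_country', default=False)
        Token.set_extension('country_capital', default=None)
        Token.set_extension('country_latlng', default=None)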

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # create Span for matched country and assign label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)  # remember the span so it's added to doc.ents and merged
            for token in entity:  # set values of token attributes
                token._.set('is_country', True)
                token._.set('country_capital', self.countries[entity.text]['capital'])
                token._.set('country_latlng', self.countries[entity.text]['latlng'])
        doc.ents = list(doc.ents) + spans  # overwrite doc.ents and add entities – don't replace!
        for span in spans:
            span.merge()  # merge all spans at the end to avoid mismatched indices
        return doc  # don't forget to return the Doc!

The example also uses spaCy's PhraseMatcher, which is another cool feature introduced in v2.0. Instead of token patterns, the phrase matcher can take a list of Doc objects, letting you match large terminology lists quickly and efficiently. When you add the component to the pipeline and process a text, all countries are automatically labelled as GPE entities, and the custom attributes are available on the token:

nlp = spacy.load('en')
component = Countries(nlp)
nlp.add_pipe(component, before='tagger')
doc = nlp(u"Some text about Colombia and the Czech Republic")

print([(ent.text, ent.label_) for ent in doc.ents])
# [('Colombia', 'GPE'), ('Czech Republic', 'GPE')]

print([(token.text, token._.country_capital) for token in doc if token._.is_country])
# [('Colombia', 'Bogotá'), ('Czech Republic', 'Prague')]
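
The match-then-merge logic can also be illustrated on plain lists of strings. The sketch below is a simplified stand-in for what the component does, not how PhraseMatcher or span merging work internally: find_phrases and merge_spans are hypothetical helpers. It collects all matches first and then merges from right to left, which is one way to keep the remaining token offsets valid.

```python
def find_phrases(tokens, phrases):
    """Return (start, end) token offsets of every phrase found in tokens."""
    patterns = set(tuple(p.split()) for p in phrases)
    lengths = sorted(set(len(p) for p in patterns), reverse=True)
    matches = []
    i = 0
    while i < len(tokens):
        for n in lengths:                       # try the longest phrases first
            if i + n <= len(tokens) and tuple(tokens[i:i + n]) in patterns:
                matches.append((i, i + n))
                i += n - 1                      # skip past the matched phrase
                break
        i += 1
    return matches


def merge_spans(tokens, spans):
    """Merge each (start, end) span into one token, working right to left."""
    merged = list(tokens)
    for start, end in sorted(spans, reverse=True):
        merged[start:end] = [u' '.join(merged[start:end])]
    return merged


tokens = u'Some text about Colombia and the Czech Republic'.split()
spans = find_phrases(tokens, [u'Colombia', u'Czech Republic'])
print(spans)
# [(3, 4), (6, 8)]
print(merge_spans(tokens, spans))
# ['Some', 'text', 'about', 'Colombia', 'and', 'the', 'Czech Republic']
```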

Using getters and setters, you can also implement attributes on the Doc and Span that reference custom Token attributes – for example, whether a document contains countries. Since the getter is only called when you access the attribute, you can refer to the Token's is_country attribute here, which is already set in the processing step. For a complete implementation, see the full example.

has_country = lambda tokens: any([token._.is_country for token in tokens])
Doc.set_extension('has_country', getter=has_country)
Span.set_extension('has_country', getter=has_country)

Other ideas

In this case, we are able to fetch all data with one request to the REST API. However, you can also implement API requests via getter functions on individual objects, or add a method attribute to pass in additional parameters. Or how about a Token method that takes another country name or GPS coordinates, and computes the distance to the token's country? This is all possible now!

Having a straightforward API for custom extensions and a clearly defined input/output (Doc/Doc) also helps make larger code bases more maintainable, and allows developers to share their extensions with others and test them reliably. This is relevant for teams working with spaCy, but also for developers looking to publish their own packages, extensions and plugins.

We're hoping that this new architecture will help encourage a community ecosystem of spaCy components to cover any potential use case – no matter how specific. Components can range from simple extensions adding fairly trivial attributes for convenience, to complex models making use of external libraries such as PyTorch, scikit-learn and TensorFlow. There are many components users may want, and we'd love to be able to offer more built-in pipeline components shipped with spaCy – for example, better sentence boundary detection, semantic role labelling and sentiment analysis. But there's also a clear need for making spaCy extensible for specific use cases, making it interoperate better with other libraries, and putting all of it together to update and train statistical models.

Adding better emoji support to spaCy has long been on my list of "cool things to build sometime". Emoji are fun, hold a lot of relevant semantic information and, supposedly, are now more common in Twitter text than hyphens. Over the past two years, they have also become vastly more complex. Aside from the regular emoji characters and their Unicode representations, you can now also use skin tone modifiers that are placed after a regular emoji, and result in one visible character. For example, 👍 + 🏿 = 👍🏿. In addition, some characters can form "ZWJ sequences", i.e. two or more emoji joined by a Zero Width Joiner (U+200D) that are merged into one symbol. For example, 👨 + ZWJ + 🎤 = 👨‍🎤 (official title is "man singer", I call it "Bowie").
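
You can see what these sequences look like under the hood with a few lines of Python. This is a small sketch assuming Python 3, where len() counts code points; the variable names are just for illustration.

```python
thumbs_up = u'\U0001F44D'          # 👍
dark_skin_tone = u'\U0001F3FF'     # 🏿 skin tone modifier
man = u'\U0001F468'                # 👨
microphone = u'\U0001F3A4'         # 🎤
ZWJ = u'\u200D'                    # Zero Width Joiner

modified = thumbs_up + dark_skin_tone    # renders as one character: 👍🏿
singer = man + ZWJ + microphone          # renders as one character: 👨‍🎤

for label, text in [('modifier sequence', modified), ('ZWJ sequence', singer)]:
    codepoints = ['U+{:04X}'.format(ord(char)) for char in text]
    print(label, len(text), codepoints)
```

So although each sequence is displayed as a single symbol, the string itself consists of two and three code points respectively, which is why a tokenizer that splits on individual symbols will break them apart.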

As of v2.0, spaCy's tokenizer splits all emoji and other symbols into individual tokens, making them easier to separate from the rest of your text. However, emoji Unicode ranges are fairly arbitrary and updated often. The \p{Other_Symbol} or \p{So} category, which spaCy's tokenizer uses, is a good approximation, but it also includes other icons and dingbats. So if you want to handle only emoji, there's no way around matching against an exact list. Luckily, the emoji package has us covered here.
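
To see why the So category is only an approximation, you can check a few characters with Python's built-in unicodedata module. The sample characters below are my own picks for illustration: one emoji, plus two symbols that aren't usually treated as emoji.

```python
import unicodedata

samples = [
    u'\U0001F63B',   # 😻 smiling cat face with heart-shaped eyes
    u'\u00B0',       # ° degree sign
    u'\u2701',       # ✁ upper blade scissors (a dingbat)
    u'a',            # plain letter, for comparison
]

categories = {char: unicodedata.category(char) for char in samples}
for char in samples:
    print('U+{:04X}'.format(ord(char)), categories[char])
```

The first three characters all report the same category, So, even though only the first one is something you'd want to treat as an emoji. The letter reports Ll (lowercase letter).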

spacymoji is a spaCy extension and pipeline component that detects individual emoji and sequences in your text, merges them into one token and assigns custom attributes to the Doc, Span and Token. For example, you can check if a document or span includes an emoji, check whether a token is an emoji and retrieve its human-readable description.

import spacy
from spacymoji import Emoji

nlp = spacy.load('en')
emoji = Emoji(nlp)
nlp.add_pipe(emoji, first=True)

doc = nlp(u"This is a test 😻 👍🏿")
assert doc._.has_emoji
assert len(doc._.emoji) == 2
assert doc[2:5]._.has_emoji
assert doc[4]._.is_emoji
assert doc[5]._.emoji_desc == u'thumbs up dark skin tone'
assert doc._.emoji[1] == (u'👍🏿', 5, u'thumbs up dark skin tone')

Pipeline position

By adding the component as the first in the pipeline, the spans are merged right after tokenization, and before the document is parsed. If your text contains a lot of emoji, this might even give you a nice boost in parser accuracy, as the parser only gets to see one token per emoji.

The spacymoji component uses the PhraseMatcher to find occurrences of the exact emoji sequences in the emoji lookup table and generates the respective emoji spans. It also merges them into one token if the emoji consists of more than one character – for example, an emoji with a skin tone modifier or a combined ZWJ sequence. The emoji shortcut, e.g. :thumbs_up:, is converted to a human-readable description, available as token._.emoji_desc. You can also pass in your own lookup table, mapping emoji to custom descriptions.
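
As a rough idea of what the shortcut-to-description step could look like, here's a plain-Python sketch. It's an assumption about one plausible approach and not spacymoji's actual code: the EMOJI_SHORTCUTS table, describe and build_lookup are hypothetical names, and the table only contains two illustrative entries.

```python
# hypothetical table mapping emoji to shortcuts (illustrative entries only)
EMOJI_SHORTCUTS = {
    u'\U0001F44D': u':thumbs_up:',
    u'\U0001F44D\U0001F3FF': u':thumbs_up_dark_skin_tone:',
}


def describe(shortcut):
    """Turn a shortcut like ':thumbs_up:' into 'thumbs up'."""
    return shortcut.strip(u':').replace(u'_', u' ')


def build_lookup(shortcuts, custom=None):
    """Map emoji to descriptions; custom entries override the defaults."""
    lookup = {emoji: describe(code) for emoji, code in shortcuts.items()}
    lookup.update(custom or {})
    return lookup


default_lookup = build_lookup(EMOJI_SHORTCUTS)
custom_lookup = build_lookup(EMOJI_SHORTCUTS, custom={u'\U0001F44D': u'approval'})

print(default_lookup[u'\U0001F44D\U0001F3FF'])   # thumbs up dark skin tone
print(custom_lookup[u'\U0001F44D'])              # approval
```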

If you feel inspired and want to build your own extension, see this guide for some tips, tricks and best practices. With the growth of deep learning tools and techniques, there are now lots of models for predicting various types of NLP annotations. Models for tasks like coreference resolution, information extraction and summarization can now easily be used to power spaCy extensions – all you have to do is add the extension attributes, and hook the model into the pipeline. We're looking forward to seeing what you build!

About the Author

Ines Montani

Ines is a developer specialising in web applications for AI technology. She's a core developer of the spaCy Natural Language Processing library and Prodigy, an annotation tool for radically efficient machine teaching. Before founding Explosion AI, she was a freelance front-end developer and strategist, using her four years of executive experience in ad sales and digital marketing.
