As the release candidate for spaCy v2.0 gets closer,
we’ve been excited to implement some of the last outstanding features. One of
the best improvements is a new system for adding pipeline components and
registering extensions to the Doc, Span and Token objects. In this post,
we’ll introduce you to the new functionality, and finish with an
example extension package, spacymoji.
Previous versions of spaCy have been fairly difficult to extend. This has been
especially true of the core Doc, Token and Span objects. They’re not
instantiated directly, so creating a useful subclass would involve a lot of ugly
abstraction (think FactoryFactoryConfigurationFactory
classes). Inheritance is
also unsatisfying, because it gives no way to compose different customisations.
We want to let people develop extensions to spaCy, and we want to make sure
those extensions can be used together. If every extension required spaCy to return a different Doc subclass, there would be no way to do that. To solve
this problem, we’re introducing a new dynamic field that allows new attributes,
properties and methods to be added at run-time:
Setting custom attributes
import spacy
from spacy.tokens import Doc

Doc.set_extension("is_greeting", default=False)

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"hello world")
doc._.is_greeting = True
We think the ._ attribute strikes a nice balance between readability and explicitness. Extensions need to be nice to use, but it should also be obvious what is and isn’t built-in – otherwise there’s no way to track down the documentation or implementation of the code you’re reading. The ._ attribute also makes sure that updates to spaCy won’t break extension code through namespace conflicts.
The other thing that’s been missing for extension development was a convenient
way of modifying the processing pipeline. Early versions of spaCy hard-coded the
pipeline, because only English was supported. spaCy v1.0 allowed the pipeline to
be changed at run-time, but this has been mostly hidden away from the user:
you’d call nlp
on a text and stuff happens – but what? If you needed to add
a process that should run between tagging and parsing, you’d have to dig into
spaCy’s internals. In spaCy v2.0 there’s finally an API for that, and it’s as
simple as:
Adding custom components to the pipeline
nlp = spacy.load("en_core_web_sm")
component = MyComponent()
nlp.add_pipe(component, after="tagger")
doc = nlp(u"This is a sentence")
Custom pipeline components
Fundamentally, a pipeline is a list of functions called on a Doc
in order. The
pipeline can be set by a model, and modified by the user. A pipeline component
can be a complex class that holds state, or a very simple Python function that
adds something to a Doc
and returns it. Under the hood, spaCy performs the
following steps when you call nlp
on a string of text:
How the pipeline works
doc = nlp.make_doc(u'This is a sentence')  # create a Doc from raw text
for name, proc in nlp.pipeline:            # iterate over components in order
    doc = proc(doc)                        # call each component on the Doc
The nlp object is an instance of Language, which contains the data and annotation scheme of the language you’re using and a pre-defined pipeline of components, like the tagger, parser and entity recognizer. If you’re loading a model, the Language instance also has access to the model’s binary data. All of this is specific to each model, and defined in the model’s meta.json – for example, a Spanish NER model requires different weights, language data and pipeline components than an English parsing and tagging model. This is also why the pipeline state is always held by the Language class. spacy.load() puts this all together and returns an instance of Language with a pipeline set and access to the binary data.
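As a quick illustration (assuming the en_core_web_sm model is installed), you can inspect what spacy.load() gives you back:

import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.meta["lang"], nlp.meta["pipeline"])  # values read from the model's meta.json
print(nlp.pipe_names)                          # e.g. ['tagger', 'parser', 'ner']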
A spaCy pipeline in v2.0 is simply a list of (name, function) tuples, describing the component name and the function to call on the Doc object:
>>> nlp.pipeline
[('tagger', <spacy.pipeline.Tagger>),
 ('parser', <spacy.pipeline.DependencyParser>),
 ('ner', <spacy.pipeline.EntityRecognizer>)]
To make it more convenient to modify the pipeline, there are several
built-in methods to get, add,
replace, rename or remove individual components. spaCy’s default pipeline
components, like the tagger, parser and entity recognizer now all follow the
same, consistent API and are subclasses of Pipe. If you’re developing your own component, using the Pipe API will make it fully trainable and serializable. At a minimum, a component needs to be a callable that takes a Doc and returns it:
A simple custom component
def my_component(doc):
    print("The doc is {} characters long and has {} tokens.".format(len(doc.text), len(doc)))
    return doc
The component can then be added at any position of the pipeline using the nlp.add_pipe() method. The arguments before, after, first, and last let you specify component names to insert the new component before or after, or tell spaCy to insert it first (i.e. directly after tokenization) or last in the pipeline.
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(my_component, name="print_length", last=True)
doc = nlp(u"This is a sentence.")
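Beyond add_pipe(), the methods mentioned above let you inspect and modify the pipeline by component name. A quick sketch, continuing from the example (the component names are just illustrative):

nlp.rename_pipe("print_length", "doc_stats")  # rename the component we just added
nlp.replace_pipe("doc_stats", my_component)   # swap in a different function under that name
nlp.remove_pipe("doc_stats")                  # remove it again
print(nlp.pipe_names)                         # ['tagger', 'parser', 'ner']
print(nlp.get_pipe("ner"))                    # fetch a built-in component by name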
Extension attributes on Doc, Token and Span
When you implement your own pipeline components that modify the Doc, you often want to extend the API, so that the information you’re adding is conveniently accessible. spaCy v2.0 introduces a new mechanism that lets you register your own attributes, properties and methods that become available in the ._ namespace, for example, doc._.my_attr. There are three main types of extensions that can be registered via the set_extension() method:
- Attribute extensions. Set a default value for an attribute, which can be overwritten.
- Property extensions. Define a getter and an optional setter function.
- Method extensions. Assign a function that becomes available as an object method.
Adding custom extensions, properties and methods
Doc.set_extension("hello_attr", default=True)
Doc.set_extension("hello_property", getter=get_value, setter=set_value)
Doc.set_extension("hello_method", method=lambda doc, name: "Hi {}!".format(name))

doc._.hello_attr            # True
doc._.hello_property        # return value of get_value
doc._.hello_method("Ines")  # 'Hi Ines!'
Being able to easily write custom data to the Doc, Token and Span means that applications using spaCy can take full advantage of the built-in data structures and the benefits of Doc objects as the single source of truth containing all information:
- No information is lost during tokenization and parsing, so you can always relate annotations to the original string.
- The Token and Span are views of the Doc, so they’re always up-to-date and consistent.
- Efficient C-level access is available to the underlying TokenC* array via doc.c.
- APIs can standardise on passing around Doc objects, reading and writing from them whenever necessary. Fewer signatures make functions more reusable and composable.
For example, let’s say your data contains geographical information like country names, and you’re using spaCy to extract those names and add more details, like the country’s capital or GPS coordinates. Or maybe your application needs to find names of public figures using spaCy’s named entity recognizer, and check if a page about them exists on Wikipedia.
Before, you’d usually run spaCy over your text to get the information you’re
interested in, save it to a database and add more data to it later. This worked
well, but it also meant that you lost all references to the original document.
Alternatively, you could serialize your document and store the additional data
with references to their respective token indices. Again, this worked well, but
it was a pretty unsatisfying solution overall. In spaCy v2.0, you can simply
write all this data to custom attributes on a document, token or span, using a name of your choice. For example, token._.country_capital, span._.wikipedia_url or doc._.included_persons.
The following example shows a simple pipeline component that fetches all
countries using the REST Countries API, finds the
country names in the document, merges the matched spans, assigns the entity
label GPE
(geopolitical entity) and adds the country’s capital,
latitude/longitude coordinates and a boolean is_country
to the token
attributes. You can also find a
more detailed version
on GitHub.
Countries extension
import requests
from spacy.tokens import Token, Span
from spacy.matcher import PhraseMatcher

class Countries(object):
    name = 'countries'  # component name shown in pipeline

    def __init__(self, nlp, label="GPE"):
        # request all country data from the API
        r = requests.get("https://restcountries.eu/rest/v2/all")
        self.countries = {c['name']: c for c in r.json()}  # create dict for easy lookup
        # initialise the matcher and add patterns for all country names
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add("COUNTRIES", None, *[nlp(c) for c in self.countries.keys()])
        self.label = nlp.vocab.strings[label]  # get label ID from vocab
        # register extensions on the Token
        Token.set_extension("is_country", default=False)
        Token.set_extension("country_capital")
        Token.set_extension("country_latlng")

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # create Span for matched country and assign label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            for token in entity:  # set values of token attributes
                token._.set("is_country", True)
                token._.set("country_capital", self.countries[entity.text]["capital"])
                token._.set("country_latlng", self.countries[entity.text]["latlng"])
        doc.ents = list(doc.ents) + spans  # overwrite doc.ents and add entities – don't replace!
        for span in spans:
            span.merge()  # merge all spans at the end to avoid mismatched indices
        return doc  # don't forget to return the Doc!
The example also uses spaCy’s PhraseMatcher, which is another cool feature
introduced in v2.0. Instead of token patterns, the phrase matcher can take a
list of Doc
objects, letting you match large terminology lists fast and
efficiently. When you add the component to the pipeline and process a text, all
countries are automatically labelled as GPE
entities, and the custom
attributes are available on the token:
nlp = spacy.load("en_core_web_sm")
component = Countries(nlp)
nlp.add_pipe(component, before="tagger")
doc = nlp(u"Some text about Colombia and the Czech Republic")

print([(ent.text, ent.label_) for ent in doc.ents])
# [('Colombia', 'GPE'), ('Czech Republic', 'GPE')]

print([(token.text, token._.country_capital) for token in doc if token._.is_country])
# [('Colombia', 'Bogotá'), ('Czech Republic', 'Prague')]
Using getters and setters, you can also implement attributes on the Doc and Span that reference custom Token attributes – for example, whether a document contains countries. Since the getter is only called when you access the attribute, you can refer to the Token’s is_country attribute here, which is already set in the processing step. For a complete implementation, see the full example.
has_country = lambda tokens: any([token._.is_country for token in tokens])
Doc.set_extension("has_country", getter=has_country)
Span.set_extension("has_country", getter=has_country)
spaCy extensions
Having a straightforward API for custom extensions and a clearly defined input/output (Doc in, Doc out) also helps make larger code bases more maintainable, and allows developers to share their extensions with others and test them reliably. This is relevant for teams working with spaCy, but also for developers looking to publish their own packages, extensions and plugins.
We’re hoping that this new architecture will help encourage a community ecosystem of spaCy components to cover any potential use case – no matter how specific. Components can range from simple extensions adding fairly trivial attributes for convenience, to complex models making use of external libraries such as PyTorch, scikit-learn and TensorFlow. There are many components users may want, and we’d love to be able to offer more built-in pipeline components shipped with spaCy – for example, better sentence boundary detection, semantic role labelling and sentiment analysis. But there’s also a clear need for making spaCy extensible for specific use cases, making it interoperate better with other libraries, and putting all of it together to update and train statistical models.
Example: Emoji handling with spacymoji
Adding better emoji support to spaCy has long been on my list of “cool things to
build sometime”. Emoji are fun, hold a lot of relevant semantic information and,
supposedly, are
now more common in Twitter text than hyphens. Over the past two years, they have
also become vastly more complex. Aside from the regular emoji characters and
their unicode representations, you can now also use skin tone modifiers that are
placed after a regular emoji, and result in one visible character. For
example, 👍 + 🏿 = 👍🏿. In addition, some characters can form “ZWJ sequences”, e.g. two or more emoji joined by a Zero Width Joiner (U+200D) that are merged into one symbol. For example, 👨 + ZWJ + 🎤 = 👨‍🎤 (official title is “man singer”, I call it “Bowie”).
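To make the mechanics concrete, here’s a small sketch of composing such a sequence from its individual code points in Python:

man = u"\U0001F468"         # 👨 MAN
zwj = u"\u200D"             # ZERO WIDTH JOINER
microphone = u"\U0001F3A4"  # 🎤 MICROPHONE
man_singer = man + zwj + microphone  # three code points, rendered as a single 👨‍🎤 glyph
print(man_singer)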
As of v2.0, spaCy’s tokenizer splits all emoji and other symbols into individual
tokens, making them easier to separate from the rest of your text. However,
emoji unicode ranges are fairly arbitrary and updated often. The
\p{Other_Symbol}
or \p{So}
category, which spaCy’s tokenizer uses, is a good
approximation, but it also includes other icons and dingbats. So if you want to
handle only emoji, there’s no way around matching against an exact list.
Luckily, the emoji package has us covered
here.
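To illustrate why the category is only an approximation, here’s a small sketch using the third-party regex package, which supports Unicode property classes (unlike the built-in re module):

import regex  # pip install regex – adds support for \p{...} Unicode properties

text = u"thumbs up 👍, a die face ⚀ and a sound recording sign ℗"
print(regex.findall(r"\p{So}", text))  # all three match \p{So}, but only 👍 is an emoji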
spacymoji is a spaCy extension and pipeline
component that detects individual emoji and sequences in your text, merges them
into one token and assigns custom attributes to the Doc, Span and Token.
For example, you can check if a document or span includes an emoji, check
whether a token is an emoji and retrieve its human-readable description.
import spacy
from spacymoji import Emoji

nlp = spacy.load('en')
emoji = Emoji(nlp)
nlp.add_pipe(emoji, first=True)

doc = nlp(u"This is a test 😻 👍🏿")
assert doc._.has_emoji
assert len(doc._.emoji) == 2
assert doc[2:5]._.has_emoji
assert doc[4]._.is_emoji
assert doc[5]._.emoji_desc == u'thumbs up dark skin tone'
assert doc._.emoji[1] == (u'👍🏿', 5, u'thumbs up dark skin tone')
The spacymoji component uses the PhraseMatcher to find occurrences of the exact
emoji sequences in the emoji lookup table and generates the respective emoji
spans. It also merges them into one token if the emoji consists of more than one
character – for example, an emoji with a skin tone modifier or a combined ZWJ
sequence. The emoji shortcut, e.g. :thumbs_up:
, is converted to a
human-readable description, available as token._.emoji_desc
. You can also pass
in your own lookup table, mapping emoji to custom descriptions.
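For example, a short sketch following the usage shown above, assuming the lookup keyword argument:

nlp = spacy.load('en')
emoji = Emoji(nlp, lookup={u'👨‍🎤': u'David Bowie'})
nlp.add_pipe(emoji, first=True)
doc = nlp(u"We can be 👨‍🎤 heroes")
assert doc[3]._.emoji_desc == u'David Bowie'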
Next steps
If you feel inspired and want to build your own extension, see this guide for some tips, tricks and best practices. With the growth of deep learning tools and techniques, there are now lots of models for predicting various types of NLP annotations. Models for tasks like coreference resolution, information extraction and summarization can now easily be used to power spaCy extensions – all you have to do is add the extension attributes, and hook the model into the pipeline (see the short sketch below). We’re looking forward to seeing what you build!
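For instance, wiring a hypothetical external sentiment model into the pipeline could look roughly like this (the model and attribute names are made up for illustration):

from spacy.tokens import Doc

Doc.set_extension("sentiment", default=None)

def sentiment_component(doc):
    # `external_sentiment_model` is a hypothetical stand-in for any third-party model
    doc._.sentiment = external_sentiment_model(doc.text)
    return doc

nlp.add_pipe(sentiment_component, last=True)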