The SpanCategorizer is a spaCy component that answers the NLP community’s need for structured annotation of a wide variety of labeled spans, including long phrases, non-named entities, and overlapping annotations. In this blog post, we’re excited to talk more about spancat and showcase new features to help with your span labeling needs!
A large portion of the NLP community treats span labeling as an entity extraction problem. However, these two are not the same. Entities typically have clear token boundaries and are comprised of syntactic units like proper nouns. Meanwhile, spans can be arbitrary, often consisting of noun phrases and sentence fragments. Occasionally, these spans even overlap! Take this for example:
The FACTOR label covers a large span of tokens, which is unusual in standard NER. Most ner entities are short and distinguishable, but this example has long and vague ones. Also, tokens such as “septic,” “shock,” and “bacteremia” belong to more than one span, rendering them incompatible with spaCy’s ner component.
The text above is just one of the many examples you’ll find in span labeling. So during its v3.1 release, spaCy introduced the SpanCategorizer, a new component that handles arbitrary and overlapping spans. Unlike the EntityRecognizer, the spancat component provides more flexibility in how you tackle your entity extraction problems. It does this by offering:
- Explicit control of candidate spans. Users can define rules on how spancat obtains likely candidates via suggester functions. You can use this to bias your model towards precision or recall, customizing it further depending on your use case.
- Access to confidence scores. Unlike ner, a spancat model returns predicted label probabilities over the whole span, giving us more meaningful confidences to threshold against. You can use this value to filter your results and tune the performance of your system.
- Less edge-sensitivity. Sequence-based NER models typically predict single token-based tags that are very sensitive to boundaries. Although this approach is effective for proper nouns and self-contained expressions, it is less useful for other types of phrases or overlapping spans.
How spancat works
From a high-level perspective, we can divide spancat into two parts: the suggester and the classifier. The suggester is a custom function that extracts possible span candidates from the text and feeds them into the classifier. These suggester functions can be completely rule-based, depend on annotations from other components, or use machine learning approaches.
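To make the suggester’s contract concrete, here’s a minimal rule-based sketch that proposes whole sentences as candidates and returns them as a Ragged array of (start, end) token offsets, which is the format spancat expects. The registry name custom_sentence_suggester.v1 is made up for this example (a real sentence suggester ships in spacy-experimental, covered later), and the sketch assumes sentence boundaries have been set by an earlier component:

```python
from typing import Callable, Iterable, Optional

from spacy.tokens import Doc
from spacy.util import registry
from thinc.api import Ops, get_current_ops
from thinc.types import Ragged


@registry.misc("custom_sentence_suggester.v1")  # hypothetical name for this sketch
def build_sentence_suggester() -> Callable:
    def suggest(docs: Iterable[Doc], *, ops: Optional[Ops] = None) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        spans = []    # (start, end) token offsets of every candidate
        lengths = []  # number of candidates per doc
        for doc in docs:
            # Requires sentence boundaries, e.g. from a senter or parser
            sents = [(sent.start, sent.end) for sent in doc.sents]
            spans.extend(sents)
            lengths.append(len(sents))
        if spans:
            data = ops.xp.asarray(spans, dtype="i")
        else:
            data = ops.xp.zeros((0, 0), dtype="i")
        return Ragged(data, ops.asarray1i(lengths))

    return suggest
```

In the config, the `[components.spancat.suggester]` block would then reference this function via `@misc = "custom_sentence_suggester.v1"`.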
The suggested spans go into the classifier, which then predicts the correct label for every span. By including the context of the whole span, the classifier can catch informative words deep within it. We can divide the classifier into three steps: Embedding, Pooling, and Scoring.
- Embedding: we obtain the tok2vec representation of the candidate spans so we can work on them numerically.
- Pooling: we reduce the sequences to make the model more robust, then encode the context using a window encoder.
- Scoring: we perform multilabel classification on these pooled spans, thereby returning model predictions and label probabilities.
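These steps map directly onto blocks in the training config. Below is a trimmed sketch of a typical spancat block, roughly what the quickstart generates for a CPU pipeline; the exact default values may differ between spaCy versions. The comments mark which sub-block handles embedding, pooling, and scoring:

```ini
[components.spancat]
factory = "spancat"
spans_key = "sc"

# Suggester: proposes candidate spans (here, 1- to 3-token n-grams)
[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1, 2, 3]

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

# Embedding: token vectors for the candidate spans
[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.spancat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 1000, 2500, 2500]
include_static_vectors = false

[components.spancat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

# Pooling: reduce each span's token vectors to a fixed-size representation
[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

# Scoring: multilabel classification over the pooled spans
[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
```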
What’s nice about spancat is that it offers enough flexibility for you to customize its architecture. For example, you can swap the MaxoutWindowEncoder for a MishWindowEncoder by simply updating the config file!
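With the config sketched above, that swap amounts to changing the encode block; note that MishWindowEncoder doesn’t take a maxout_pieces setting:

```ini
[components.spancat.model.tok2vec.encode]
@architectures = "spacy.MishWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
```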
Architecture case study on nested NER
We developed a spaCy project that demonstrates how the architectural differences between ner and spancat lead to vastly different approaches to the same problem. We tackle a common use case in span labeling: nested NER. Here, tokens can be part of multiple spans simultaneously, owing to the hierarchical or multilabel nature of the dataset. Take this sentence, for example:
This text came from GENIA, a corpus of biomedical literature lifted from Medline abstracts. It contains five labels—DNA, RNA, cell line, cell type, and protein. Here, the span “Human IL4” is by itself a protein. Additionally, “Human IL4 promoter” is a DNA sequence that binds to that specific protein to initiate a process called transcription. Both spans contain the tokens “Human” and “IL4”, so we should assign multiple labels to both, resulting in nested spans.
Suppose we attempt this problem using the EntityRecognizer. In that case, we have to train one model for each entity type and then write a custom component that combines their outputs into a coherent set. This approach adds significant complexity, as the number of models scales with the number of labels. It works, but it’s far from ideal:
Another option is to lean into the ner paradigm and preprocess the dataset to remove its nestedness. This approach might involve rethinking our annotations (perhaps combining the overlapping span types as a new label or eliminating them altogether) and turning the nested NER problem into a non-nested entity extraction one. The resulting model won’t predict nested entities anymore.
With spancat, we can store all spans in a single Doc and train a single model for all of them. In spaCy, we can store these spans inside the Doc.spans attribute under a specified key. This approach makes experimentation much more convenient:
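As a small sketch using the GENIA example from above (the "sc" key is spancat’s default spans key):

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Human IL4 promoter activity")

# Overlapping spans can live side by side under a single key
doc.spans["sc"] = [
    Span(doc, 0, 2, label="protein"),  # "Human IL4"
    Span(doc, 0, 3, label="DNA"),      # "Human IL4 promoter"
]
```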
We’re currently working on a more detailed performance comparison using various datasets outside the nested NER use case.
New features
With this blog post, we’re bringing some nice additions to spancat! 🎉
Analyze and debug your spancat datasets
From v3.3.1 onwards, spaCy’s debug data command fully supports validation and analysis of your spancat datasets. If you have a spancat component in your pipeline, you’ll receive insights such as the size of your spans, potential overlaps between training and evaluation sets, and more!
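For example, assuming your training and development corpora are referenced from config.cfg, a single command gives you the full report:

```bash
python -m spacy debug data config.cfg
```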
In addition, debug data provides further analyses of your dataset. For example, you can determine how long your spans are or how distinct they are compared to the rest of the corpus. You can then use this information to refine your spancat configuration further.¹
Generate a spancat config with sensible defaults
The spaCy training quickstart allows you to generate config.cfg files and lets you choose the individual components for your pipeline with the recommended settings. We’ve updated the configuration for spancat, giving you more sensible defaults to improve your training!
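If you prefer the command line over the web widget, init config generates the same recommended settings; this assumes spancat is available as a quickstart component in your spaCy version:

```bash
python -m spacy init config config.cfg --lang en --pipeline spancat
```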
Visualize overlapping spans
We’ve added support for spancat in our visualization library, displaCy. If your dataset has overlapping spans or some hierarchical structure, you can use the new "span" style to display them.
Using displaCy to visualize overlapping spans
text = "Welcome to the Bank of China."doc = nlp(text)doc.spans["sc"] = [Span(doc, 3, 6, "ORG"),Span(doc, 5, 6, "GPE")]displacy.serve(doc, style="span")
Similar to other displaCy styles, you’re free to configure the look and feel of the span annotations. In this example, we’ve color-coded the spans based on their type:
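For instance, reusing the doc from the snippet above, you can pass a colors mapping through the options argument (the hex values here are arbitrary):

```python
# Map each span label to a CSS color; any valid CSS color string works
options = {"colors": {"ORG": "#7aecec", "GPE": "#feca74"}}
displacy.serve(doc, style="span", options=options)
```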
New array of span suggester functions
We’ve also added three new rule-based suggester functions in the 0.5.0 release of our spacy-experimental repository that depend on annotations from other components.
- The subtree-suggester uses dependency annotation to suggest tokens with their syntactic descendants.
- The chunk-suggester suggests noun chunks using the noun chunk iterator, which requires POS and dependency annotation.
- The sentence-suggester uses sentence boundaries to suggest sentence spans as candidates.
Each of these suggesters also comes with a variant that additionally suggests n-gram spans. Try them out in this spaCy project that showcases how to include them in your pipeline!
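As a rough sketch of how they plug in, the suggester block in your spancat config points at one of the experimental functions. The registered name and parameters below follow the spacy-experimental README as we understand it; double-check the exact strings against the version you have installed:

```ini
# Suggest each token together with its syntactic descendants;
# the ngram_* variants additionally suggest n-grams of the given sizes
[components.spancat.suggester]
@misc = "spacy-experimental.ngram_subtree_suggester.v1"
sizes = [1, 2, 3]
```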
Learn span boundaries using SpanFinder
The SpanFinder is a new experimental component that identifies span boundaries by tagging potential start and end tokens. It’s an ML approach to suggest fewer but more precise span candidates than the ngram suggester.
When using the ngram suggester, the number of suggested candidates can get high, slowing down spancat training and increasing its memory usage. We designed the SpanFinder to produce fewer candidates to solve these problems. In practice, it makes sense to experiment with different suggester functions to determine what works best for your specific use case.
We’ve prepared a spaCy project that showcases how to use the SpanFinder and how it compares to the ngram suggester on the GENIA dataset:
Comparison between n-gram and SpanFinder suggesters on GENIA
| Metric | SpanFinder | n-gram (1-10) |
| --- | --- | --- |
| F-score | 0.7122 | 0.7201 |
| Precision | 0.7672 | 0.7595 |
| Recall | 0.6646 | 0.6847 |
| Suggested candidates | 13,754 | 486,903 |
| Actual entities | 5,474 | 5,474 |
| % Ratio | 251% | 8894% |
| % Coverage | 75.61% | 99.53% |
| Speed (tokens/sec) | 10,465 | 4,807 |
Final thoughts
As we encounter more NLP use cases from the community, we recognize that entity extraction is of limited use for many span labeling problems. Spancat is our answer to this challenge. With these new features, you now have more tools to analyze, train, and visualize your spans. You can use spancat as an alternative to NER or as an additional component in your existing pipeline.
You can find out more about span categorization in the spaCy docs, and be sure to check out our experimental suggesters and example projects. We also appreciate community feedback, so we hope to see you in our discussions forum soon!
Footnotes
1. Most of the span characteristics were based on Papay et al.’s work on Dissecting Span Identification Tasks with Performance Prediction (EMNLP 2020). ↩