© Kemal Şanlı

The spaCy user survey: results and analysis

by Matthew Honnibal & Ines Montani

In the run-up to the v1.0 release, we asked the spaCy community to give us their feedback on the library. If you're one of the 224 people who took part, thanks! Here's what we've learned from your responses, how we're already using them to improve the library, and what we're planning next.

The user survey took place between September 29 and November 6, and had a completion rate of 44%. For a full set of the questions and frequency histograms, check out the Typeform report. In this post, we're taking a closer look at some of the numbers and what they say about spaCy usage and the state of NLP.

Like other NLP libraries, spaCy offers a variety of text annotation modules. Naturally, we want to know which of these modules are most important to people.

Which of spaCy's capabilities are most important to you?
Tokens: 61%
Rule matcher: 13%
Sentence boundaries: 40%
Word vectors: 45%
Part-of-speech tags: 74%
Named entities: 61%
Dependency tree: 54%
Dependency labels: 40%

It's great to see that all of the library's capabilities are proving useful. The only feature that's important to less than 40% of the participants is the rule matcher, which has only recently been fully documented.

One of the most surprising findings here is the relatively low usage of the word vectors. A common use-case for spaCy should be, "I just want to split the text into tokens, and look up a pre-trained embedding for each token." That's the normal start to most deep learning with text, after all. The low word vector usage suggested that this aspect of the library needed to improve. To address this, we've changed the default word vector data to the 300-dimensional GloVe common crawl word vectors, refined the API for loading custom models, and improved the documentation.
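
As a rough illustration of the use-case described above, here is a minimal pure-Python sketch: split the text into tokens, then look up a pre-trained vector for each one. It deliberately does not use spaCy's API. The `TOY_VECTORS` table, the regex tokenizer and the helper names are all hypothetical stand-ins, and the vectors here have 3 made-up dimensions rather than the 300 of the GloVe vectors.

```python
import re

# Toy 3-dimensional "pre-trained" vectors (made-up values for illustration).
TOY_VECTORS = {
    "spacy": [0.1, 0.8, 0.3],
    "parses": [0.7, 0.2, 0.5],
    "text": [0.4, 0.4, 0.9],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback for out-of-vocabulary tokens


def tokenize(text):
    """Very rough tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def embed(text):
    """Return (token, vector) pairs, looking tokens up case-insensitively."""
    return [(tok, TOY_VECTORS.get(tok.lower(), UNKNOWN)) for tok in tokenize(text)]


pairs = embed("spaCy parses text.")
for token, vector in pairs:
    print(token, vector)
```

In spaCy itself, the equivalent lookup is exposed on each token through its `vector` attribute, so no separate table is needed.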

Researchers working on statistical parsing are often some of the most pessimistic about its uptake.

Parsing is really the best example of the type of NLP technology that we want to be practical for people. Overall, 54% of spaCy's users said they found the dependency tree important for their work, and 40% said the same for dependency labels. To understand this a little better, we looked at how usage of the tree and labels differed amongst different types of spaCy users.

Who uses spaCy's dependency parser?
(first figure: uses the dependency tree; second figure: uses dependency labels)
Participants using spaCy for non-academic work: 60% / 38%
Participants evaluating spaCy for a work project: 44% / 35%
Participants using spaCy to produce academic publications: 70% / 65%
Participants using spaCy for personal projects: 53% / 42%

Dependency labels add a lot of information about the syntactic structure that can't be easily recovered from the unlabelled tree. Most applications that make use of the dependency tree could probably also benefit from the labels. We'd like to provide more resources and tutorials that make it easier to take advantage of all the information that the dependency parser produces.
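
To make the difference concrete, here is a small sketch of what the labels add on top of the bare tree. The parse is written by hand as plain tuples rather than produced by spaCy, the label names (`nsubj`, `dobj`, `ROOT`) follow common dependency conventions, and the two helper functions are hypothetical.

```python
# Each token: (text, index of its head, dependency label).
# The root points at itself, a convention chosen for this sketch.
PARSE = [
    ("Dogs", 1, "nsubj"),
    ("chase", 1, "ROOT"),
    ("cats", 1, "dobj"),
]


def children(parse, head_index):
    """Unlabelled view: which tokens attach to the given head."""
    return [text for i, (text, head, _) in enumerate(parse)
            if head == head_index and i != head_index]


def children_with_label(parse, head_index, label):
    """Labelled view: only the dependents that stand in the given relation."""
    return [text for i, (text, head, dep) in enumerate(parse)
            if head == head_index and dep == label and i != head_index]


print(children(PARSE, 1))                      # both arguments, roles unknown
print(children_with_label(PARSE, 1, "nsubj"))  # who is doing the chasing
print(children_with_label(PARSE, 1, "dobj"))   # who is being chased
```

The unlabelled view only says that both nouns attach to the verb. The labelled view says which one is the subject and which is the object, and that is usually the part an application needs.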

We asked participants whether they used spaCy at work. Separately, we also asked what type of work people did. Here you can see the results for two segments: everyone, and non-researchers. Researchers were identified as participants who answered "I use spaCy to produce academic publications" to the question "Which of these best describes how you use spaCy?".

Do you use spaCy at work?
(first figure: everyone, 213 participants; second figure: non-academics, 177 participants)
Yes, we use it in production: 20% / 18%
Yes, we use it in development: 13% / 12%
Yes, we use it for research: 31% / 18%
No, but we're planning to: 17% / 14%
No, and we're currently not planning to: 14% / 14%

Adding up the production, development and research categories for the non-academics, 102 people told us they were using spaCy at work, and another 30 said they're planning to start. So, what are they all doing at work, if not writing papers?
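
The headcounts above can be reproduced approximately from the chart percentages. The sketch below assumes that the non-academic percentages are expressed relative to all 213 respondents. That base is an inference from the figures, not something the survey report states, and because the published percentages are rounded, the reconstructed counts are estimates.

```python
TOTAL_RESPONDENTS = 213  # "everyone" segment in the chart

# Non-academic share of each answer, in percent, as read from the chart.
non_academic_pct = {
    "production": 18,
    "development": 12,
    "research": 18,
    "planning": 14,
}

# Assumption: each percentage is relative to all 213 respondents.
counts = {answer: round(TOTAL_RESPONDENTS * pct / 100)
          for answer, pct in non_academic_pct.items()}

using_at_work = counts["production"] + counts["development"] + counts["research"]
print(using_at_work, counts["planning"])
```

Under that assumption, the three "yes" categories sum to 102 and the "planning to" category comes to 30, which matches the figures quoted in the text.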

What does your company do?
(first figure: use spaCy at work, 140 participants; second figure: use spaCy at work in production)
Software as a service: 57% / 23%
Consulting: 28% / 9%
Bots: 21% / 6%
Mobile or desktop apps: 13% / 2%
Non-software goods and services: 11% / 2%

By far the most common type of commercial activity was software as a service (SaaS). There were also plenty of companies providing consulting services, and a growing segment of companies developing chat bots.

Participants were asked to give spaCy one to five star ratings in four categories. The responses to usability and stability are very encouraging. spaCy is multi-threaded, uses manual memory management, and reads and writes binary data with multiple homebrewed binary serialisation formats. There's a lot that can go wrong! Despite this, we're managing to achieve pretty good stability.

How would you rate spaCy?
Accuracy: 4.1 / 5
Documentation: 3.2 / 5
Reliability and stability: 4.3 / 5
Usability and API: 4.2 / 5

It's clear that the biggest problem with spaCy was the documentation. In particular, the survey participants called for more tutorials and examples. We've since rewritten the docs and introduced a new structure. The biggest improvement has come from finally enforcing a proper separation between the API reference and usage workflows.

There are still some topics missing from the docs, and there's always going to be room for more examples. You can follow us on Twitter to get notified of updates as they're published, or suggest changes on the issue tracker.

We also asked participants to tell us what other NLP software they are working with. Most said they were using multiple open-source libraries, and no cloud services.

Which other open-source libraries do you use for NLP?
NLTK: 70%
Scikit-Learn: 65%
Gensim: 57%
TensorFlow: 42%
CoreNLP: 28%
Keras: 27%

Naturally, most people use spaCy alongside a machine learning library. The most popular is Scikit-Learn, followed by TensorFlow and Keras, commonly used together. Similarly, most participants are using spaCy in conjunction with another open-source NLP library, NLTK being the most popular choice, followed by Gensim and CoreNLP.

Who uses SaaS for NLP?
(first figure: uses SaaS offerings for NLP; second figure: uses Google's SaaS offerings for NLP)
Participants using spaCy at work: 22% / 11%
Participants evaluating spaCy for a work project: 28% / 12%
Participants using spaCy to produce academic publications: 20% / 11%
Participants using spaCy for personal projects: 18% / 6%

Most participants said they did not make use of any cloud services for NLP. The clear preference for open-source libraries may be a sampling effect: after all, we're surveying users of an open-source library. However, this matches up well with the results we've collected so far for our broader State of AI survey.

Of course, we also asked what people would like to see next and how urgently they needed the functionality. A rating of five was described as "I would otherwise have to build this functionality myself". The two most requested features were semantic parsing and named entity disambiguation. Research on both problems is moving forward quickly, making the technologies much more practical.

Which of these future capabilities of spaCy would you be most excited about?
(first figure: overall responses; second figure: rated very important, 5 out of 5)
Semantic parsing / slot filling: 33% / 14%
Named entity disambiguation: 31% / 12%
Sentiment analysis: 15% / 4%
Time and date parsing: 10% / 2%
Morphological analysis: 5% / 1%

What are semantic parsing and entity disambiguation?

Semantic parsing gives you back structured meaning representations that normalize away surface variation. It's useful for a lot of NLP problems, especially for creating chat bots.

A named entity disambiguation system returns IDs that reference a knowledge base, such as a Wikipedia link. This lets you finally compute with grounded semantics, instead of using strings that are only meaningful in relation to each other.
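
As a toy illustration of what "returning IDs that reference a knowledge base" means, the sketch below maps an ambiguous mention to one of two knowledge-base entries. Everything in it is hypothetical: the IDs, the alias table and the keyword-overlap heuristic are made up for this example and do not reflect how spaCy or any production system resolves entities.

```python
# Hypothetical knowledge base: ID -> canonical name and context keywords.
KNOWLEDGE_BASE = {
    "KB:paris_france": {"name": "Paris (France)",
                        "keywords": {"france", "seine", "louvre"}},
    "KB:paris_texas": {"name": "Paris (Texas)",
                       "keywords": {"texas", "lamar", "county"}},
}

# Hypothetical alias table: surface string -> candidate IDs.
ALIASES = {"paris": ["KB:paris_france", "KB:paris_texas"]}


def disambiguate(mention, context):
    """Pick the candidate whose keywords overlap most with the context words."""
    candidates = ALIASES.get(mention.lower(), [])
    if not candidates:
        return None
    words = set(context.lower().split())
    return max(candidates,
               key=lambda kb_id: len(KNOWLEDGE_BASE[kb_id]["keywords"] & words))


a = disambiguate("Paris", "We visited the Louvre in Paris")
b = disambiguate("Paris", "Paris is the seat of Lamar County in Texas")
print(a, b, a == b)
```

The two mentions are the same string, but they resolve to different IDs, so downstream code can compare entities by ID instead of by surface form.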

Our priorities for the last few months have been to improve the documentation, address the backlog on the issue tracker, and make it easier to do deep learning with spaCy. We're now focussing on adding more languages. spaCy v1.2 adds alpha tokenization support for Chinese, Spanish, French, Italian and Portuguese. The tokenizers need to be told about all the special cases for these languages — we're hoping many hands will make light work.

About the Author

Matthew Honnibal

Matthew is a leading expert in AI technology, known for his research, software and writings. He completed his PhD in 2009, and spent a further 5 years publishing research on state-of-the-art natural language understanding systems. Anticipating the AI boom, he left academia in 2014 to develop spaCy, an open-source library for industrial-strength NLP.
About the Author

Ines Montani

Ines is a developer specialising in web applications for AI technology, letting humans get knowledge to and from machine learning models. She's been working on the spaCy project since its first release. Before founding Explosion AI, she was a freelance front-end developer and strategist, drawing on her four years of executive experience in ad sales and digital marketing.
