In the run-up to the v1.0 release, we asked the spaCy community to give us their feedback on the library. If you're one of the 224 participants who took part — thanks! Here's what we've learned from your responses, how we're already using them to improve the library, and what we're planning next.
The user survey took place between September 29 and November 6, and had a completion rate of 44%. For a full set of the questions and frequency histograms, check out the Typeform report. In this post, we're taking a closer look at some of the numbers and what they say about spaCy usage and the state of NLP.
Like other NLP libraries, spaCy offers a variety of text annotation modules. Naturally, we want to know which of these modules are most important to people.
It's great to see that all of the library's capabilities are proving useful. The only feature that's important to fewer than 40% of the participants is the rule matcher, which has only recently been fully documented.
One of the most surprising findings here is the relatively low usage of the word vectors. A common use-case for spaCy should be, "I just want to split the text into tokens, and look up a pre-trained embedding for each token." That's the normal start to most deep learning with text, after all. The low word vector usage suggested that this aspect of the library needed to improve. To address this, we've changed the default word vector data to the 300-dimensional GloVe common crawl word vectors, refined the API for loading custom models, and improved the documentation.
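To make that use-case concrete, here is a minimal sketch in plain Python. It deliberately does not use spaCy's API: the `tokenize` helper, the tiny `VECTORS` table and its 3-dimensional values are all invented for illustration, standing in for a real tokenizer and a pre-trained table such as the 300-dimensional GloVe vectors.

```python
import re

# Hypothetical toy embedding table; real pre-trained vectors have hundreds
# of dimensions and a far larger vocabulary.
VECTORS = {
    "i": [0.1, 0.0, 0.2],
    "like": [0.3, 0.5, 0.1],
    "cats": [0.7, 0.2, 0.9],
}
DIM = 3


def tokenize(text):
    """Very naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def embed(text):
    """Return (token, vector) pairs, using a zero vector for unknown tokens."""
    zero = [0.0] * DIM
    return [(tok, VECTORS.get(tok, zero)) for tok in tokenize(text)]


pairs = embed("I like cats!")
for token, vector in pairs:
    print(token, vector)
```

The zero vector for out-of-vocabulary tokens is just one possible convention; a real pipeline might instead use a learned "unknown" vector or skip such tokens.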
Researchers working on statistical parsing are often some of the most pessimistic about its uptake.
Parsing is really the best example of the type of NLP technology that we want to be practical for people. Overall, 54% of spaCy's users said they found the dependency tree important for their work, and 40% said the same for dependency labels. To understand this a little better, we looked at how usage of the tree and labels differed amongst different types of spaCy users.
Dependency labels add a lot of information about the syntactic structure that can't be easily recovered from the unlabelled tree. Most applications that make use of the dependency tree could probably also benefit from the labels. We'd like to provide more resources and tutorials that make it easier to take advantage of all the information that the dependency parser produces.
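A small hand-written example shows what the labels add. The sketch below is plain Python, not spaCy code: the `PARSE` list, the helper functions and the choice of label names (`nsubj`, `dobj`) are assumptions made for illustration. With the unlabelled tree you can see that two nouns attach to the verb; only the labels tell you which one is the subject and which the object.

```python
# Hand-written toy parse of "The dog chased the cat".
# Each entry is (word, index of head, dependency label); the root points at itself.
PARSE = [
    ("The", 1, "det"),
    ("dog", 2, "nsubj"),
    ("chased", 2, "ROOT"),
    ("the", 4, "det"),
    ("cat", 2, "dobj"),
]


def children(parse, head_index):
    """Unlabelled view: indices of the tokens attached to the given head."""
    return [
        i for i, (_, head, _) in enumerate(parse)
        if head == head_index and i != head_index
    ]


def child_with_label(parse, head_index, label):
    """Labelled view: the word attached to the head with the given label, or None."""
    for i in children(parse, head_index):
        word, _, dep = parse[i]
        if dep == label:
            return word
    return None


# The root is the token that is its own head.
root = next(i for i, (_, head, _) in enumerate(PARSE) if head == i)
attached = [PARSE[i][0] for i in children(PARSE, root)]
subject = child_with_label(PARSE, root, "nsubj")
obj = child_with_label(PARSE, root, "dobj")

print(attached)      # the tree alone: both nouns attach to the verb
print(subject, obj)  # the labels say which noun plays which role
```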
We asked participants whether they used spaCy at work. Separately, we also asked what type of work people did. Here you can see the results for two segments: everyone, and non-researchers. Researchers were identified as participants who answered "I use spaCy to produce academic publications" to the question "Which of these best describes how you use spaCy?".
Adding up the production, development and research categories for the non-academics, 102 people told us they were using spaCy at work, and another 30 said they're planning to start. So, what are they all doing at work, if not writing papers?
By far the most common type of commercial activity was software as a service (SaaS). There were also plenty of companies providing consulting services, and a growing segment of companies developing chat bots.
Participants were asked to give spaCy one to five star ratings in four categories. The responses to usability and stability are very encouraging. spaCy is multi-threaded, uses manual memory management, and reads and writes binary data with multiple homebrewed binary serialisation formats. There's a lot that can go wrong! Despite this, we're managing to achieve pretty good stability.
It's clear that the biggest problem with spaCy was the documentation. In particular, the survey participants called for more tutorials and examples. We've since rewritten the docs and introduced a new structure. The biggest improvement has come from finally enforcing a proper separation between the API reference and usage workflows.
There are still some topics missing from the docs, and there's always going to be room for more examples. You can follow us on Twitter to get notified of updates as they're published, or suggest changes on the issue tracker.
We also asked participants to tell us what other NLP software they are working with. Most said they were using multiple open-source libraries, and no cloud services.
Naturally, most people use spaCy alongside a machine learning library. The most popular is Scikit-Learn, followed by TensorFlow and Keras, commonly used together. Similarly, most participants are using spaCy in conjunction with another open-source NLP library, NLTK being the most popular choice, followed by Gensim and CoreNLP.
Most participants said they did not make use of any cloud services for NLP. The clear preference for open-source libraries may be a sampling effect: after all, we're surveying users of an open-source library. However, this matches up well with the results we've collected so far for our broader State of AI survey.
Of course, we also asked what people would like to see next and how urgently they needed the functionality. A rating of five was described as "I would otherwise have to build this functionality myself". The two most requested features were semantic parsing and named entity disambiguation. Research on both problems is moving forward quickly, making the technologies much more practical.
A named entity disambiguation system returns IDs that reference entries in a knowledge base, such as links to Wikipedia pages. This lets you finally compute with grounded semantics, instead of using strings that are only meaningful in relation to each other.
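To illustrate the difference between strings and grounded IDs, here is a toy sketch in plain Python. Everything in it is hypothetical: the knowledge base, the `KB:...` IDs, the alias table and the keyword-overlap scoring are invented for the example and are far simpler than what a real disambiguation system would do.

```python
# Hypothetical mini knowledge base: ID -> (canonical name, context keywords).
# The IDs are invented for this example and do not refer to any real resource.
KNOWLEDGE_BASE = {
    "KB:001": ("Paris (city in France)", {"france", "city", "seine", "capital"}),
    "KB:002": ("Paris (figure in Greek mythology)", {"troy", "helen", "myth", "prince"}),
}

# Alias table: lowercased surface string -> candidate IDs.
ALIASES = {"paris": ["KB:001", "KB:002"]}


def disambiguate(mention, context):
    """Pick the candidate whose keywords overlap most with the context words."""
    candidates = ALIASES.get(mention.lower(), [])
    if not candidates:
        return None
    words = set(context.lower().split())
    return max(candidates, key=lambda kb_id: len(KNOWLEDGE_BASE[kb_id][1] & words))


id_a = disambiguate("Paris", "Paris is the capital of France")
id_b = disambiguate("Paris", "Paris carried Helen off to Troy")

# The two mentions are identical as strings, but resolve to different entries.
print(id_a, id_b, id_a == id_b)
```

Once mentions are mapped to IDs, downstream code can compare or join on the ID rather than on the ambiguous surface string.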
Our priorities for the last few months have been to improve the documentation, address the backlog on the issue tracker, and make it easier to do deep learning with spaCy. We're now focussing on adding more languages. spaCy v1.2 adds alpha tokenization support for Chinese, Spanish, French, Italian and Portuguese. The tokenizers need to be told about all the special cases for these languages — we're hoping many hands will make light work.
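As a rough picture of what "special cases" means here, the following plain-Python sketch shows a tokenizer that consults an exception table before applying its generic rule. It is not spaCy's tokenizer: the `SPECIAL_CASES` table, the `tokenize` function and the punctuation rule are assumptions made for illustration, and a real language needs many more entries.

```python
# Hypothetical special-case table: lowercased surface form -> its tokens.
SPECIAL_CASES = {
    "don't": ["do", "n't"],  # contraction split into two tokens
    "etc.": ["etc."],        # keep the full stop attached to the abbreviation
}


def tokenize(text, special_cases):
    """Split on whitespace, then apply special cases or peel off trailing punctuation."""
    tokens = []
    for chunk in text.split():
        key = chunk.lower()
        if key in special_cases:
            # Exceptions win over the generic punctuation rule below.
            tokens.extend(special_cases[key])
        elif len(chunk) > 1 and chunk[-1] in ".,!?":
            tokens.extend([chunk[:-1], chunk[-1]])
        else:
            tokens.append(chunk)
    return tokens


tokens = tokenize("I don't like rain, snow etc.", SPECIAL_CASES)
print(tokens)
```

Without the `etc.` entry, the generic rule would split the abbreviation into `etc` and `.`, which is the kind of error that per-language exception lists are meant to prevent.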