Introducing Holmes 4.0A few weeks ago we released version 4.0 of Holmes, which we are now able to offer under a permissive MIT license. Holmes is a library in the spaCy Universe that runs on top of spaCy and enables information extraction and intelligent search, currently for English and German. Holmes goes beyond simple matching algorithms and allows you to look for a specified idea or ideas in a corpus of documents.
Spancat: a new approach for span labelingThe SpanCategorizer is a spaCy component that answers the NLP community's need to have structured annotation for a wide variety of labeled spans, including long phrases, non-named entities, or overlapping annotations. In this blog post, we're excited to talk more about spancat and showcase new features to help with your span labeling needs!
Solutions for Advanced NLP for Diverse LanguagesNew Languages for NLP KeynoteThis talk discusses spaCy’s philosophy for modern NLP, its extensible design and new recent features to enable the development of advanced natural language processing pipelines for typologically diverse languages.
Compact word vectors with Bloom embeddingsAn introduction to the compact word vectors with Bloom embeddings used in Thinc, spaCy and floret.
Applied Language TechnologyExtensive online course on applied language technology with spaCy by Tuomo Hiippala, designed for students new to NLP and programming.
Explosion in 2021: Our Year in ReviewThe year 2021 is coming to an end, and like the previous year, it was shaped by unique challenges that impacted our work together. For Explosion, it was a very productive year. We found an investor that fits our strategy, the work on Prodigy Teams is in full swing, and the team has grown a lot. So here's our look back at our highlights of the year 2021.
Talking sense: using machine learning to understand quotesThe Guardian BlogHow the Guardian uses spaCy and Prodigy to train a machine learning model that helps extract quotes from news articles and match them to the correct source.
We’ve sold 5% of ExplosionSince founding Explosion in 2016, we’ve run the company as a profitable business and we decided to only consider external investment if we could find a deal that wouldn’t compromise the direction or stability of the company. We’re pleased to announce that we’ve found an investment that ticks all the boxes.
✨ prodigy v1.11.0Aug 12, 2020spaCy v3 support, annotation for overlapping and nested spans, better installation & more
Applied NLP Thinking: How to Translate Problems into SolutionsWe’ve been running Explosion for about five years now, which has given us a lot of insights into what Natural Language Processing looks like in industry contexts. In this blog post, I’m going to discuss some of the biggest challenges for applied NLP and translating business problems into machine learning solutions.
Corpus-Level Evaluation for Event QA: The IndiaPoliceEvents Corpus Covering the 2002 Gujarat ViolenceHalterman, Keith, Sarwar, O’Connor (2021), ACL 2021Figure A2 shows a stylized version of the custom interface we built using the Prodigy annotation tool. Annotators are presented with an entire document, with sentences sequentially highlighted.
How We Found Pricey Provisions in New Jersey Police ContractsProPublicaProPublica and the Asbury Park Press scoured hundreds of police union agreements for details on publicly funded payouts to cops, using spaCy under the hood.
Introducing spaCy v3.0spaCy v3.0 is a huge release! It features new transformer-based pipelines that get spaCy's accuracy right up to the current state-of-the-art, and a new workflow system to help you take projects from prototype to production. It's much easier to configure and train your pipeline, and there are lots of new and improved integrations with the rest of the NLP ecosystem.
How We Analyzed Google’s Search ResultsThe MarkupUsing the Prodigy annotation tool, we created a user interface and a coder manual for two annotators to spot-check 741 stained images randomly sampled from our dataset.
Introducing spaCy v3.4spaCy v3.4 brings typing and speed improvements along with new vectors for English CNN pipelines and new trained pipelines for Croatian.
Introducing spaCy v3.3spaCy v3.3 improves the speed of core pipeline components, adds a new trainable lemmatizer, and introduces trained pipelines for Finnish, Korean and Swedish.
Automated Identification of Clinical Procedures in Free-Text Electronic Clinical Records with a Low-Code Named Entity Recognition WorkflowMacri, Teoh, Bacchi, Sun, Selva, Casson, Chan (2022), Methods of Information in MedicineThe use of a low-code annotation software tool [Prodigy] allows the rapid creation of a custom annotation dataset to train a NER model to identify clinical procedures stored in free-text electronic clinical notes.
Introducing spaCy Tailored PipelinesExplosion is pleased to announce a new development services offering, spaCy Tailored Pipelines. We’ll build you a custom natural language processing pipeline, delivered in a standardized format using spaCy’s projects system.
Healthsea: an end-to-end spaCy pipeline for exploring health supplement effectsCreate better access to health with machine learning and natural language processing. Read about our journey of developing Healthsea, an end-to-end spaCy pipeline for analyzing user reviews to supplement products and extracting potential effects on health.
spaCy v3's project and config systems are pretty greatThe road to production has become increasingly harder. Machine Learning Engineers who turn prototypes into production-ready software face difficulties with the lack of tooling and best-practices. spaCy v3, with its configuration and project system, introduced a way to solve this problem. Here's my take on how it works, and how it can ramp-up your team!
Reproducible spaCy NLP Experiments with Weights & BiasesWeights & Biases BlogThis tutorial will show how to add Weights & Biases to any spaCy NLP project to track your experiments, save model checkpoints, and version your datasets.
spaCy v3: Custom trainable relation extraction componentspaCy v3.0 features new transformer-based pipelines that get spaCy’s accuracy right up to the current state-of-the-art, and a new training config and workflow system to help you take projects from prototype to production. In this video, Sofie shows you how to apply all these new features when implementing a custom trainable component from scratch.
The Physical Traits that Define Men and Women in LiteratureThe PuddingAnalysis of physical traits most tied to gender in literature using spaCy.
Bulk Labelling and ProdigyIn this video, we’ll show a bulk labelling technique that can help you prepare data for Prodigy.
Finding Bad Labels for Text Classification with Jupyter and Prodigy In this video, we’ll show you how to use set up Prodigy to find bad labels in text classification tasks. While many of the techniques are applied to text classification, they can also be used for classification tasks in general.
How we built a Stack Overflow Community questions analyzerGitLab BlogHow GitLab used spaCy to analyze and better understand Stack Overflow community questions about their tools and products.
Finding Duplicates in Tabular Data with Jupyter and ProdigyIn this video, we’ll show you how to use Prodigy to train a named entity recognition model from scratch, by taking advantage of semi-automatic annotation and modern transfer learning techniques.
When Women Make HeadlinesThe PuddingUsing spaCy and other packages from the NLP ecosystem for analyzing more than 382,000 headlines to see how women are represented (or misrepresented) in the news.
Universal Dependencies v2.5 Benchmarks for spaCyWe present Universal Dependencies v2.5 benchmarks for spaCy v3.2 that show the competitive performance of spaCy in a direct comparison with Stanza and Trankit using the end-to-end evaluation from the CoNLL 2018 Shared Task.
Introducing spaCy v3.2spaCy v3.2 features usability improvements for custom training and scoring, improved performance and support for floret, our new fastText word vectors algorithm.
🌸 floret v0.10.0Oct 27, 2021fastText + Bloom embeddings for compact, full-coverage vectors with spaCy
Welcome spaCy to the Hugging Face HubHugging Face BlogHugging Face makes it really easy to share your spaCy pipelines with the community! With a single command, you can upload any pipeline package, with a pretty model card and all required metadata auto-generated for you.
What does “real-world NLP” look like and how can students get ready for it?Teaching NLP at NAACL Keynote
spaCy v3: Design concepts explained (behind the scenes)In this video, Ines shows you some of the new design concepts and explain what’s going on under the hood, how we’ve implemented them and most importantly, why.
Explosion in 2020: Our Year in ReviewWhile 2020 hasn’t been easy for anyone, at Explosion we’ve considered ourselves relatively fortunate in this most interesting year. We’ve always worked remotely, so we’ve been able to take both pride and comfort in continuing to ship good software. Here’s a look back at what we’ve been up to.
Introducing Span Categorization in Prodigy and spaCyIn this video, we’ll show you how to use Prodigy for spaCy’s Span Categorizer. We’ll be annotating food recipes and looking into ways to help with consistent annotations and speed up the process with patterns and temporary models.
Diary of a spaCy project: Predicting GitHub TagsMany people assume that working on an NLP project involves a lot of machine learning. Our experience is that it's much less about flowing tensors, and more about making a tailored solution. This blogposts demonstrates how a typical spaCy project could be initiated, implemented and executed towards a custom solution.
Finding Bad Image Data using UMAP and ProdigyIn this video, we’ll show you how to use Prodigy to find bad examples in the Google QuickDraw dataset. We will be leveraging a technique that involves UMAP to find strange images semi-automatically.
Neural edit-tree lemmatization for spaCyWe are happy to introduce a new, experimental, machine learning-based lemmatizer that posts accuracies above 95% for many languages. This lemmatizer learns to predict lemmatization rules from a corpus of examples and removes the need to write an exhaustive set of per-language lemmatization rules.
🛸 spacy-transformers v1.1.0Oct 18, 2021Better serialization, full ModelOutput, mixed-precision training and more
Introducing spaCy v3.1It’s been great to see the adoption of spaCy v3, which introduced transformer-based pipelines, a new training system and more. Version 3.1 adds more on top of it, including the ability to use predicted annotations during training, a component for predicting arbitrary and overlapping spans and new pipelines for Catalan and Danish.