Best Way to OCR a PDF in Python (Python Tutorials for Digital Humanities). Tutorial by WJB Mattingly on how to use the new spaCy Layout package and Docling to convert PDFs to text.
Prodigy Dashboard Plugin. The new dashboard plugin adds a web application for managing annotations, data analytics and annotation progress, and is now available for early beta testing.
Newsletter December 2024. This newsletter includes updates to our recent work on PDF processing, the latest in-depth blog post, support for tabular data and workflows for PDF annotation.
Newsletter November 2024. This newsletter features our latest releases for processing PDFs, Word documents, scans and other formats, a new library for converting PDFs to structured data with spaCy and multi-page document annotation.
Accelerate your Career with Open-Source AI (dotAI). Panel discussion about making a career out of open-source software, featuring Gael Varoquaux (scikit-learn), Steeve Morin (ZML) and Ines.
Taking LLMs out of the black box: A practical guide to human-in-the-loop distillation (InfoQ Dev Summit). LLMs have enormous potential, but also challenge existing workflows in industry that require modularity, transparency and data privacy. In this talk, Ines shows some practical solutions for using the latest models in real-world applications and distilling their knowledge into smaller and faster components that you can run and maintain in-house.
How GitLab uses spaCy to analyze support tickets and empower their community. A case study on GitLab’s large-scale NLP pipelines for extracting actionable insights from support tickets and usage questions.
The NLP and AI Revolution with the spaCy Creators (Vanishing Gradients). In this interview with Hugo Bowne-Anderson, we delve into the forefront of NLP and the future of AI development, covering topics like human-in-the-loop distillation, open-source AI and Explosion’s journey.
A practical guide to human-in-the-loop distillation. This blog post presents practical solutions for using the latest state-of-the-art models in real-world applications and distilling their knowledge into smaller and faster components that you can run and maintain in-house.
How to uncover and avoid structural biases in evaluating your Machine Learning/NLP projects (PyData London). This talk highlights common pitfalls that occur when evaluating ML and NLP approaches. It provides comprehensive advice on how to set up a solid evaluation procedure in general, and dives into a few specific use cases to demonstrate artificial bias that can creep in unnoticed.
spaCy meets LLMs: Using Generative AI for Structured Data (Data+ML Community Meetup). This talk dives deeper into spaCy’s LLM integration, which provides a robust framework for extracting structured information from text, distilling large models into smaller components, and closing the gap between prototype and production.
Getting Started with NLP and spaCy (TalkPython Course). There is a lot of text data out there and maybe you're interested in getting structured data out of it. There are many options for doing so, and this course will introduce you to the field by focussing on spaCy while also exploring other tools.
The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs (PyCon DE & PyData Berlin). With the latest advancements in NLP and LLMs, and big companies like OpenAI dominating the space, many people wonder: Are we heading further into a black box era with larger and larger models, obscured behind APIs controlled by big tech monopolies?
The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs (PyCon Lithuania Keynote). With the latest advancements in NLP and LLMs, and big companies like OpenAI dominating the space, many people wonder: Are we heading further into a black box era with larger and larger models, obscured behind APIs controlled by big tech monopolies?
Streaming spaCy. Join spaCy author and core developer Matt as he works on the library, develops features and fixes bugs, while chatting about all things NLP and open source. Every Thursday at 2pm CET and Friday at 11am CET.
PyLadies entrepreneurs and career development (PyLadiesCon). Panel discussion about career challenges and starting your own business with Cheuk Ting Ho, Tereza Iofciu, Anwesha Das, Una Galyeva and Ines.
Distill Your LLMs and Surpass Their Performance (InfoQ Magazine). In her presentation at InfoQ Dev Summit, Ines Montani provided the audience with practical solutions for using the latest state-of-the-art models in real-world applications and distilling their knowledge into smaller and faster components.
Newsletter September 2024. The latest edition of our newsletter features recent talks, blog posts and interviews, plus real-world examples of practical, applied NLP with LLMs and Generative AI.
Szczecin stolicą programowania (TVP3 Szczecin). News segment about EuroSciPy 2024 on local Polish television, featuring Ines’ talk and interviews with the organizers.
spaCy Chunks v0.0.2. spaCy extension and pipeline component for generating overlapping chunks of sentences or tokens from a document.
Happy 10th Birthday, spaCy! 10 years ago today Matt pushed the first commit to spaCy. Since then, the library has evolved as the field moved forward, but also stayed true to its core mission: industrial-strength NLP.
Newsletter June 2024. The latest edition of our newsletter, featuring real-world examples of NLP, how to distill LLMs into smaller & faster components and why there’s no need to compromise on best practices and privacy.
Taking LLMs out of the black box: A practical guide to human-in-the-loop distillation (PyData London). LLMs have enormous potential, but also challenge existing workflows in industry that require modularity, transparency and data privacy. In this talk, Ines shows some practical solutions for using the latest models in real-world applications and distilling their knowledge into smaller and faster components that you can run and maintain in-house.
The AI Revolution Won’t Be Monopolized (TalkPython Podcast). There hasn’t been a boom like the AI boom since the .com days. And it may look like a space destined to be controlled by a couple of tech giants. But Ines Montani thinks open source will play an important role in the future of AI.
The application of natural language processing for the extraction of mechanistic information in toxicology. Conradi, Luechtefeld, de Haan, Pieters, Freedman, Vanhaecke, Vinken, Teunis (2024). All steps were conducted using the open-source Python package spaCy. Specifically, the NER model was trained using scispaCy en-core-sci-lg (Neumann et al., 2019) as a starting point, which allowed for a vocabulary (word vectors) and grammar trained on scientific literature.
The AI Revolution Will Not Be Monopolized: Behind the scenes (Open Source ML Mixer). A more in-depth look at the concepts and ideas, academic literature, related experiments and preliminary results for distilled task-specific models.
Cracking the Code: How to Start a Career in AI (Welcome to the Jungle). Short video interview with Ines about the 4 skills job hunters can cultivate for a career in artificial intelligence.
Recognising non-named spatial entities in literary texts: a novel spatial entities classifier. Kababgi, Grisot, Pennino, Herrmann (2024). In this paper, we present a case study on the prediction of what we call ‘non-named spatial entities’ (NNSE) in a historical corpus of Swiss-German novels using a deep learning model in conjunction with BERT and Prodigy.
Newsletter October 2024. Our latest newsletter features blog posts and talks, spaCy’s long-awaited feature for maintaining consistent memory usage in long-running services and an exclusive Prodigy discount.
✨ prodigy v1.16.0 (Oct 22, 2024). Modal plugin for on-demand deployment, cross-platform wheels and UI fixes.
Applied NLP with LLMs: Beyond Black-Box Monoliths (PyBerlin). In this talk, Ines shows some practical solutions for using the latest state-of-the-art models in real-world applications and distilling their knowledge into smaller and faster components.
Applied NLP in the Age of Generative AI (PyData Amsterdam Keynote). In this talk, Ines shares the most important lessons we’ve learned from solving real-world information extraction problems in industry, and shows you a new approach and mindset for designing robust and modular NLP pipelines in the age of Generative AI.
10 Years of Open Source: Navigating the Next AI Revolution (EuroSciPy Keynote). In this talk, Ines shares the most important lessons we’ve learned in 10 years of working on open-source software, our core philosophies that helped us adapt to an ever-changing AI landscape and why open source and interoperability still win over black-box, proprietary APIs.
Practical Tips for Bootstrapping Information Extraction Pipelines (DataHack Summit). This talk presents approaches for bootstrapping NLP pipelines and retrieval via information extraction, including tips for training, modelling and data annotation.
The AI Revolution Will Not Be Monopolized (InfoQ). Open-source initiatives are pivotal in democratizing AI technology, offering transparent, extensible tools that empower users. Daniel Dominguez summarizes the key takeaways from Ines’ recent talk for InfoQ.
How S&P Global is making markets more transparent with NLP, spaCy and Prodigy. A case study on S&P Global’s efficient information extraction pipelines for real-time commodities trading insights in a high-security environment.
Simply Simplify Language. Interactive app by the Canton of Zurich, Switzerland, using LLMs and spaCy to analyze and simplify institutional communication and make bureaucratic German more inclusive.
KI – Die künstlerische Intelligenz? (Immergut Festival, in German). Panelists discuss the latest developments in Generative AI, hype vs. reality and what those new technologies mean for people, businesses, art, creativity and the music industry.
Economies of Scale Can’t Monopolise the AI Revolution (InfoQ Magazine). During her presentation at QCon London, Ines Montani stated that economies of scale are not enough to create monopolies in the AI space and that open-source techniques and models will allow everybody to keep up with the “Gen AI revolution”.
Ines Montani on Natural Language Processing (Software Engineering Radio). Ines speaks with host Jeremy Jung about solving problems using natural language processing. They cover generative vs. predictive tasks, creating a pipeline and breaking down problems, labeling examples for training, fine-tuning models, using LLMs to label data and build prototypes, and the spaCy NLP library.
spaCy Natural Language Processing: From Beginner to Advanced. Guan Wang, Xiaoquan Kong (2024). The first Chinese-language book on spaCy for beginners and experienced practitioners, covering traditional NLP techniques and how to leverage LLMs for various NLP tasks.
From PDFs to AI-ready structured data: a deep dive. This blog post presents a new modular workflow for converting PDFs and similar documents to structured data and shows you how to build end-to-end document understanding and information extraction pipelines for industry use cases.
✨ prodigy v1.17.0 (Nov 18, 2024). Pages UI for multi-page tasks like longer documents, PDFs or collections of images.
Serverless custom NLP with LLMs, Modal and Prodigy. In this blog post, we’ll show you how you can go from an idea and little data to a fully custom information extraction model using Prodigy and Modal, no infrastructure or GPU setup required.
The 100 who are shaping AI in Europe. Ines is featured among the top 100 individuals who are shaping Artificial Intelligence in Europe, compiled by French newspaper l’Opinion.
Combining the Best of Two Worlds: From TF-IDF to Llama LLM (Open Source Summit Europe). Talk by William Arias, Staff Developer Advocate at GitLab, on combining traditional NLP techniques and LLMs to solve hallucination issues and create robust spaCy applications.
Assessing Fine-Tuned NER Models with Limited Data in French: Automating Detection of New Technologies, Technological Domains, and Startup Names in Renewable Energy. MacLean, Cavallucci (2024). In order to assure the uniformity of the process of fine-tuning each model, we decided to use the spaCy library. This library, one of the most widely used for NLP tasks, allows us to directly modify a simple configuration file in order to define the model.
Back to our roots: Company update and future plans. We’re back to running Explosion as a smaller, independent-minded and self-sufficient company. spaCy and Prodigy will stay stable and sustainable, maintained by their original authors. We’ll keep updating our stack with the latest technologies, without changing its core identity or purpose.
Once a Maintainer: Sofie Van Landeghem. Interview with Sofie about her work as a core maintainer of spaCy, the evolution of NLP, and why dependency management in Python is so terrible.
Exploring the AI nexus with the mind behind spaCy (Leading With Data Podcast). In this episode, Matt takes you on a deep dive into the future of data and the challenges facing current Large Language Models (LLMs).
Towards Structured Data: LLMs from Prototype to Production (U.S. Census Bureau: Center for Optimization and Data Science Seminar). This talk presents pragmatic and practical approaches for how to use LLMs beyond just chat bots, how to ship more successful NLP projects from prototype to production and how to use the latest state-of-the-art models in real-world applications.
ZenML v0.58.0. New out-of-the-box Prodigy integration in ZenML for LLMs and beyond, to make data development and annotation a core part of your MLOps lifecycle.
spaCyEx v0.0.2. Extension for spaCy’s powerful, linguistically-aware pattern matching that introduces a RegEx-like syntax.
The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs (QCon London).