A few weeks ago we released version 4.0 of Holmes, which we are now able to offer under a permissive MIT license. Holmes is a library in the spaCy Universe that runs on top of spaCy and enables information extraction and intelligent search, currently for English and German. Holmes goes beyond simple matching algorithms and allows you to look for a specified idea or ideas in a corpus of documents.
Holmes offers two main search mechanisms. The first, structural matching, aims to find text snippets in a corpus that express a given idea exactly and is useful for extracting structured information, for example into a relational database. The second, topic matching, is fuzzier and forms the basis for a real-time search machine. Structural matching is the more fundamental of the two mechanisms, so I shall explain it first, and then go on to discuss topic matching, which builds upon it.
1. Structural matching
1.1 Recognizing different ways of saying the same thing
Tools like spaCy’s Matcher are an effective way of performing information extraction: the Matcher lets you specify both lexical and grammatical features with which to find phrases within a large body of documents. However, capturing a single idea typically requires many rules because the same thing can be said in a variety of ways (see Figure 1). The variation occurs on two levels. On the one hand, the four examples have different surface grammatical structures; on the other hand, groups of words like acquire, buy and take over are used synonymously, and specific instances of entities like companies have names like MaxLinear and Datto.
The aim of Holmes structural matching is to abstract away both these types of variation so that the user can concentrate on the information they want to extract. You can tell Holmes that a company takes over a company is the idea you are looking for, specify strategies for recognizing company names and synonyms of take over, and leave it to the library to find complex examples without having to write a large number of extra rules.
1.2 Deriving the meanings of sentence structures
The grammatical relationships between the words within a phrase determine how the individual meanings of those words combine to form an overall meaning for the phrase. The rules that drive this apply across any phrases that share a given structure, regardless of the specific words involved (see Figure 2). Central to Holmes are rules that transform the syntactic surface structures output by the standard spaCy models into corresponding underlying semantic structures. Unlike in a typical rule-based system, where rules are developed for a specific task and handle words and phrases specific to that task, these rules, which we refer to as meta-rules:
- describe the basic grammatical and semantic structures of a language
- are valid for any task involving texts written in that language
- are maintained as a standard, static part of the core library
For example, the meta-rules required to derive the correct semantic structure from the sentences in the Structure 2 row of Figure 2 would handle recognizing the passive construction is … by and assigning the correct semantic roles to the arguments of passive verbs, while the meta-rules required for the Structure 3 and Structure 4 rows would process compound words formed from nouns and participles.
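To make the idea of a meta-rule more concrete, here is a minimal, hypothetical sketch of a rule that normalizes a passive construction. Holmes’s real meta-rules operate on spaCy token graphs; this sketch instead represents a sentence’s structure as (head, relation, child) triples, and the function name and relation labels are illustrative assumptions rather than actual Holmes internals:

```python
# Hypothetical sketch of a passive-handling meta-rule: it rewrites the
# passive surface dependency relations onto the same predicate-argument
# structure as the corresponding active sentence.

def passive_meta_rule(triples):
    """Map passive 'nsubjpass'/'agent' relations onto active semantic roles."""
    result = []
    for head, rel, child in triples:
        if rel == "nsubjpass":      # passive subject -> semantic object
            result.append((head, "obj", child))
        elif rel == "agent":        # 'by'-agent -> semantic subject
            result.append((head, "subj", child))
        else:
            result.append((head, rel, child))
    return result

# "A company takes over company2" vs "Company2 is taken over by a company":
active = [("take", "subj", "company"), ("take", "obj", "company2")]
passive = [("take", "nsubjpass", "company2"), ("take", "agent", "company")]

# After applying the meta-rule, both sentences yield the same structure:
print(sorted(passive_meta_rule(passive)) == sorted(active))  # -> True
```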
Figure 3 shows four sentences with different grammatical forms that all emerge with a common semantic structure. Note that in each case only the lexical words — the words that carry independent meaning — survive the transformations to make it into the semantic structure, while grammatical words like a and the are abstracted away.
While the example in Figure 3 involves a single first-order predication, most sentences in real texts use strategies like subordination and relative clauses to express higher-order logical structures. A word can simultaneously serve as the argument of one predication and as the predicate of another predication (give in Figure 4). And linguistic phenomena like relative clauses (Figure 5) and control (Figure 6) can give rise to semantic graph structures that are not trees. The fact that mainstream parsing algorithms are designed to generate trees is one reason why we rely on the combination of standard spaCy models and meta-rules to generate Holmes semantic structures, as opposed to attempting to train a model to produce them directly from raw text.
1.3 Matching search phrases to documents
If the semantic structures Holmes uses are best represented as graphs, how is the user of the library to specify what they want to search for in a simple fashion? The most convenient solution mirrors the way in which the documents to be searched are themselves processed: the user writes simple sentences or phrases from which semantic structures are derived consisting of the relevant lexical words linked by the correct semantic relationships. These search phrases are best imagined as templates that match corresponding ideas at all points in the document corpus.
A search phrase has to be based on a single sentence, which corresponds to a single spaCy dependency tree with a single word at its root. Whenever a new search phrase is registered, Holmes generates a list of all the words that could match its root word; and whenever a new document is added to the corpus, Holmes adds all the words the document contains to an inverted index. Matching a search phrase begins with querying the inverted index to find all words in the document corpus that match that search phrase’s root word. For each such word, subgraph matching is then performed to check whether the structure surrounding it also corresponds to the rest of the search phrase.
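The two-stage process described above can be sketched in a few lines. This is an illustrative toy, not Holmes’s actual implementation: an inverted index maps each word to the documents containing it, and matching a search phrase begins by looking up the documents that contain its root word (in real Holmes, subgraph matching would then check the surrounding structure):

```python
# Toy sketch of root-word candidate retrieval via an inverted index.
from collections import defaultdict

index = defaultdict(set)

def register_document(label, words):
    """Add every word of a document to the inverted index."""
    for word in words:
        index[word].add(label)

register_document("doc1", ["company", "acquire", "rival"])
register_document("doc2", ["dog", "chase", "cat"])

# Candidate documents for a search phrase whose root word is "acquire";
# only these documents would then undergo subgraph matching:
print(index["acquire"])  # -> {'doc1'}
```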
1.4 Matching the meanings of individual words
Holmes always matches different forms of the same word (e.g. company matches companies; go matches went). However, the real power of the library rests in its ability to combine semantic subgraph matching with four further strategies that greatly widen the scope of matching between individual words in a search phrase and individual words in the documents being searched:
- Derivation-based matching: Holmes can match related words that share the same stem (e.g. inform matches information). Making use of such relationships requires meta-rules that capture correspondences in meaning between the various grammatical structures that can surround the two word classes involved. For example, a lawyer accuses should match an accusation by a lawyer, but not an accusation of a lawyer. Note that it is possible for dependencies that point in opposite directions to match, e.g. adopting a resolution (where the dependency is headed by adopting) matches the adopted resolution (where the dependency is headed by resolution).
- Entity-based matching: the spaCy models recognize and label tokens that belong to named-entity classes such as people, companies and places. Holmes allows named-entity classes to be specified within search phrases using placeholders of the form ENTITY<label> where <label> denotes an entity label that the spaCy model has assigned to tokens or spans within documents being searched. For example, the search phrase An ENTITYPERSON visits an ENTITYGPE will match the document sentence Richard Hudson visited Berlin. Once you have installed Holmes, you can try this out yourself by entering or copying the Python code below.
import holmes_extractor as holmes

manager = holmes.Manager("en_core_web_trf", number_of_workers=1)
manager.register_search_phrase("An ENTITYPERSON visits an ENTITYGPE")
manager.parse_and_register_document("Richard Hudson visited Berlin")
print(manager.match())
- Ontology-based matching: an ontology captures such relationships between words as synonymy (two words mean the same, e.g. dog and hound), hyponymy (a word is a specific type of another word, e.g. puppy is a type of dog) and class membership (e.g. Fido is a named individual of the class dog). You can supply Holmes with an externally hand-crafted ontology that captures relationships between terms in the problem domain you are extracting information about, and Holmes will take it into account when matching search phrases: the range of words that match a given search-phrase word within searched documents is then extended to include the subtree formed by synonyms, hyponyms and named individuals of that word. Hyponym and synonym relationships are transitive, and such a subtree includes hyponyms of hyponyms, hyponyms of hyponyms of hyponyms, synonyms of hyponyms, and so forth. For example, with the ontology in Figure 8, the search phrase An animal yawns would match the document phrase The puppy yawned.
- Embedding-based matching: in the standard spaCy models, each word in a language is associated with a word embedding — a multidimensional vector representation of the word’s meaning derived from the various contexts in which the word occurs within the corpus that was used to train the model. The similarity between the meaning of two words can be approximated by measuring the angle between their embeddings: using this technique, spaCy estimates the similarity of the words dog and puppy at 85.9%; the similarity of the words dog and horse at 62.5%; and the similarity of the words dog and pencil at 20.8%. Holmes allows you to stipulate that a search phrase should match a passage in the document corpus whenever the average embedding similarity of the pairs within a potential match is above a configurable threshold.
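The similarity figures just quoted come from the cosine of the angle between word embeddings. As a minimal illustration of the arithmetic involved, here is the cosine-similarity calculation applied to made-up three-dimensional vectors (real spaCy vectors have hundreds of dimensions, and the values below are invented for the example):

```python
# Cosine similarity: the dot product of two vectors divided by the
# product of their lengths. Values near 1.0 indicate similar directions
# (and hence, for word embeddings, similar meanings).
import math

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Invented toy vectors: "dog" and "puppy" point in similar directions,
# "pencil" does not.
dog = [0.9, 0.1, 0.2]
puppy = [0.8, 0.2, 0.25]
pencil = [0.05, 0.9, 0.1]

print(cosine_similarity(dog, puppy))   # close to 1.0
print(cosine_similarity(dog, pencil))  # much lower
```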
1.5 Trying out structural matching
The four sentences in the example in Figure 1 can be matched by a single search phrase using the entity-based matching strategy to find the company names and the ontology-based matching strategy to capture the synonyms of take over:
- Download Protégé and create an ontology defining purchase, acquire, buy, take over and takeover as a group of synonyms. Note that this is achieved simply by choosing one of the words and defining the other four as equivalent to it; Holmes will then infer the remaining synonym relationships within the group. Save the ontology with format RDF/XML and file name holmes_test.owl. Alternatively, download holmes_test.owl here.
- Install Holmes and enter or copy the Python code below.
# Setup
import holmes_extractor as holmes

ontology = holmes.Ontology("holmes_test.owl")
manager = holmes.Manager("en_core_web_trf", ontology=ontology,
                         number_of_workers=1)

# Register search phrase
manager.register_search_phrase("An ENTITYORG takes over an ENTITYORG")

# Parse documents
manager.parse_and_register_document(
    "Royal Bank of Scotland announces it intends to acquire Brewin Dolphin", "1")
manager.parse_and_register_document(
    "Chipmaker MaxLinear Inc announced on Thursday it will buy Silicon Motion "
    "Technology Corp for nearly $4 billion.", "2")
manager.parse_and_register_document(
    "Last month, cybersecurity company Mandiant was purchased by Alphabet", "3")
manager.parse_and_register_document(
    "The Datto takeover by the company Kaseya", "4")

# Perform matching
matches = manager.match()

# Check all documents matched
print(len(matches))  # -> 4

# The word matches within each match follow the order of the words in the
# search phrase, so the first entry corresponds to the first ENTITYORG (the
# company doing the taking over) and the last entry to the second ENTITYORG
# (the company being taken over).

# Extract companies doing the taking over
print([match['word_matches'][0]['document_phrase'] for match in matches])
# -> ['Royal Bank', 'Chipmaker MaxLinear Inc', 'Alphabet', 'Kaseya']

# Extract companies being taken over
print([match['word_matches'][-1]['document_phrase'] for match in matches])
# -> ['Brewin Dolphin', 'Silicon Motion Technology Corp', 'cybersecurity company Mandiant', 'Datto']
2. Topic matching
2.1 Capturing fuzzy meanings with phraselets
Up until now, we have been looking at structural matching: search phrases have been used to retrieve passages in a document corpus that exactly share their semantic structures. This mechanism prioritizes precision over recall: the larger a search phrase gets, the less likely it is that any passage in the document corpus will match it. The second search mechanism that Holmes offers, topic matching, instead prioritizes recall over precision and focuses on what documents are about rather than exactly what they say.
We saw that structural matching against the document corpus uses each search phrase in its entirety as a single semantic graph. Topic matching, on the other hand, is based on a search query that is also converted into one or more semantic structures, but Holmes goes on to derive from these semantic structures mini-search-phrases, each containing one or two lexical words, that we call phraselets (Figure 9). Each of these phraselets is matched individually against the document corpus and the results are collated: a topic match is a conglomeration of phraselet matches.
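The derivation of phraselets from a query’s semantic structure can be sketched as follows. This is a hypothetical illustration, not Holmes’s actual phraselet machinery: real Holmes works on spaCy/Holmes semantic graphs, whereas here the query’s structure is given as simple (head, relation, child) triples:

```python
# Toy sketch of phraselet derivation: from the semantic triples of a
# query, derive single-word phraselets (one per lexical word) and
# two-word phraselets (one per semantic relationship between two
# lexical words).

def derive_phraselets(triples, lexical_words):
    single_word = set(lexical_words)
    two_word = {(head, child) for head, _, child in triples
                if head in lexical_words and child in lexical_words}
    return single_word, two_word

# Semantic structure of "dogs chasing cats":
triples = [("chase", "subj", "dog"), ("chase", "obj", "cat")]
single, pairs = derive_phraselets(triples, {"dog", "chase", "cat"})

print(sorted(single))  # -> ['cat', 'chase', 'dog']
print(sorted(pairs))   # -> [('chase', 'cat'), ('chase', 'dog')]
```

Each derived phraselet would then be matched individually against the corpus, with the results collated into topic matches.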
Whereas the search phrases used in structural matching must conform to several rules and so have to be purposely composed by expert users, topic matching can be driven by any text, which is useful when it is not possible to control the content and quality of the search input.
Situations for which topic matching is an appropriate choice include:
- intelligent search driven by short queries entered in real time
- searching for passages within a document corpus that share the topic of another preexisting document that is used directly as a long query to drive the search
- triggering alerts whenever an information feed contains a post about a specific subject
2.2 Scoring topic matches
Topic matches are scored according to a variety of factors including:
- how close together the phraselet matches that make them up occur within the document
- whether the phraselet matches consist of one or two words, with two-word matches scoring much higher
- whether any two-word phraselet matches overlap
- how rare the matched words are within the corpus
- any uncertainty that results from words having matched by virtue of embedding similarities or ontology relationships rather than directly
A document passage that has yielded a topic match with a high score will be expressing, in close mutual proximity, many of the same semantic relationships as are contained within the query phrase, and so is likely to be presenting a similar idea or at least to be concerned with the same topic.
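As a rough illustration of how such factors could combine, here is a toy scoring function. To be clear, this is not Holmes’s actual scoring formula; the weighting scheme, parameter names and cutoffs are all invented for the example, showing only the two tendencies described above: two-word matches outscore single-word matches, and rarer words score higher:

```python
# Toy phraselet-match score (NOT Holmes's real formula): a base weight
# that favours two-word matches, scaled down as the matched word becomes
# more frequent in the corpus.
import math

def phraselet_match_score(word_count, corpus_frequency, two_word_weight=4.0):
    base = two_word_weight if word_count == 2 else 1.0
    rarity = 1.0 / math.log(corpus_frequency + math.e)
    return base * rarity

# A two-word match of a rare word outscores a one-word match of a
# common word:
print(phraselet_match_score(2, corpus_frequency=10))
print(phraselet_match_score(1, corpus_frequency=10000))
```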
2.3 Ensuring acceptable performance
The topic matching procedure makes use of a number of performance optimizations including:
- starting the matching of two-word phraselets at whichever of the two words is less frequent within the document corpus
- not matching two-word phraselets at all where both words are very frequent within the document corpus
- only attempting embedding-based matching against words that are relatively rare within the document corpus
These optimizations make it feasible to use topic matching for real-time queries against document corpora consisting of millions of words. However, although topic matching employs multiprocessing to make full use of all the processors available on each machine and is otherwise horizontally and vertically scalable, it would still be unlikely to be cost-effective to apply it to a genuinely massive corpus such as those trawled by mainstream online search engines.
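The first two optimizations above can be sketched together in a few lines. This is a hypothetical illustration rather than Holmes internals; in particular, the frequency cutoff is an invented threshold, not a real Holmes setting:

```python
# Toy sketch: plan where to start matching a two-word phraselet. Begin
# at whichever word is rarer in the corpus; skip the phraselet entirely
# when both words are very frequent.

FREQUENCY_CUTOFF = 1000  # invented threshold for the example

def plan_two_word_match(word1, word2, frequencies):
    f1 = frequencies.get(word1, 0)
    f2 = frequencies.get(word2, 0)
    if f1 > FREQUENCY_CUTOFF and f2 > FREQUENCY_CUTOFF:
        return None  # both words too frequent: do not match this phraselet
    return word1 if f1 <= f2 else word2  # start at the rarer word

frequencies = {"take": 5000, "Datto": 3, "company": 8000}
print(plan_two_word_match("take", "Datto", frequencies))    # -> 'Datto'
print(plan_two_word_match("take", "company", frequencies))  # -> None
```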
2.4 Trying out topic matching
Intelligent search based on topic matching is exemplified by a demonstration website. The document corpus used for English consists of six Charles Dickens novels. All matching strategies are active. The search engine uses a very small and simple ontology that captures equivalences between names used for the main characters in the books, e.g. the fact that David, Copperfield and David Copperfield form a synonym group.
You can also try out topic matching yourself: install Holmes and enter or copy the Python code below.
# Setup
import holmes_extractor as holmes

manager = holmes.Manager("en_core_web_trf", number_of_workers=1)

# Parse documents
manager.parse_and_register_document(
    "The dog was thinking about whether he wanted to chase the neighbourhood cat.", "1")
manager.parse_and_register_document(
    "The cat kept chasing around and was hoping she wouldn't see a dog anytime soon.", "2")
manager.parse_and_register_document(
    "The children discussed dogs, cats and chasing", "3")

# Perform topic matching
topic_matches = manager.topic_match_documents_against(
    "Increasingly, his life's work appeared to revolve around watching dogs chasing cats.")

# Print the score for each document
print([tm['document_label'] + ": " + str(tm['score']) for tm in topic_matches])

# Print all the topic match information
print(topic_matches)
We have seen that Holmes is a general, out-of-the-box tool for information extraction and intelligent search within English or German texts. Please take a few moments to try it out or to have a look at the demonstration website. If you like what you see, please share your experience on social media or check out the repository.