Introducing Holmes 4.0

A few weeks ago we released version 4.0 of Holmes, which we are now able to offer under a permissive MIT license. Holmes is a library in the spaCy Universe that runs on top of spaCy and enables information extraction and intelligent search, currently for English and German. Holmes goes beyond simple matching algorithms and allows you to look for a specified idea or ideas in a corpus of documents.

Holmes offers two main search mechanisms. The first, structural matching, aims to find text snippets in a corpus that express a given idea exactly and is useful for extracting structured information, for example into a relational database. The second, topic matching, is fuzzier and forms the basis for a real-time search engine. Structural matching is the more fundamental of the two mechanisms, so I shall explain it first and then go on to discuss topic matching, which builds upon it.

The history of Holmes

I wrote the original version of Holmes while working at msg systems, a large, international IT consultancy with its headquarters near Munich. Holmes was partly based on concepts that were developed at an earlier employer and that are described in a U.S. patent. I now work at Explosion, and the patent is now controlled by AstraZeneca. Thanks to the goodwill and openness of both AstraZeneca and msg systems, we are able to continue maintaining the library at Explosion and to offer it for the first time under a permissive MIT license. This means that people can now use it, and expand on it if they wish, without having to worry about the patent or other legal issues.

1. Structural matching

You tell Holmes the idea you are looking for, specifying a phrase and strategies for recognizing the individual words, and leave it to the library to find complex examples.

1.1 Recognizing different ways of saying the same thing

Tools like spaCy’s Matcher are an effective way of performing information extraction: the Matcher lets you specify both lexical and grammatical features with which to find phrases within a large body of documents. However, this typically requires many rules to capture a single idea because the same thing can be said in many different ways (see Figure 1). The variation occurs on two levels. On the one hand, the four examples have different surface grammatical structures; on the other hand, groups of words like acquire, buy and take over are used synonymously, and specific instances of entities like companies have names like MaxLinear and Datto.

Figure 1: Different headlines announcing company takeovers

The aim of Holmes structural matching is to abstract away both these types of variation so that the user can concentrate on the information they want to extract. You can tell Holmes that a company takes over a company is the idea you are looking for, specify strategies for recognizing company names and synonyms of take over, and leave it to the library to find complex examples without having to write a large number of extra rules.

1.2 Deriving the meanings of sentence structures

The grammatical relationships between the words within a phrase determine how the individual meanings of those words combine to form an overall meaning for the phrase. The rules that drive this apply across any phrases that share a given structure regardless of the specific words involved (see Figure 2). Central to Holmes are rules that transform the syntactic surface structures output by the standard spaCy models into corresponding underlying semantic structures. Unlike in a typical rule-based system where rules are developed for a specific task and handle words and phrases specific to that task, these rules, which we refer to as meta-rules:

describe the basic grammatical and semantic structures of a language
are valid for any task involving texts written in that language
are maintained as a standard, static part of the core library

Figure 2: Parallel grammatical structures

For example, the meta-rules required to derive the correct semantic structure from the sentences in the Structure 2 row of Figure 2 would handle recognizing the passive construction is … by and assigning the correct semantic roles to the arguments of passive verbs, while the meta-rules required for the Structure 3 and Structure 4 rows would process compound words formed from nouns and participles.

Predicate logic

The meaning communicated by any sentence can be captured using predicate logic. For example, the sentence The child gave the dog a bone expresses a first-order predication give(child, dog, bone) linking the predicate give with the arguments child, dog and bone. Actually being able to derive correct logical structures for every sentence in a corpus could be seen as the Holy Grail of natural language understanding: a machine that genuinely understood the meanings of texts would probably be well beyond passing the Turing Test!

Holmes stops far short of such an ambitious goal, instead using its meta-rules to transform syntactic parse trees in such a way that sentences that express identical meanings emerge with matching semantic structures. Meta-rules and the structures they generate, while heavily inspired by predicate logic, are not intended to correspond to any strict formal, logical or linguistic-theoretical representation: they just do whatever enables meanings to be matched effectively.

Figure 3: Different sentences with a common semantic structure

Figure 3 shows four sentences with different grammatical forms that all emerge with a common semantic structure. Note that in each case only the lexical words — the words that carry independent meaning — survive the transformations to make it into the semantic structure, while grammatical words like a and the are abstracted away.
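
To make the notation concrete, here is a minimal sketch of how a first-order predication such as give(child, dog, bone) could be held in Python. The Predication class and the two hand-written structures are illustrative assumptions only: they are not Holmes data structures, and nothing here is derived from text by a parser.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Predication:
    """Hypothetical container for a predicate and its ordered arguments."""
    predicate: str
    arguments: Tuple[str, ...]

    def __str__(self) -> str:
        return f"{self.predicate}({', '.join(self.arguments)})"


# Hand-written structure for "The child gave the dog a bone".
active = Predication("give", ("child", "dog", "bone"))

# Hand-written structure for "A bone was given to the dog by the child".
# Grammatical words (a, the, was, to, by) do not appear in either structure.
passive = Predication("give", ("child", "dog", "bone"))

print(active)             # give(child, dog, bone)
print(active == passive)  # True
```

Because only the lexical words and their roles are stored, the two differently worded sentences compare as equal, which is the property that matching relies on.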

While the example in Figure 3 involves a single first-order predication, most sentences in real texts use strategies like subordination and relative clauses to express higher-order logical structures. A word can simultaneously serve as the argument of one predication and as the predicate of another predication (give in Figure 4). And linguistic phenomena like relative clauses (Figure 5) and control (Figure 6) can give rise to semantic graph structures that are not trees. The fact that mainstream parsing algorithms are designed to generate trees is one reason why we rely on the combination of standard spaCy models and meta-rules to generate Holmes semantic structures, as opposed to attempting to train a model to produce them directly from raw text.

Coreference resolution

As demonstrated by the first two sentences in Figure 1, pronouns like it typically refer back to nouns earlier on in a text. A pronoun reinstantiates the meaning of its antecedent noun, and that meaning forms part of whatever predication(s) contain the pronoun without the noun having to be reexpressed overtly. Holmes resolves pronouns and other anaphors by calling a library called Coreferee behind the scenes. Although Coreferee is used in a variety of projects, it was written to support Holmes and focuses on the types of coreference that are relevant for Holmes. Like Holmes itself, Coreferee came into being at msg systems, is based on spaCy, is listed in the spaCy Universe, and is now maintained at Explosion.

1.3 Matching search phrases to documents

If the semantic structures Holmes uses are best represented as graphs, how is the user of the library to specify what they want to search for in a simple fashion? The most convenient solution mirrors the way in which the documents to be searched are themselves processed: the user writes simple sentences or phrases from which semantic structures are derived consisting of the relevant lexical words linked by the correct semantic relationships. These search phrases are best imagined as templates that match corresponding ideas at all points in the document corpus.

Figure 7: A search phrase matching a document

A search phrase has to be based on a single sentence, which corresponds to a single spaCy dependency tree with a single word at its root. Whenever a new search phrase is registered, Holmes generates a list of all the words that could match its root word; and whenever a new document is added to the corpus, Holmes adds all the words the document contains to an inverted index. Matching a search phrase begins with querying the inverted index to find all words in the document corpus that match that search phrase’s root word. For each such word, subgraph matching is then performed to check whether the structure surrounding it also corresponds to the rest of the search phrase.
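
The two-stage procedure described above (an inverted-index lookup on the root word, followed by a subgraph check) can be sketched with toy data. Everything in this example is an assumption made for illustration: the graph encoding, the relation labels agent and patient, and the function names are not taken from the Holmes source code.

```python
from collections import defaultdict

# Toy semantic graphs: one lemma per token plus labelled edges (parent, relation, child).
documents = {
    "1": {"lemmas": ["dog", "chase", "cat"],
          "edges": {(1, "agent", 0), (1, "patient", 2)}},
    "2": {"lemmas": ["cat", "chase", "mouse"],
          "edges": {(1, "agent", 0), (1, "patient", 2)}},
}

# Search phrase "a dog chases a cat" as a tree rooted at "chase":
# each node is (lemma, [(relation, child_node), ...]).
search_phrase = ("chase", [("agent", ("dog", [])), ("patient", ("cat", []))])

# Inverted index: lemma -> list of (document label, token index).
index = defaultdict(list)
for label, doc in documents.items():
    for position, lemma in enumerate(doc["lemmas"]):
        index[lemma].append((label, position))


def matches_at(doc, token_index, node):
    """Check whether the search-phrase subtree `node` matches at `token_index`."""
    lemma, children = node
    if doc["lemmas"][token_index] != lemma:
        return False
    # Every child of the search-phrase node needs a matching edge in the document.
    return all(
        any(
            parent == token_index
            and relation == wanted_relation
            and matches_at(doc, child_index, child_node)
            for parent, relation, child_index in doc["edges"]
        )
        for wanted_relation, child_node in children
    )


def find_matches(phrase):
    """Look up the root word in the index, then verify the surrounding structure."""
    root_lemma = phrase[0]
    return [
        (label, position)
        for label, position in index.get(root_lemma, [])
        if matches_at(documents[label], position, phrase)
    ]


print(find_matches(search_phrase))  # [('1', 1)]
```

Both toy documents contain the root word chase, so both are candidates after the index lookup, but only document 1 survives the structural check.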

Writing search phrases

Writing effective search phrases can initially seem somewhat counterintuitive. A search phrase is best expressed in grammatically correct language (a dog chases a cat) rather than just as a collection of words (dog chase cat), because although the grammatical words in the search phrase (in this case the two instances of a) will be abstracted away during the process of deriving its semantic structure, including them ensures that the spaCy parser and meta-rules correctly understand what is being expressed.

1.4 Matching the meanings of individual words

Holmes always matches different forms of the same word (e.g. company matches companies; go matches went). However, the real power of the library rests in its ability to combine semantic subgraph matching with four further strategies that greatly widen the scope of matching between individual words in a search phrase and individual words in the documents being searched:

Derivation-based matching: Holmes can match related words that share the same stem (e.g. inform matches information). Making use of such relationships requires meta-rules that capture correspondences in meaning between the various grammatical structures that can surround the two word classes involved. For example, a lawyer accuses should match an accusation by a lawyer, but not an accusation of a lawyer. Note that it is possible for dependencies to match that point in opposite directions, e.g. adopting a resolution (where the dependency is headed by adopting) matches the adopted resolution (where the dependency is headed by resolution).
Entity-based matching: the spaCy models recognize and label tokens that belong to named-entity classes such as people, companies and places. Holmes allows named-entity classes to be specified within search phrases using placeholders of the form ENTITY<label> where <label> denotes an entity label that the spaCy model has assigned to tokens or spans within documents being searched. For example, the search phrase An ENTITYPERSON visits an ENTITYGPE will match the document sentence Richard Hudson visited Berlin. Once you have installed Holmes, you can try this out yourself by entering or copying the Python code below.

import holmes_extractor as holmes
manager = holmes.Manager("en_core_web_trf", number_of_workers=1)
manager.register_search_phrase("An ENTITYPERSON visits an ENTITYGPE")
manager.parse_and_register_document("Richard Hudson visited Berlin")
print(manager.match())

Ontology-based matching: an ontology captures such relationships between words as synonymy (two words mean the same, e.g. dog and hound), hyponymy (a word is a specific type of another word, e.g. puppy is a type of dog) and class membership (e.g. Fido is a named individual of the class dog). You can supply Holmes with an externally hand-crafted ontology that captures relationships between terms in the problem domain you are extracting information about, and Holmes will take it into account when matching search phrases: the range of words that match a given search-phrase word within searched documents is then extended to include the subtree formed by synonyms, hyponyms and named individuals of that word. Hyponym and synonym relationships are transitive, and such a subtree includes hyponyms of hyponyms, hyponyms of hyponyms of hyponyms, synonyms of hyponyms, and so forth. For example, with the ontology in Figure 8, the search phrase An animal yawns would match the document phrase The puppy yawned.

Figure 8: Relationships within an ontology

Embedding-based matching: in the standard spaCy models, each word in a language is associated with a word embedding, that is, a multidimensional vector representation of the word’s meaning derived from the various contexts in which the word occurs within the corpus that was used to train the model. The similarity between the meanings of two words can be approximated by measuring the angle between their embeddings: using this technique, spaCy estimates the similarity of the words dog and puppy at 85.9%; the similarity of the words dog and horse at 62.5%; and the similarity of the words dog and pencil at 20.8%. Holmes allows you to stipulate that a search phrase should match a passage in the document corpus whenever the average embedding similarity of the word pairs within a potential match is above a configurable threshold.
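
As a rough illustration of the embedding-based strategy, the sketch below computes the cosine of the angle between vectors and averages it over the word pairs of a potential match. The two-dimensional vectors are invented for the example (real spaCy vectors have hundreds of dimensions), and the function names and the threshold value are assumptions rather than Holmes settings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy embeddings, not values from any spaCy model.
toy_vectors = {
    "dog": (1.0, 0.0),
    "puppy": (0.8, 0.6),
    "pencil": (0.0, 1.0),
    "chase": (0.6, 0.8),
}


def average_similarity(word_pairs, vectors):
    """Mean similarity over (search-phrase word, document word) pairs."""
    similarities = [
        cosine_similarity(vectors[search_word], vectors[document_word])
        for search_word, document_word in word_pairs
    ]
    return sum(similarities) / len(similarities)


def is_match(word_pairs, vectors, threshold):
    """A potential match is accepted when the average similarity exceeds the threshold."""
    return average_similarity(word_pairs, vectors) > threshold


# Search phrase "a dog chases" against documents about a puppy and a pencil.
print(is_match([("dog", "puppy"), ("chase", "chase")], toy_vectors, 0.85))
print(is_match([("dog", "pencil"), ("chase", "chase")], toy_vectors, 0.85))
```

With these toy vectors the first potential match averages about 0.9 and is accepted, while the second averages about 0.5 and is rejected.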

Building ontologies

Ontologies are stored in a standard OWL format and can be created and managed using the excellent open-source tool Protégé. Note however that you only need to download Protégé if you want to try out ontology-based matching with a different ontology from the one used in the example, and that it is quite possible to use Holmes without any ontology at all.
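
The transitive expansion used by ontology-based matching can be sketched as a simple graph traversal. The toy ontology below is hand-written for this example (it is not the content of Figure 8 and is not read from an OWL file), and the function name expand is an assumption.

```python
# Hypothetical toy ontology expressed as plain dictionaries.
hyponyms = {"animal": {"dog", "cat"}, "dog": {"puppy"}}
synonyms = {"dog": {"hound"}, "hound": {"dog"}}
individuals = {"dog": {"Fido"}}


def expand(word):
    """Return every term that a search-phrase word would match in this toy ontology.

    Hyponym and synonym links are followed transitively; named individuals
    are collected for every class reached along the way.
    """
    matched = {word}
    pending = [word]
    while pending:
        current = pending.pop()
        related = hyponyms.get(current, set()) | synonyms.get(current, set())
        for term in related:
            if term not in matched:
                matched.add(term)
                pending.append(term)
        matched |= individuals.get(current, set())
    return matched


print(sorted(expand("animal")))  # ['Fido', 'animal', 'cat', 'dog', 'hound', 'puppy']
print(sorted(expand("puppy")))   # ['puppy']
```

The expansion runs downwards only: animal in a search phrase matches puppy in a document, but puppy in a search phrase does not match animal.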

1.5 Trying out structural matching

The four sentences in the example in Figure 1 can be matched by a single search phrase using the entity-based matching strategy to find the company names and the ontology-based matching strategy to capture the synonyms of take over:

Download Protégé and create an ontology defining purchase, acquire, buy, take over and takeover as a group of synonyms. Note that this is achieved simply by choosing one of the words and defining the other four as equivalent to it; Holmes will then infer the remaining synonym relationships within the group. Save the ontology in RDF/XML format with the file name holmes_test.owl. Alternatively, download holmes_test.owl here.
Install Holmes and enter or copy the Python code below.

# Setup
import holmes_extractor as holmes
ontology = holmes.Ontology("holmes_test.owl")
manager = holmes.Manager("en_core_web_trf", ontology=ontology, number_of_workers=1)

# Register search phrase
manager.register_search_phrase("An ENTITYORG takes over an ENTITYORG")

# Parse documents
manager.parse_and_register_document("Royal Bank of Scotland announces it intends to acquire Brewin Dolphin", "1")
manager.parse_and_register_document("Chipmaker MaxLinear Inc announced on Thursday it will buy Silicon Motion Technology Corp for nearly $4 billion.", "2")
manager.parse_and_register_document("Last month, cybersecurity company Mandiant was purchased by Alphabet", "3")
manager.parse_and_register_document("The Datto takeover by the company Kaseya", "4")

# Perform matching
matches = manager.match()

# Check all documents matched
print(len(matches))
# -> 4

# Extract companies doing the taking over
print([match['word_matches'][0]['document_phrase'] for match in matches])
# -> ['Royal Bank', 'Chipmaker MaxLinear Inc', 'Alphabet', 'Kaseya']

# Extract companies being taken over
print([match['word_matches'][2]['document_phrase'] for match in matches])
# -> ['Brewin Dolphin', 'Silicon Motion Technology Corp', 'cybersecurity company Mandiant', 'Datto']

2. Topic matching

With topic matching, each of several smaller phraselets is matched individually against the document corpus and the results are collated.

2.1 Capturing fuzzy meanings with phraselets

Up until now, we have been looking at structural matching: search phrases have been used to retrieve passages in a document corpus that exactly share their semantic structures. This mechanism prioritizes precision over recall: the larger a search phrase gets, the less likely it is that any passage in the document corpus will match it. The second search mechanism that Holmes offers, topic matching, instead prioritizes recall over precision and focuses on what documents are about rather than exactly what they say.

We saw that structural matching against the document corpus uses each search phrase in its entirety as a single semantic graph. Topic matching, on the other hand, is based on a search query that is also converted into one or more semantic structures, but Holmes goes on to derive from these semantic structures mini-search-phrases, each containing one or two lexical words, that we call phraselets (Figure 9). Each of these phraselets is matched individually against the document corpus and the results are collated: a topic match is a conglomeration of phraselet matches.
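
The step from a semantic structure to phraselets can be pictured with a small sketch. It assumes a toy graph encoding and hypothetical relation labels, and it simply emits one single-word phraselet per lexical word and one two-word phraselet per relationship; the rules Holmes actually applies when deciding which phraselets to generate are more selective than this.

```python
def derive_phraselets(lemmas, edges):
    """Split a toy semantic structure into one-word and two-word phraselets.

    `lemmas` holds the lexical words; `edges` holds (parent, relation, child)
    index triples linking them.
    """
    single_word = [(lemma,) for lemma in lemmas]
    two_word = [
        (lemmas[parent], relation, lemmas[child])
        for parent, relation, child in edges
    ]
    return single_word + two_word


# Hand-written structure for the query fragment "dogs chasing cats".
lemmas = ["dog", "chase", "cat"]
edges = [(1, "agent", 0), (1, "patient", 2)]

for phraselet in derive_phraselets(lemmas, edges):
    print(phraselet)
# ('dog',)
# ('chase',)
# ('cat',)
# ('chase', 'agent', 'dog')
# ('chase', 'patient', 'cat')
```

Each of these small units can match on its own, which is why a document that mentions dogs and chasing without the full structure can still contribute to a topic match.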

Whereas the search phrases used in structural matching must conform to several rules and so have to be purposely composed by expert users, topic matching can be driven by any text, which is useful when it is not possible to control the content and quality of the search input.

Situations for which topic matching is an appropriate choice include:

intelligent search driven by short queries entered in real time
searching for passages within a document corpus that share the topic of another preexisting document that is used directly as a long query to drive the search
triggering alerts whenever an information feed contains a post about a specific subject

Formulating effective search queries

Just as is the case for structural-matching search phrases, topic-matching search queries are best formulated as full, grammatical sentences (e.g. a dog chases a cat) rather than as the loose collections of words (e.g. dog chase cat) typical for standard search engines.

2.2 Scoring topic matches

Topic matches are scored according to a variety of factors including:

how close together their constituent phraselet matches are
whether the phraselet matches consist of one or two words, with two-word matches scoring much higher
whether any two-word phraselet matches overlap
how rare the matched words are within the corpus
any uncertainty that results from words having matched by virtue of embedding similarities or ontology relationships rather than directly

A document passage that has yielded a topic match with a high score will be expressing, in close mutual proximity, many of the same semantic relationships as are contained within the query phrase, and so is likely to be presenting a similar idea or at least to be concerned with the same topic.

Topic matching as a question-answering system

Intelligent search driven by topic matching supports additional word-matching strategies that enable question answering. If the first word in a search query is e.g. Where, Holmes will match this word to spatial adverbial phrases such as In the kitchen in the document corpus and return such phrases as potential answers. Note that the fuzzy nature of topic matching means that whether or not such phrases genuinely answer whatever question has been asked depends on whether or not the other phraselets derived from the query phrase have also matched within the same grammatical structure. This can be determined by calculating the score obtained by matching the query phrase to itself and checking how close the score for a given topic match is to this theoretical maximum.
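
The comparison against the self-match score amounts to a simple ratio, sketched below. The helper names, the example scores and the cut-off of 0.8 are all invented for illustration; only the score key mirrors the topic-match output used elsewhere in this article.

```python
def relative_score(match_score, self_match_score):
    """Express a topic-match score as a fraction of the query's self-match score."""
    if self_match_score <= 0:
        raise ValueError("self_match_score must be positive")
    return match_score / self_match_score


def plausible_answers(topic_matches, self_match_score, minimum_ratio):
    """Keep only the topic matches whose score is close to the theoretical maximum."""
    return [
        topic_match
        for topic_match in topic_matches
        if relative_score(topic_match["score"], self_match_score) >= minimum_ratio
    ]


# Invented scores: 200.0 stands for the score of the query matched against itself.
example_matches = [
    {"document_label": "1", "score": 180.0},
    {"document_label": "2", "score": 40.0},
]
kept = plausible_answers(example_matches, self_match_score=200.0, minimum_ratio=0.8)
print([topic_match["document_label"] for topic_match in kept])  # ['1']
```

A suitable cut-off depends on the corpus and the kind of questions asked, so it is best treated as something to tune rather than a fixed value.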

2.3 Ensuring acceptable performance

The topic matching procedure makes use of a number of performance optimizations including:

starting the matching of two-word phraselets at whichever of the two words is less frequent within the document corpus
not matching two-word phraselets at all where both words are very frequent within the document corpus
only attempting embedding-based matching against words that are relatively rare within the document corpus

These optimizations make it feasible to use topic matching for real-time queries against document corpora consisting of millions of words. However, although topic matching employs multiprocessing to make full use of all the processors available on each machine and is otherwise horizontally and vertically scalable, it would still be unlikely to be cost-effective to apply it to a genuinely massive corpus such as those trawled by mainstream online search engines.
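
The first two optimizations in the list above can be sketched as a small planning function. The word frequencies, the threshold and the function name are assumptions chosen for the example and do not reflect how Holmes stores or configures these values.

```python
from collections import Counter


def starting_word(word_a, word_b, frequencies, very_frequent_threshold):
    """Decide where to start matching a two-word phraselet.

    Returns None when both words are very frequent (the phraselet is skipped),
    otherwise the rarer of the two words, so that fewer candidate positions
    have to be examined.
    """
    frequency_a = frequencies[word_a]
    frequency_b = frequencies[word_b]
    if frequency_a > very_frequent_threshold and frequency_b > very_frequent_threshold:
        return None
    return word_a if frequency_a <= frequency_b else word_b


# Invented corpus frequencies.
frequencies = Counter({"dog": 3, "chase": 1, "thing": 500, "say": 800})

print(starting_word("dog", "chase", frequencies, 100))  # chase
print(starting_word("thing", "dog", frequencies, 100))  # dog
print(starting_word("thing", "say", frequencies, 100))  # None
```

Starting at the rarer word keeps the number of subgraph checks proportional to the smaller of the two frequencies.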

2.4 Trying out topic matching

Intelligent search based on topic matching is exemplified by a demonstration website. The document corpus used for English consists of six Charles Dickens novels. All matching strategies are active. The search engine uses a very small and simple ontology that captures equivalences between names used for the main characters in the books, e.g. the fact that David, Copperfield and David Copperfield form a synonym group.

You can also try out topic matching yourself: install Holmes and enter or copy the Python code below.

# Setup
import holmes_extractor as holmes
manager = holmes.Manager("en_core_web_trf", number_of_workers=1)

# Parse documents
manager.parse_and_register_document("The dog was thinking about whether he wanted to chase the neighbourhood cat.", "1")
manager.parse_and_register_document("The cat kept chasing around and was hoping she wouldn't see a dog anytime soon.", "2")
manager.parse_and_register_document("The children discussed dogs, cats and chasing", "3")

# Perform topic matching
topic_matches = manager.topic_match_documents_against("Increasingly, his life's work appeared to revolve around watching dogs chasing cats.")

# Print the score for each document
print([tm['document_label'] + ": " + str(tm['score']) for tm in topic_matches])

# Print all the topic match information
print(topic_matches)

3. Conclusion

We have seen that Holmes is a general, out-of-the-box tool for information extraction and intelligent search within English or German texts. Please take a few moments to try it out or to have a look at the demonstration website. If you like what you see, please share your experience on social media or check out the repository.