Supervised similarity: Learning symmetric relations from duplicate question data

by Matthew Honnibal on

Supervised models for text-pair classification let you create software that assigns a label to two texts, based on some relationship between them. When the relationship is symmetric, it can be useful to incorporate this constraint into the model. This post shows how a siamese convolutional neural network performs on two duplicate question data sets, with experimental results and an interactive demo.

The task of detecting duplicate content occurs on many different platforms. We see it all the time on the spaCy issue tracker, where we're always trying to figure out which threads can be merged. Duplicate detection is also relevant to most discussion forums, where the same questions are asked repeatedly. Fortunately, there are now labelled data sets from two large community question answering sites: the recently-released Quora data set and the StackExchange corpus compiled by researchers from Melbourne University.

Try the demo

For comparison, we've also included an unsupervised baseline in the demo, which computes a simple word vector average, using vectors from the GloVe common crawl model. The superised models learn a very different similarity definition from this unsupervised baseline, which roughly represents the overlap in topics between the documents. Supervision is often discussed as a disadvantage, because it makes you use labelled data. However, it's also a crucial advantage: it lets you use labelled data, to customize the relationships being classified. Without supervision, you're stuck with whatever default relationship the unsupervised algorithm happens to recover.

Siamese networks for symmetric classification

The Quora and StackExchange data sets are labelled according to whether two questions are duplicates. This relationship should be both commutative and transitive. We don't want to compute different results for is_dup(A, B) and is_dup(B, A) — this is the same question, and the model should treat it as such. Similarly, if we know is_dup(A, B) and is_dup(B, C), it should hold that is_dup(A, C).

We can learn a function that obeys these constraints by using a "Siamese" architecture (for the record, I hate this name). The difference in architecture from the asymmetric model I've discussed previously is fairly small. As before, we first encode the sentences separately, but instead of an arbitrary non-linearity, we use a distance function to produce the prediction. Here's a sketch of how the model is put together:

Siamese network outlinedef Siamese(text2vec, similarity_metric):
    def forward(text1, text2):
        vec1, bp_vec1 = text2vec(text1)
        vec2, bp_vec2 = text2vec(text2)
        # If we were doing an asymmetric model, we'd have:
        # sim, bp_sim = multi_layer_perceptron(concatenate(vec1, vec2))
        # cat(vec1, vec2) differs from cat(vec2, vec1) -- hence the asymmetry.
        sim, bp_sim = similarity_metric(vec1, vec2)

        def backward(d_sim, optimize):
            d_vec1, d_vec2 = bp_sim(d_sim, optimize)
            d_text1 = bp_vec1(d_vec1, optimize)
            d_text2 = bp_vec2(d_vec2, optimize)
            return d_text1, d_text2
        return sim, backward
    return forward

The Siamese function above takes two functions, text2vec and similarity_metric. It uses the text2vec function to separately encode each text in the input, and then uses similarity_metric to compare them. Each function is assumed to return a callback to complete its backward pass. Given this, the backpropagation logic of the Siamese network is very simple. Each callback returns the gradient with respect to the original function's inputs, given the gradient of the original function's output. For the similarity metric, I've been using a distance function taken from Chen (2013), which he terms Cauchy Similarity:

Cauchy Similaritydef ChenCauchy(length):
    '''Create a trainable similarity function, that will return the similarity
    and a callback to compute the backward pass given the gradient.

    An optimizer can be passed to the callback to update the weights, e.g.
    Adam, SGD momentum, etc.
    weights = numpy.ones((1, length,))

    def forward(x1, x2):
        diff = x1-x2
        dist_vec = diff**2
        weighted_dist =
        weighted_dist *= weighted_dist > 0
        sim = 1. / (1+weighted_dist)

        def backward(d_sim, optimize):
            d_weighted_dist = d_sim * (-1 / (weighted_dist+1)**2)
            d_weighted_dist *= weighted_dist > 0
            d_weights  = d_weighted_dist * dist_vec
            d_dist_vec = d_weighted_dist * weights
            d_diff = 2 * d_dist_vec * diff
            d_x1 = d_diff
            d_x2 = -d_diff
            optimize(weights, d_weights)
            return d_x1, d_x2
        return sim, backward
    return forward
Recommended readingI recommend Chen's discussion of the Siamese network. Because it's a Master's thesis, the document takes the time to discuss the considerations carefully, and goes through everything step-by-step. It's a nice change from publications in competitive venues, which are written quickly and have an interest in presenting their ideas as novel and exciting.

For the text2vec function, I've been using the convolutional layer I introduced in the previous post, Maxout Window Encoding. The MWE layer has the same aim as the BiLSTM: extract better word features. It rewrites the vector for each word based on the surrounding context. This is useful, because it gets around a major limitation of word vectors. We know that a word like "duck" can have multiple meanings, and we'd like a vector that reflects the meaning in context.

Hover over the vectors to see which words were used to compute them. Hover over the words to see the vectors they influenced.
Why maxout?For the MWE unit to work, it needs to learn a non-linear mapping from the trigram down to a shorter vector. You could use any non-linearity here, but I've found maxout to work quite well. The logic is that adding capacity to the layer by increasing the width M is quite expensive, because our weights layers will be (M, 3*M). The maxout unit instead lets us add capacity by adding another dimension instead. I usually use two or three pieces.

The figure above shows how a single MWE block rewrites the vector for each word given evidence for the two words immediately surrounding it. You can think of the output as trigram vectors — they're built on the information from a three-word window. By simply adding another layer, we'll get vectors computed from 5-grams — the receptive field widens with each layer we go deeper. Here's the full model definition, using Thinc, the library I've been developing for spaCy 2.0's neural network models:

Model definitiondef build_siamese_network(width, depth):
    embed = StaticVectors('en', width)
    pooling = concatenate(mean_pool, max_pool)
    mwe_encode = chain(ExtractWindow(nW=1), Maxout(width))

    # Define a little DSL for block, for convenience.
    with Model.define_operators({'>>': chain, '**': clone}):
        sent2vec = (
            >> flatten_add_lengths
            >> with_getitem(0, embed >> mwe_encode ** depth)
            >> pooling
    model = Siamese(sent2vec, CauchySimilarity(Model.ops, width*2))
    return model

After the MWE layer, we have two matrices, one for each text. The matrices may be of different lengths, and we need to output a single similarity score. The next step is the weakest part of the model: to compare these matrices, we reduce them to two vectors, by taking their elementwise mean and their elementwise max. Of the two operations, the max tends to be more informative — but using both tends to be slightly better than just using the max.

Results and notable examples

The table below shows development set accuracies on the Quora and StackExchange data. Since neither corpus comes with a designated train/dev/test split, I've been using a random 10% and a random 30% partition. All results are still preliminary, and the models' hyper-parameters have not been tuned successfully.MethodologyI'm using the Adam optimizer with averaged parameters and no learning rate decay. Batch size increases from 1 to 128 by 0.01% each iteration. Width is set to 128, and 3 pieces are used for the maxout units. No dropout is applied. An L2 penalty of 1e-6 is used. Static GloVe common crawl embeddings are used, with a trained affine projection to width 128.

To train the model on the StackExchange corpus, we compiled all train pairs from the subforums android, gis, mathematica, programmers, stats, text, unix, webmasters and wordpress. We then extracted all duplicates and two subsequent negative examples.

MWE Depth 0MWE Depth 2MWE Depth 0MWE Depth 2

Despite these caveats, the improvement in accuracy from the symmetric network has been quite consistent. On the Quora data, accuracy improves by 2.3% — a bigger improvement than I've seen from anything else I've been trying. The Maxout Window Encoding layers also seem to help, although the inconsistency of the results makes it difficult to be sure.

It's interesting to compare the outputs of the two models, especially in comparison to a simple unweighted mean of the GloVe vector assigned to each word. Take a look at the following examples in our interactive demo:

Text 1Text 2BaselineQuoraStackExchangeDemo
How to manage success dependency between unit testsHow to structure tests where one test is another test's setup?86%12%47%View
Design pattern for overlapping actions and animations?How do you handle multiple users editing the same piece of data in a webapp?79%8%15%View
How can I get really good at web development?What should I do to become an amazing programmer?93%64%100%View
Where can I find a place to eat pizza?What's the closest Italian restaurant?84%6%46%View
Which companies pay the best in Nebraska?Which companies pay the best in Montana?98%8%69%View

The default similarity model, which takes a simple vector average, skewed high on most of the examples we tried. Much of the difference in output of the Quora and StackExchange models can be explained by the different domains of the text they were trained on. You can also see the effect of the moderation policy, as this controls the definition of duplicates. For instance, questions which differ in details, such as place, are never regarded as duplicates in the Quora data, so the model learns to pay attention to single named entities.

None of these models are doing a better or worse job at uncovering the "true" similarity of the text pairs they're classifying — there's no such thing. Because meaning is so multi-dimensional, pieces of text are always similar in some respects and different in others. The labelling you need will therefore always depend on which relationships are important to your application.

If what you're trying to get out of a similarity score is an intent label, you probably want to regard two sentences with the same verb but different object as similar — they will both trigger the same function. Alternatively, if you're trying to cluster opinions in product reviews, the object is probably the decisive dimension. There's no way for the algorithm to guess what you want, unless you tell it — with example data. That's why supervised methods are so useful.

Matthew Honnibal

About the Author

Matthew Honnibal

Matthew is a leading expert in AI technology, known for his research, software and writings. He completed his PhD in 2009, and spent a further 5 years publishing research on state-of-the-art natural language understanding systems. Anticipating the AI boom, he left academia in 2014 to develop spaCy, an open-source library for industrial-strength NLP.

Join our mailing list

Stay in the loop!