Implementing a Language Model based Similarity with Absolute Discount in Lucene 7

Introduction

While working on Learning To Rank (LTR) test projects, I encountered the need to extract several measures of similarity between a document and a query. As we are using Solr as the core search engine in Datafari, which itself is based on Lucene, I naturally looked at what could be done using those tools. And they already provide a lot of tools ready to use (TF, IDF, TF-IDF, BM25, language model with Dirichlet and Jelinek-Mercer smoothing). But one measure I needed in my work was absent: a language model based similarity with and absolute discount smoothing.

In this blog post, I will first introduce briefly this measure. Then I will present my journey to implement it within Lucene, with all the difficulties I faced. This is not the most elegant way to overcome this problem, but it was sufficient for me. In the conclusion, I will mention other leads that were suggested to me by the kind people of the Lucene developers mailing list. They helped me identify some of the limitations I was facing and directed me to helpful resources to solve my problem.

Langauge Model Based Similarity with Absolute Discount Smoothing

Most of the readers coming here must be familiar with the concept of text based search  engine, the problem of the similarity and the well known TF-IDF and most recent BM 25 measures. And while some of you might know how language model can be used to define a similarity, others may wonder how this would work.

Simply put, the idea is that we can build a language from each document individually. It is then possible to evaluate the pertinence of a document with respect to a query by evaluating the probability that the language model of this document has to generate the text of the query.However, if a document does not contain a certain term, the pure language model would assign this document a 0% probability of generating any query containing that term. That is where smoothing come into play, it allows any term to have a probability of being generated. It is usually based on the corpus statistics plus a small value for any unknown term. This is a very simple explanation to give you an insight of the method.

Although this does sound very different from the classic methods, it has been shown that using language model together with a smoothing method is actually close to what is done using tf-idf. I encourage you to read https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.8019&rep=rep1&type=pdf for example if you want more information on the subject.

What I need is to implement the absolute discount smoothing of the paper cited above (the other two measures are already part of Lucene) and the function looks like this: $\log(1+\frac{\max(docFreq – \delta,0)}{\delta*d_u*p(wc)}) + \log(\delta*\frac{d_u}{|d|})$

where:

  • $docFreq$ is the number of occurrences of the searched term in the document
  • $\delta \in ]0,1[$ is a parameter
  • $d_u$ is the number of unique terms in the current document
  • $p(wc)$ is the probability that the corpus language model generates the term searched term (w)
  • $|d|$ is the total number of terms in the current document

Let’s see how we can implement that into Lucene !

Implementation into Lucene

The other two version of smoothing from the previously mentioned paper have already been implemented as Similarity into Lucene. This means in particular that the corpus language model, used to compute $p(wc)$ is already present if we decide to reuse the same class structure used for those existing implementations.

To reuse the language model computation code already developed, one only has to extend the class org.apache.lucene.search.similarities.LMSimilarity, and then, within the score function, it is possible to access the collection probability through the stats object provided and filled by the LMSimilarity class. If you want an example, you look at the LMDirichletSimilarity class.

The number of occurrences of a term withing the document is provided by default to the score function of any Similarity in Lucene as a double parameter named freq. So that is an easy catch.

We now need the length of the document (or number of terms in the document) $|d|$ and the number of unique terms in the document $d_u$. Those value can’t be accessed from the score funtion, which is called at query time. Therefore, we need to find another way to get access to them.

A similarity function is called at query time obviously, but it is also called during indexation for each document, giving a chance to the class to store a normalization parameter. Score normalization based on document length is well known and allows for better ranking. I won’t go into the details about that here, but this mechanism was a chance to store the length of the document, and it is actually what is stored in this place by the LMSimilarity class. Great, we have access to $|d|$.

From there, I tried to find a way to store $d_u$. I thought it could be done using NumeriDocValuesFields as explained in the Similarity class documentation at the time. However, after asking the mailing list what was the best way to do that, they explained that the support for adding such values would be dropped from Lucene 8 and it would be better to move to another solution. They directed me to the implementation of a query instead of a Similarity, with the drawbacks of having to manually populate docvalue fields with the expected values (the unique term count in my case).

That was really kind of them to answer my questions and provide me with explanations and a workaround. However, digging through how to implement a query did not seem appealing to me given the time I invested understanding the Similarity stack. And because I did not need a distribution ready clean implementation (I only need to extract the information and store it for later use in experiments), I opted for another solution.

I basically implemented two separate similarities, that could be added up to obtain the final result.The function I need to compute is a summation, so I decided to implement each term in a separate Similarity class. Hopefully, this was possible because the first term requires only $d_u$ as a normalization factor, and for the second term, I could compute the ration $\frac{d_u}{|d|}$ as the normalization factor (more on why I took the inverse in a moment).

Then, implementing that should just be a matter of extending LMSimilarity in one case and BasicSimilarity in the other case and overriding the score function right ?

Well, it did not turn out that simple.

LMSimilarity computes the normalization factor itself, and it is not possible to override this. Therefore, I had to copy the entire class to create my own and do the modification, because I needed $d_u$ as the normalization factor and not the document length that was computed originally. So that’s what I did I was finally able to get the first term of my similarity computed !

Then I worked on the second term. Through the discussions with the Lucene’s devs, I learned that a Similarity must respect a few restrictions:

  • It must be positive
  • It must not increase when the normalization factor increases

Keeping that in mind, I tried to find an elegant way to implement my second term. It turned out to be pretty simple: I computed the opposite of the term.

Indeed, $\frac{d_u}{|d|}\in]0,1]$ by definition (there is less unique terms than the total number of terms and both are positive values). $\delta\in]0,1]$ by definition of the parameter too. Therefore $0 < \delta * \frac{d_u}{|d|} \leq 1$ which means that $\log( \delta * \frac{d_u}{|d|}) \leq 0$ and thus $-\log( \delta * \frac{d_u}{|d|}) \geq 0$. Great !

As for the other constraint, $\log$ is an increasing function, so $-\log$ is a decreasing function, therefore $-\log( \delta * \frac{d_u}{|d|})$ will decrease when $\frac{d_u}{|d|}$ increases.

We have both of our terms, we can use them as separate similarities in Lucene / Solr and compute $term1 – term2$ to get the final metric that was originally needed. This requires duplicating the fields and configuring a different similarity for each field to be able to retrieve both terms, but it was sufficient for my needs.

All of the code related to this work is available on github if you want to have a look at it: https://github.com/Endmaril/LMAbsSimilarity.