Implementing a Language Model based Similarity with Absolute Discount in Lucene 7

Introduction

While working on Learning To Rank (LTR) test projects, I encountered the need to extract several measures of similarity between a document and a query. As we are using Solr as the core search engine in Datafari, which itself is based on Lucene, I naturally looked at what could be done using those tools. And they already provide a lot of tools ready to use (TF, IDF, TF-IDF, BM25, language model with Dirichlet and Jelinek-Mercer smoothing). But one measure I needed in my work was absent: a language model based similarity with and absolute discount smoothing.

In this blog post, I will first introduce briefly this measure. Then I will present my journey to implement it within Lucene, with all the difficulties I faced. This is not the most elegant way to overcome this problem, but it was sufficient for me. In the conclusion, I will mention other leads that were suggested to me by the kind people of the Lucene developers mailing list. They helped me identify some of the limitations I was facing and directed me to helpful resources to solve my problem.

Continue reading