Implementing a Language Model based Similarity with Absolute Discount in Lucene 7

Introduction

While working on Learning To Rank (LTR) test projects, I needed to extract several measures of similarity between a document and a query. Since Datafari uses Solr as its core search engine, and Solr is itself based on Lucene, I naturally looked at what those tools could do. They already provide many ready-to-use similarities (TF, IDF, TF-IDF, BM25, language models with Dirichlet and Jelinek-Mercer smoothing). But one measure I needed for my work was absent: a language model based similarity with absolute discount smoothing.

In this blog post, I will first briefly introduce this measure. Then I will describe how I implemented it in Lucene, along with the difficulties I faced. My solution is not the most elegant one, but it was sufficient for my needs. In the conclusion, I will mention other leads suggested to me by the kind people of the Lucene developers mailing list, who helped me identify some of the limitations I was facing and pointed me to helpful resources.
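To give a rough idea of the measure before going further, here is a minimal sketch of the usual textbook formulation of absolute discount smoothing: a fixed discount delta is subtracted from each term count, and the freed probability mass is redistributed according to the collection model. The class name, method name and parameters are hypothetical and chosen for illustration only; this is not Lucene's API, and not necessarily the implementation described in the post.

```java
// Hypothetical helper illustrating absolute discount smoothing.
// It is a standalone sketch and does not use or mimic any Lucene class.
final class AbsoluteDiscountSketch {

    private AbsoluteDiscountSketch() {
    }

    /**
     * Smoothed probability p(w|d) of a term given a document:
     *   max(tf - delta, 0) / docLen + (delta * uniqueTerms / docLen) * p(w|C)
     *
     * @param tf             occurrences of the term in the document
     * @param docLen         total number of tokens in the document
     * @param uniqueTerms    number of distinct terms in the document
     * @param collectionProb p(w|C), the term's probability in the whole collection
     * @param delta          discount, assumed to lie in [0, 1]
     */
    static double probability(long tf, long docLen, long uniqueTerms,
                              double collectionProb, double delta) {
        if (docLen <= 0) {
            throw new IllegalArgumentException("docLen must be positive");
        }
        if (delta < 0.0 || delta > 1.0) {
            throw new IllegalArgumentException("delta must be in [0, 1]");
        }
        // Discounted maximum-likelihood estimate for terms seen in the document.
        double discounted = Math.max(tf - delta, 0.0) / docLen;
        // Probability mass freed by the discount, given back via the collection model.
        double sigma = delta * uniqueTerms / docLen;
        return discounted + sigma * collectionProb;
    }
}
```

For instance, `AbsoluteDiscountSketch.probability(3, 100, 60, 0.001, 0.7)` combines a discounted estimate of (3 - 0.7) / 100 with a smoothing weight of 0.7 * 60 / 100 applied to the collection probability. A term absent from the document (tf = 0) still receives a non-zero probability through the second part.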

Entity Extraction Using the Tagger Handler (aka SolrTextTagger)

With release 7.4, the Solr team integrated SolrTextTagger into the core of Solr. This tool, which had been maintained separately for years (https://github.com/OpenSextant/SolrTextTagger), is now packaged with Solr and ready to use through a dedicated handler. In this post, we will first walk you through the configuration steps needed to set it up. These are presented in Solr’s documentation (https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html), but we repeat them here for the sake of completeness. We will then present ideas on how to use it in your indexing and search pipeline to enhance the users' search experience.

How does the tagger work?

The tagger handler relies on a dedicated collection in which it stores the entities to be extracted. In this collection, one field is used to store the texts used to recognize each entity, and you may create as many other fields as you want to store other useful information about your entities.
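To make the idea concrete, here is a purely conceptual sketch in Java. It uses an in-memory map as a stand-in for the entity collection and plain substring matching in place of Solr's actual tagging, so it does not reflect how the handler is implemented or called. All class and method names are hypothetical, and the offsets are only reliable for text whose length does not change when lowercased.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical, simplified illustration of dictionary-based tagging.
// It is NOT the Solr tagger handler: matching is by substring, not by tokens.
final class DictionaryTaggerSketch {

    /** One match: character offsets in the input text plus the matched entity's id. */
    static final class Tag {
        final int startOffset;
        final int endOffset;
        final String entityId;

        Tag(int startOffset, int endOffset, String entityId) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.entityId = entityId;
        }
    }

    // Stand-in for the entity collection: lowercased name text -> entity id.
    private final Map<String, String> namesToIds = new LinkedHashMap<>();

    /** Registers the text used to recognize an entity. */
    void addEntity(String id, String name) {
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("name must not be empty");
        }
        namesToIds.put(name.toLowerCase(Locale.ROOT), id);
    }

    /** Returns every case-insensitive occurrence of a registered name in the text. */
    List<Tag> tag(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        List<Tag> tags = new ArrayList<>();
        for (Map.Entry<String, String> entry : namesToIds.entrySet()) {
            String name = entry.getKey();
            int from = 0;
            int idx;
            while ((idx = lower.indexOf(name, from)) >= 0) {
                tags.add(new Tag(idx, idx + name.length(), entry.getValue()));
                from = idx + 1;
            }
        }
        return tags;
    }
}
```

Any other information about an entity (type, coordinates, and so on) would live in additional fields of the real collection and could be looked up from the returned id.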
