Romaric Pighetti | France Labs Enterprise Search Blog

Introduction

While working on Learning To Rank (LTR) test projects, I needed to extract several measures of similarity between a document and a query. Datafari uses Solr as its core search engine, and Solr is itself based on Lucene, so I naturally looked at what could be done with those tools. They already provide many ready-to-use similarities (TF, IDF, TF-IDF, BM25, language models with Dirichlet and Jelinek-Mercer smoothing). But one measure I needed for my work was absent: a language model based similarity with absolute discount smoothing.

In this blog post, I will first briefly introduce this measure. Then I will present my journey to implement it within Lucene, with all the difficulties I faced. My solution is not the most elegant way to overcome this problem, but it was sufficient for my needs. In the conclusion, I will mention other leads that were suggested to me by the kind people of the Lucene developers mailing list. They helped me identify some of the limitations I was facing and directed me to helpful resources to solve my problem.
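
As a rough illustration of the measure itself, here is a minimal Python sketch of absolute discount smoothing in its common textbook form: a fixed discount delta is subtracted from each observed term count, and the freed probability mass is redistributed according to the collection language model. This is not the Lucene implementation discussed in the post; the function names, the `collection_prob` dictionary and the default value of `delta` are assumptions made for the example.

```python
import math
from collections import Counter


def absolute_discount_prob(term, counts, doc_len, collection_prob, delta=0.7):
    """Smoothed probability p(term | document) with absolute discounting.

    counts          -- Counter of term frequencies in the document
    doc_len         -- total number of tokens in the document
    collection_prob -- dict mapping a term to p(term | collection) (assumed given)
    delta           -- discount in [0, 1]; 0.7 is only an illustrative default
    """
    background = collection_prob.get(term, 0.0)
    if doc_len == 0:
        # An empty document carries no evidence: fall back to the collection model.
        return background
    # Discounted maximum-likelihood part: remove delta from each observed count.
    discounted = max(counts[term] - delta, 0.0) / doc_len
    # Mass freed by the discount: delta times the number of unique terms.
    backoff_weight = delta * len(counts) / doc_len
    return discounted + backoff_weight * background


def score(query_terms, doc_terms, collection_prob, delta=0.7):
    """Query log-likelihood under the smoothed document language model."""
    counts = Counter(doc_terms)
    total = 0.0
    for term in query_terms:
        p = absolute_discount_prob(term, counts, len(doc_terms), collection_prob, delta)
        if p <= 0.0:
            # The term is unknown to both the document and the collection model.
            return float("-inf")
        total += math.log(p)
    return total
```

A term that is absent from the document still receives a non-zero probability as long as the collection model knows it, which is the point of smoothing.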


How does the tagger work?

The tagger handler relies on a dedicated collection in which it stores the entities to be extracted. In this collection, one field is used to store the texts used to recognize each entity, and you may create as many other fields as you want to store other useful information about your entities.
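
To make the idea concrete, here is a toy in-memory sketch in Python. It does not use Solr or the tagger handler's actual API: the entity collection is simulated with a list of dictionaries, and the field names (`name_tag`, `type`, `country`) are hypothetical. It also matches plain substrings, whereas a real tagger works on analyzed tokens.

```python
# Toy stand-in for the dedicated entity collection (hypothetical field names).
# "name_tag" plays the role of the field holding the text used to recognize
# each entity; the other fields carry extra information about it.
ENTITIES = [
    {"id": "1", "name_tag": "Paris", "type": "city", "country": "France"},
    {"id": "2", "name_tag": "Solr", "type": "software", "country": None},
]


def tag(text, entities, tag_field="name_tag"):
    """Return (start, end, entity) for every occurrence of an entity name.

    Matching is a case-insensitive substring search, a deliberate
    simplification for this illustration.
    """
    lowered = text.lower()
    matches = []
    for entity in entities:
        surface = entity[tag_field].lower()
        start = lowered.find(surface)
        while start != -1:
            matches.append((start, start + len(surface), entity))
            start = lowered.find(surface, start + 1)
    # Report matches in the order in which they appear in the text.
    return sorted(matches, key=lambda match: match[0])


for start, end, entity in tag("We index Paris with Solr", ENTITIES):
    print(start, end, entity["id"], entity["type"])
```

Each match gives the offsets of the recognized text together with the full entity record, so the extra fields stored alongside the name are available to whatever consumes the tags.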

Author Archives: Romaric Pighetti

Implementing a Language Model based Similarity with Absolute Discount in Lucene 7


Entity Extraction Using the Tagger Handler (aka SolrTextTagger)
