With release 7.4, the Solr team integrated SolrTextTagger into the core of Solr. This tool, which had been maintained separately for years (https://github.com/OpenSextant/SolrTextTagger), is now bundled with Solr and ready to use through a dedicated handler. In this blog post we will first walk you through the configuration steps needed to set it up. These are presented in Solr’s documentation (https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html), but we repeat them here for the sake of completeness. We will then present ideas on how to use it in your indexing and search pipeline to enhance the search experience of your users.
How does the tagger work?
The tagger handler relies on a dedicated collection in which it stores the entities to be extracted. In this collection, one field is used to store the texts used to recognize each entity, and you may create as many other fields as you want to store other useful information about your entities.
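To make this more concrete, here is a minimal sketch of how a client could query the tagger once such a collection exists. It only builds the HTTP request and does not send it. The collection name `entities`, the field names `id` and `name`, and the handler path `/tag` are assumptions chosen for illustration; the actual path depends on how you declare the handler in your solrconfig.xml.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_tag_request(text,
                      base_url="http://localhost:8983/solr",
                      collection="entities",      # hypothetical collection name
                      fields=("id", "name")):     # hypothetical stored fields
    """Build a POST request that sends raw text to a Solr tagger handler."""
    params = urlencode({
        "overlaps": "NO_SUB",      # drop tags fully contained in a longer tag
        "fl": ",".join(fields),    # entity fields returned for each match
        "wt": "json",
    })
    # "/tag" is assumed here; use the path configured in your solrconfig.xml.
    url = f"{base_url}/{collection}/tag?{params}"
    return Request(
        url,
        data=text.encode("utf-8"),             # the text to tag is the request body
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


req = build_tag_request("Hello New York City")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` against a running Solr instance should return a JSON response listing the matched spans of text together with the corresponding entity documents.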
In this tutorial, we will demonstrate how to do basic entity extraction in Datafari Community. This post is inspired by https://lucidworks.com/2013/06/27/poor-mans-entity-extraction-with-solr/
Note that for Datafari Enterprise, all the configuration is already done. You just need to add your custom rules in a dedicated UI. For more advanced functionality, Datafari Enterprise lets you benefit from SolrTextTagger and third-party semantic entity extractors.
We want to extract 3 entities in our dataset (files from the Enron dataset in this example):
- Phone number
- If the document is a resume
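As a rough illustration of what rule-based extraction of the entities listed above can look like, here is a small standalone sketch using regular expressions. It is not the Datafari or Solr configuration itself: the patterns (US-style phone numbers, a few resume keywords) and the function name `extract_entities` are assumptions made for this example, and a keyword such as "resume" will also match the verb, so expect false positives.

```python
import re

# Assumed pattern: US-style numbers such as (713) 853-1234 or 713-853-1234.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")

# Assumed heuristic: a document mentioning one of these terms is a resume.
RESUME_RE = re.compile(r"\b(?:curriculum vitae|resume|résumé)\b", re.IGNORECASE)


def extract_entities(text):
    """Return the phone numbers found in text and a resume flag."""
    return {
        "phone": PHONE_RE.findall(text),
        "is_resume": RESUME_RE.search(text) is not None,
    }


sample = "Curriculum Vitae\nJohn Doe, phone (713) 853-1234 or 713-853-5678"
print(extract_entities(sample))
```

In a Solr-based pipeline, the same kind of pattern would typically be applied at indexing time so that the extracted values end up in dedicated fields usable as facets.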
NOTE: This is the English version. For the French version, please scroll down.
UPDATE 08/08/16: this post has been updated for Datafari v3.
UPDATE 01/04/16: beware that Docker Toolbox 1.9.1 has a bug that affects Cassandra (which is a component of Datafari). Update your Docker to 1.10+.
This time, we’ll talk about the release of Datafari on Docker.
If you don’t know it yet, Docker is a containerization mechanism that relies on isolation features of the Linux kernel rather than on emulating a full machine, which makes it lighter and faster than widespread virtualisation technologies such as VMware. As its name suggests, you can “dock” applications in an isolated manner, and each one runs as if it were a standalone system on your OS.
Although we recommend installing Datafari alone on a system when it is used in a production environment, using Datafari on Docker allows you to quickly install Datafari without impacting the configuration and packages already in place on your system. Just download the Docker image, and Docker takes care of the rest.