Entity Extraction Using the Tagger Handler (aka SolrTextTagger)

With its release 7.4, the Solr team integrated SolrTextTagger into the core of Solr. This tool, which had been maintained separately for years (https://github.com/OpenSextant/SolrTextTagger), is now packaged with Solr and ready to use through a dedicated handler. In this blog we will first step you through the configuration needed to set it up. These steps are presented in Solr’s documentation (https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html) but we repeat them here for the sake of completeness. We will then present ideas on how to use the tagger in your indexation and search pipeline to enhance the search experience of your users.

How does the tagger work?

The tagger handler relies on a dedicated collection in which it stores the entities to be extracted. In this collection, one field is used to store the texts used to recognize each entity, and you may create as many other fields as you want to store other useful information about your entities.

Assume we want to recognize city names in our documents. We can have fields storing the timezone, the location (longitude and latitude) and the country of each city. The “tag” field would contain the different names used to designate the city, such as “New York City” and “NYC” for example.

Once the collection is created and populated, and the handler properly configured, you can use the handler by passing it text and receiving the list of entities found in that text. The matching is done only against the text stored in the “tag” field, but you can ask the tagger to return any other fields of the entity using the standard fl parameter.
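
To make this more concrete, here is a minimal sketch (assuming the Python requests library) of what one city entity could look like if it were indexed directly as a JSON document. The field names mirror those used when we load the geonames data later in this post, where the entities are loaded in bulk from a CSV dump instead:

import requests

# A hypothetical "New York City" entity: "name" holds all the ways the city may
# be written (and is copied into the tag field by the schema set up below);
# the other fields are only there to be returned alongside a match.
city = {
    "id": "5128581",
    "name": ["New York City", "NYC"],
    "countrycode": "US",
    "timezone": "America/New_York",
    "latitude": 40.71427,
    "longitude": -74.00597,
}

requests.post(
    "http://localhost:8983/solr/geonames/update",
    params={"commit": "true"},
    json=[city],
)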

We now detail the configuration and a first use case.

Solr Configuration

All the commands detailed here assume that you have already launched Solr and that it is running on the default port 8983.

We will use the same example as the Solr documentation, but with the name and name_tag fields set as multivalued. If you already know this part, you might want to skip ahead to the later sections for tips on how to use the tagger in a search workflow.

For this tutorial we will use the same “data driven” (or “schemaless”) setup as the Solr documentation, for simplicity. You will, however, need to set up an optimized schema in a production environment.

We first create a new collection:

solr create -c geonames

We then set up the minimal schema we need for the tagger to work:

curl -X POST -H 'Content-type:application/json'  http://localhost:8983/solr/geonames/schema -d '{
  "add-field-type":{
    "name":"tag",
    "class":"solr.TextField",
    "postingsFormat":"FST50",
    "omitNorms":true,
    "omitTermFreqAndPositions":true,
    "indexAnalyzer":{
      "tokenizer":{
         "class":"solr.StandardTokenizerFactory" },
      "filters":[
        {"class":"solr.EnglishPossessiveFilterFactory"},
        {"class":"solr.ASCIIFoldingFilterFactory"},
        {"class":"solr.LowerCaseFilterFactory"},
        {"class":"solr.ConcatenateGraphFilterFactory", "preservePositionIncrements":false }
      ]},
    "queryAnalyzer":{
      "tokenizer":{
         "class":"solr.StandardTokenizerFactory" },
      "filters":[
        {"class":"solr.EnglishPossessiveFilterFactory"},
        {"class":"solr.ASCIIFoldingFilterFactory"},
        {"class":"solr.LowerCaseFilterFactory"}
      ]}
    },

  "add-field":{"name":"name", "type":"text_general", "multiValued":true},

  "add-field":{"name":"name_tag", "type":"tag", "stored":false, "multiValued":true},

  "add-copy-field":{"source":"name", "dest":["name_tag"]}
}'

We deviate slightly from the Solr tutorial here by making the name and name_tag fields multivalued.

And then we need to configure the request handler:

curl -X POST -H 'Content-type:application/json' http://localhost:8983/solr/geonames/config -d '{
  "add-requesthandler" : {
    "name": "/tag",
    "class":"solr.TaggerRequestHandler",
    "defaults":{"field":"name_tag"}
  }
}'

Adding some data

As you might have guessed, this tutorial is based on the same dataset as the Solr documentation: the geonames dataset, and in particular its subset called cities1000 (http://download.geonames.org/export/dump/cities1000.zip). Download and unzip the file, then run the following command to load the data into the geonames collection:

curl -X POST --data-binary @/path/to/cities1000.txt -H 'Content-type:application/csv' \
  'http://localhost:8983/solr/geonames/update?commit=true&optimize=true&separator=%09&encapsulator=%00&fieldnames=id,name,,,latitude,longitude,,,countrycode,,,,,,population,elevation,,timezone,lastupdate'

We are not importing the alternative names here; instead we will add them to the name field, which is multivalued for this specific purpose. The following script updates all the city entries, adding all of their alternative names to the name field. You can tune it to add only some of the alternative names if you wish, which will lead to fewer ambiguous tag detections.

Be aware that this script assumes that Solr is running on localhost on port 8983. Please adjust it if your configuration differs.

#!/bin/bash
# For each city in cities1000.txt, add all of its alternative names
# (4th tab-separated column) to the "name" field, one atomic update per name.
while IFS= read -r line
do
  names=$(cut -d $'\t' -f 4 <<< "$line")
  id=$(cut -d $'\t' -f 1 <<< "$line")
  IFS=',' read -ra names_table <<< "$names"
  for name in "${names_table[@]}"
  do
    echo "$id,$name"
    payload='[{"id":"'$id'","name":{"add":["'$name'"]}}]'
    echo "$payload"
    curl -X POST --data-binary "$payload" -H 'Content-type:application/json' 'http://localhost:8983/solr/geonames/update'
  done
  curl -X GET 'http://localhost:8983/solr/geonames/update?commit=true'
done < "$1"
curl -X GET 'http://localhost:8983/solr/geonames/update?commit=true&optimize=true'

Usage:

./script.sh path/to/cities1000.txt

You can see that this script performs one commit per city and one final commit with optimize=true at the end. Due to the number of cities and alternative names, the script may take a few minutes to execute.
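
If the per-city commits are too slow for your needs, the same update can be done in batches with a single commit at the end. Here is a minimal sketch of that approach, assuming the Python requests library and the same Solr URL as the rest of this tutorial:

import requests

# Batched alternative to the shell script above: one atomic update per city
# with all of its alternative names (4th tab-separated column), sent in
# batches, with a single commit/optimize at the end.
SOLR_UPDATE_URL = "http://localhost:8983/solr/geonames/update"
BATCH_SIZE = 1000

def alternative_name_updates(path):
    """Yield one atomic-update document per city from the cities1000.txt dump."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            columns = line.rstrip("\n").split("\t")
            doc_id, alt_names = columns[0], columns[3]
            names = [n for n in alt_names.split(",") if n]
            if names:
                yield {"id": doc_id, "name": {"add": names}}

batch = []
for update in alternative_name_updates("cities1000.txt"):
    batch.append(update)
    if len(batch) >= BATCH_SIZE:
        requests.post(SOLR_UPDATE_URL, json=batch).raise_for_status()
        batch = []
if batch:
    requests.post(SOLR_UPDATE_URL, json=batch).raise_for_status()

requests.get(SOLR_UPDATE_URL, params={"commit": "true", "optimize": "true"}).raise_for_status()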

Ready to tag

At this point, we have a configuration similar to the one provided in the Solr documentation, except that the alternative names of cities can also be used to detect cities in texts.

If you imported all the alternative names, the following request and results should be reproducible:

curl -X POST   'http://localhost:8983/solr/geonames/tag?overlaps=NO_SUB&tagsLimit=5000&fl=id,name,countrycode&wt=json&indent=on'   -H 'Content-Type:text/plain' -d 'Hello NYC. You know Lyon ?'
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "tagsCount":2,
  "tags":[[
      "startOffset",6,
      "endOffset",9,
      "ids",["5128581"]],
    [
      "startOffset",20,
      "endOffset",24,
      "ids",["2996944"]]],
  "response":{"numFound":2,"start":0,"docs":[
      {
        "id":"2996944",
        "name":["Lyon"],
        "countrycode":["FR"]},
      {
        "id":"5128581",
        "name":["New York City",
          "NYC"],
        "countrycode":["US"]}]
  }}

This looks fine: the tagger found the reference to New York City in the US and the reference to Lyon in France.

With Some Limitations

Let’s try a more complex example:

curl -X POST   'http://localhost:8983/solr/geonames/tag?overlaps=NO_SUB&tagsLimit=5000&fl=id,name,countrycode&wt=json&indent=on'   -H 'Content-Type:text/plain' -d 'Do you know that when it is 7AM in NYC, it is 1PM in Paris'
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "tagsCount":7,
  "tags":[[
      "startOffset",0,
      "endOffset",2,
      "ids",["3021682",
        "130531"]],
    [
      "startOffset",25,
      "endOffset",27,
      "ids",["3012652",
        "556131"]],
    [
      "startOffset",32,
      "endOffset",34,
      "ids",["1610571",
        "2016412"]],
    [
      "startOffset",35,
      "endOffset",38,
      "ids",["5128581"]],
    [
      "startOffset",43,
      "endOffset",45,
      "ids",["3012652",
        "556131"]],
    [
      "startOffset",50,
      "endOffset",52,
      "ids",["1610571",
        "2016412"]],
    [
      "startOffset",53,
      "endOffset",58,
      "ids",["2988507",
        "6942553",
        "1495561",
        "3703358",
        "4225346",
        "4303602",
        "4432542",
        "4647963",
        "4717560",
        "4125402",
        "4246659",
        "4402452",
        "4519642",
        "4974617",
        "5170013",
        "5226250",
        "5603240",
        "966166"]]],
  "response":{"numFound":25,"start":0,"docs":[
      {
        "id":"2988507",
        "name":["Paris"],
        "countrycode":["FR"]},
      {
        "id":"3012652",
        "name":["Is-sur-Tille"],
        "countrycode":["FR"]},
      {
        "id":"3021682",
        "name":["Daux"],
        "countrycode":["FR"]},
      {
        "id":"6942553",
        "name":["Paris"],
        "countrycode":["CA"]},
      {
        "id":"130531",
        "name":["Jahrom"],
        "countrycode":["IR"]},
      {
        "id":"556131",
        "name":["Is"],
        "countrycode":["RU"]},
      {
        "id":"1495561",
        "name":["Parizh"],
        "countrycode":["RU"]},
      {
        "id":"3703358",
        "name":["París"],
        "countrycode":["PA"]},
      {
        "id":"1610571",
        "name":["In Buri"],
        "countrycode":["TH"]},
      {
        "id":"4225346",
        "name":["Swainsboro"],
        "countrycode":["US"]},
      {
        "id":"4303602",
        "name":["Paris"],
        "countrycode":["US"]},
      {
        "id":"4432542",
        "name":["Kosciusko"],
        "countrycode":["US"]},
      {
        "id":"4647963",
        "name":["Paris"],
        "countrycode":["US"]},
      {
        "id":"4717560",
        "name":["Paris"],
        "countrycode":["US"]},
      {
        "id":"2016412",
        "name":["Smidovich"],
        "countrycode":["RU"]},
      {
        "id":"4125402",
        "name":["Paris"],
        "countrycode":["US"]},
      {
        "id":"4246659",
        "name":["Paris"],
        "countrycode":["US"]},
      {
        "id":"4402452",
        "name":["Paris"],
        "countrycode":["US"]},
      {
        "id":"4519642",
        "name":["New Paris"],
        "countrycode":["US"]},
      {
        "id":"4974617",
        "name":["Paris"],
        "countrycode":["US"]},
      {
        "id":"5128581",
        "name":["New York City",
          "NYC"],
        "countrycode":["US"]},
      {
        "id":"5170013",
        "name":["Saint Paris"],
        "countrycode":["US"]},
      {
        "id":"5226250",
        "name":["Beresford"],
        "countrycode":["US"]},
      {
        "id":"5603240",
        "name":["Paris"],
        "countrycode":["US"]},
      {
        "id":"966166",
        "name":["Parys"],
        "countrycode":["ZA"]}]
  }}

Here it gets a bit messy. Let’s have a look at the tags section, which describes which parts of the text were detected as entities and lists the entity ids associated with each.

"tags":[[
"startOffset",0,  // Matches "Do"
"endOffset",2,
"ids",["3021682",
"130531"]],
[
"startOffset",25, // Matches "is"
"endOffset",27,
"ids",["3012652",
"556131"]],
[
"startOffset",32, // Matches "in"
"endOffset",34,
"ids",["1610571",
"2016412"]],
[
"startOffset",35, // Matches "NYC"
"endOffset",38,
"ids",["5128581"]],
[
"startOffset",43, // Matches "is"
"endOffset",45,
"ids",["3012652",
"556131"]],
[
"startOffset",50, // Matches "in"
"endOffset",52,
"ids",["1610571",
"2016412"]],
[
"startOffset",53, // Matches "Paris"
"endOffset",58,
"ids",["2988507",
"6942553",
"1495561",
"3703358",
"4225346",
"4303602",
"4432542",
"4647963",
"4717560",
"4125402",
"4246659",
"4402452",
"4519642",
"4974617",
"5170013",
"5226250",
"5603240",
"966166"]]],

I added the text that was matched for each tag as “comments”. You can relate the ids of each tag to the ids in the response section of the full answer displayed above to see the details of the matched entities.

You can see here that there are cities whose names (or alternative names, if you dig into the collection to check) actually are “Do”, “is” and “in”. The tagger matches on text and does not care about semantics, so it identified those as entities. If you want a more robust detection, you may have to add some semantic analysis of the text into the mix to refine the results provided by the tagger, and/or be very careful about the text you add as entities. That is one limitation of the tagger.

The second challenge exposed here is how the tagger handles duplicates: it simply returns every entity that matches the text. The example here is Paris. There are multiple cities around the world called Paris, and when presented with the text “Paris” the tagger marks it as representing all of them. If you want a more precise detection, it is again up to you to filter the entities found, based on text analysis or on a known context that reduces the possibilities.

In short, the tagger does not provide any kind of context awareness or semantic analysis: it matches entities based on text only, and it is up to us to build filtering pipelines on top of it to get clean results if we need them. To do that, text analysis tools such as a part-of-speech tagger that identifies nouns, verbs, adverbs and so on may be very useful.
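
To illustrate, here is a minimal sketch of such a filtering step, assuming the Python requests library and spaCy (v3) with its small English model (python -m spacy download en_core_web_sm). It calls the tagger as before and only keeps the tags whose matched span contains a proper noun:

import requests
import spacy  # assumes spaCy v3 and its small English model are installed

nlp = spacy.load("en_core_web_sm")

text = "Do you know that when it is 7AM in NYC, it is 1PM in Paris"

# Call the tagger exactly as in the examples above.
resp = requests.post(
    "http://localhost:8983/solr/geonames/tag",
    params={"overlaps": "NO_SUB", "tagsLimit": "5000", "fl": "id,name,countrycode", "wt": "json"},
    headers={"Content-Type": "text/plain"},
    data=text.encode("utf-8"),
).json()

doc = nlp(text)
kept = []
for tag in resp["tags"]:
    # Each tag is a flat list: ["startOffset", s, "endOffset", e, "ids", [...]]
    fields = dict(zip(tag[0::2], tag[1::2]))
    span = doc.char_span(fields["startOffset"], fields["endOffset"], alignment_mode="expand")
    # Keep the tag only if its matched span contains a proper noun.
    if span is not None and any(token.pos_ == "PROPN" for token in span):
        kept.append(fields)

print(kept)  # ideally only the "NYC" and "Paris" tags survive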

The tagger provides one simple filter to manage overlapping entity detections, through the ‘overlaps’ parameter described here: https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html#tagger-parameters. It has three possible values:

  • ALL: emit all overlapping entities (no filtering)
  • NO_SUB: do not emit an entity that is completely contained within another detected entity
  • LONGEST_DOMINANT_RIGHT: given a cluster of overlapping entities, emit the longest one (by character length); if there is a tie, pick the right-most. Remove any entities overlapping with this entity, then repeat the algorithm to potentially find other entities that can be emitted in the cluster.

Tagging Documents During Indexation

In order to use tags at search time, you need to tag the documents at indexation time. The idea is to extract the entities and add them to a dedicated field of each document. Say we want to index articles and extract the cities mentioned in each article, while also being able to search on the text and title of the articles. We can create a new collection:

solr create -c articles

And add a simple schema to it:

curl -X POST -H 'Content-type:application/json'  http://localhost:8983/solr/articles/schema -d '{
  "add-field-type":{
    "name":"tag",
    "class":"solr.TextField",
    "postingsFormat":"FST50",
    "omitNorms":true,
    "omitTermFreqAndPositions":true,
    "indexAnalyzer":{
      "tokenizer":{
         "class":"solr.StandardTokenizerFactory" },
      "filters":[
        {"class":"solr.EnglishPossessiveFilterFactory"},
        {"class":"solr.ASCIIFoldingFilterFactory"},
        {"class":"solr.LowerCaseFilterFactory"},
        {"class":"solr.ConcatenateGraphFilterFactory", "preservePositionIncrements":false }
      ]},
    "queryAnalyzer":{
      "tokenizer":{
         "class":"solr.StandardTokenizerFactory" },
      "filters":[
        {"class":"solr.EnglishPossessiveFilterFactory"},
        {"class":"solr.ASCIIFoldingFilterFactory"},
        {"class":"solr.LowerCaseFilterFactory"}
      ]}
    },

  "add-field":{"name":"title", "type":"text_general"},

  "add-field":{"name":"text", "type":"text_general"},

  "add-field":{"name":"tags", "type": "plongs", "multiValued":true}
}'

Note that we added a tags field that will hold the ids of the entities detected in the text of each article.

Now, once you have built your document and before sending it to Solr for indexation, first call the tagger handler, providing the text of the article you are about to index as the payload. Using curl it looks like this, but you will most likely use whatever language you have chosen to build your indexation pipeline:

curl -X POST \
  'http://localhost:8983/solr/geonames/tag?overlaps=NO_SUB&tagsLimit=5000&fl=id&wt=json&indent=on' \
  -H 'Content-Type:text/plain' -d "$ARTICLE_TEXT"

Only the id field is important here: copy each id returned by the tagger into the tags field of the document you are about to send to Solr for indexation, which will then look something like this:

[{
"title": "My first article",
"text": "An article talking about the differences between NYC, Paris and Rome",
"tags": [13452,12576,12345]
}]
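
Putting it together, here is a minimal sketch of that indexation step, assuming the Python requests library and the two collections created above: it tags the article text, collects the returned ids, and indexes the article with those ids in its tags field.

import requests

SOLR = "http://localhost:8983/solr"

def extract_city_ids(text):
    # Ask the tagger for the ids of the cities mentioned in the text.
    resp = requests.post(
        SOLR + "/geonames/tag",
        params={"overlaps": "NO_SUB", "tagsLimit": "5000", "fl": "id", "wt": "json"},
        headers={"Content-Type": "text/plain"},
        data=text.encode("utf-8"),
    ).json()
    return [doc["id"] for doc in resp["response"]["docs"]]

article = {
    "title": "My first article",
    "text": "An article talking about the differences between NYC, Paris and Rome",
}
article["tags"] = extract_city_ids(article["text"])

# Index the article, tags included, into the articles collection.
requests.post(SOLR + "/articles/update", params={"commit": "true"}, json=[article]).raise_for_status()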

This completes the indexation part.

Using the Tags at Search Time

There are several scenarios here; you can use the tags:

  • As facets
  • In an auto-completion module
  • To display more information in the result page
  • etc.

To use the tags as facets or to display information, take the ids contained in the tags field and perform searches on the geonames collection to retrieve the tag names (for facets you will probably use only the first name) and whatever other information you want to display.
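
As a minimal sketch of the facet scenario (assuming the Python requests library and the collections built above), you can facet the article search on the tags field and then resolve the returned ids against the geonames collection for display:

import requests

SOLR = "http://localhost:8983/solr"

# Facet the article search on the "tags" field.
resp = requests.get(
    SOLR + "/articles/select",
    params={"q": "text:article", "facet": "true", "facet.field": "tags", "wt": "json"},
).json()

# Solr returns facet values as a flat list: [value1, count1, value2, count2, ...]
flat = resp["facet_counts"]["facet_fields"]["tags"]
ids, counts = flat[0::2], flat[1::2]

if ids:
    # Resolve each geonames id to a display name (the first value of "name").
    lookup = requests.get(
        SOLR + "/geonames/select",
        params={
            "q": "id:(" + " OR ".join(str(i) for i in ids) + ")",
            "fl": "id,name",
            "rows": len(ids),
            "wt": "json",
        },
    ).json()
    names = {doc["id"]: doc["name"][0] for doc in lookup["response"]["docs"]}
    for city_id, count in zip(ids, counts):
        print(names.get(str(city_id), str(city_id)), count)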

For auto-completion, the idea is to perform a search or a tag request (with the query being typed used as the document) against the geonames collection. The results are proposed to the user to enrich their request, which makes a request with tags:some_id under the hood.
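
Here is a minimal sketch of that auto-completion flow, again assuming the Python requests library: the text being typed is sent to the tag handler of the geonames collection, the matching entities are proposed to the user, and the selected entity id becomes a filter query on the articles collection.

import requests

SOLR = "http://localhost:8983/solr"

def suggest_entities(partial_query):
    # Tag the text the user is typing against the geonames collection.
    resp = requests.post(
        SOLR + "/geonames/tag",
        params={"overlaps": "NO_SUB", "fl": "id,name,countrycode", "wt": "json"},
        headers={"Content-Type": "text/plain"},
        data=partial_query.encode("utf-8"),
    ).json()
    return resp["response"]["docs"]

def search_articles(user_query, entity_id):
    # Filter the article search on the selected entity id.
    return requests.get(
        SOLR + "/articles/select",
        params={"q": user_query, "fq": "tags:" + str(entity_id), "wt": "json"},
    ).json()["response"]["docs"]

for entity in suggest_entities("restaurants in NYC"):
    print(entity["id"], entity["name"][0], entity["countrycode"][0])

# If the user picks New York City (id 5128581), the underlying request becomes:
print(search_articles("restaurants", 5128581))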

Final Notes

The ideas presented here are not things that you can use immediately to get a production environment up and running. But they give you some hints on how the tagger handler can be used to enhance the search experience of your users if you ever need it. At France Labs, we are in the process of implementing these elements into Datafari, aiming at providing a solution to quickly identify named entities defined by our clients, such as project names, part numbers, person names, etc.