Bye bye DIH – Hello Datafari

Posted on 20 July 2022 by admin

Replacing DIH with ManifoldCF easily with Datafari

So you were using DIH (the Data Import Handler) with your Solr, and you are worried that it may no longer be actively maintained? And you are struggling to find a replacement or an alternative? We propose here a replacement that relies on Apache ManifoldCF and Datafari, projects that have been actively maintained and updated for several years now.

Datafari is an open source Enterprise Search solution that, among other things, embeds Apache ManifoldCF and Apache Solr. As such, by installing it you are just a few scripts away from having a fully functional DB crawler that fetches the data and sends it to Apache Solr, which is exactly what DIH was doing! As a bonus, ManifoldCF can do much more, as it offers plenty of connectors for different sources, and a graphical interface to configure your crawling (SLAs, time windows, data processing…).

So hop in, and take a look at our DIH replacement tutorial on the Datafari wiki.

Entity Extraction Using the Tagger Handler (aka SolrTextTagger)

Posted on 15 January 2019 by Romaric Pighetti

With release 7.4, the Solr team integrated SolrTextTagger into the core of Solr. This tool, which had been maintained separately for years (https://github.com/OpenSextant/SolrTextTagger), is now packaged with Solr and ready to use through a dedicated handler. In this post we will first walk you through the configuration steps needed to set it up. Those are presented in Solr’s documentation (https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html), but we will repeat them here for the sake of completeness. We will then present ideas on how to use it in your indexing and search pipeline to enhance the search experience of your users.

How does the tagger work?

The tagger handler relies on a dedicated collection in which it stores the entities to be extracted. In this collection, one field is used to store the texts used to recognize each entity, and you may create as many other fields as you want to store other useful information about your entities.
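To make the idea concrete, here is a minimal sketch of how a client could prepare a call to the tagger handler. It assumes the handler is registered under `/tag` (the path is whatever you declare in solrconfig.xml), that the entity collection is called `entities`, and that it has `id` and `name` fields; all three names are hypothetical. The request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_tag_request(text, solr_url="http://localhost:8983/solr",
                      collection="entities", fields=("id", "name")):
    """Build (but do not send) a POST request for a tagger handler.

    Assumptions: the handler path is /tag, and the collection and field
    names are placeholders to adapt to your own setup.
    """
    params = urlencode({
        "overlaps": "NO_SUB",     # drop tags that are contained in a longer tag
        "tagsLimit": 5000,        # upper bound on the number of returned tags
        "fl": ",".join(fields),   # fields to return for each matched entity
        "wt": "json",
    })
    url = f"{solr_url}/{collection}/tag?{params}"
    # The text to tag is sent as the plain-text body of the request.
    return Request(url, data=text.encode("utf-8"),
                   headers={"Content-Type": "text/plain"}, method="POST")


req = build_tag_request("Hello New York City")
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it to a running Solr instance.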

Continue reading →

Entity Extraction in Datafari

Posted on 25 June 2018 by admin

In this tutorial, we will demonstrate how to do basic entity extraction in Datafari Community. This post is inspired by https://lucidworks.com/2013/06/27/poor-mans-entity-extraction-with-solr/

Note that for Datafari Enterprise, all the configuration is already done. You just need to add your custom rules in a specific UI, and for further advanced functionalities, Datafari Enterprise allows you to benefit from SolrTextTagger and 3rd party semantic entity extractors.

We want to extract 3 entities from our dataset (files from the Enron dataset in this example):

Persons
Phone numbers
Whether the document is a resume
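As a rough illustration of what rule-based extraction of these three entities can look like, here is a small standalone Python sketch. It is not the Datafari or Solr configuration used in the tutorial: the patterns, the keyword list, and the function names are all assumptions chosen for the example, and real rules would need to be much more robust.

```python
import re

# Hypothetical rule: a civility title followed by one or two capitalized words.
PERSON_RE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.? [A-Z][a-z]+(?: [A-Z][a-z]+)?")

# Hypothetical rule: loose North American phone format,
# e.g. 713-853-1234 or (713) 853-1234.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")

# Hypothetical rule: a document is flagged as a resume when it contains
# at least `threshold` of these typical section names.
RESUME_MARKERS = ("curriculum vitae", "work experience", "education",
                  "skills", "references")


def looks_like_resume(text, threshold=2):
    lowered = text.lower()
    hits = sum(1 for marker in RESUME_MARKERS if marker in lowered)
    return hits >= threshold


def extract_entities(text):
    """Return the three entity types for one document."""
    return {
        "persons": PERSON_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "is_resume": looks_like_resume(text),
    }


doc = "Mr. Kenneth Lay can be reached at (713) 853-1234 or 713.853.5678."
print(extract_entities(doc))
```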

Continue reading →

How to upgrade a SolrCloud cluster – Tutorial

Posted on 12 February 2018 by admin

Let’s say that we have a SolrCloud cluster running Solr 4.X. Now we want to upgrade the cluster to a more modern version such as Solr 6.X. How can we do it?

Well, there are many ways to do it. The cleanest is to install the new version of Solr directly, adapt the configuration files, and reindex all the data. But in production, that is often not acceptable.

In this tutorial, we will upgrade in two steps: from Solr 4 to Solr 5, and then from Solr 5 to Solr 6. It is not possible to upgrade directly from Solr 4 to Solr 6, i.e. across two major versions, because the index format changes and Solr can only read an index format from the previous major version.
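The one-major-version-at-a-time constraint can be expressed as a tiny helper that lists the intermediate versions to go through. This is only an illustration of the reasoning above; the function name is hypothetical and it does not perform any upgrade.

```python
def upgrade_path(current_major, target_major):
    """List the major versions to install, in order, when each version
    can only read the index written by the previous major version."""
    if target_major < current_major:
        raise ValueError("downgrades are not covered by this sketch")
    # One hop per major version: 4 -> 5 -> 6, never 4 -> 6 directly.
    return list(range(current_major + 1, target_major + 1))


print(upgrade_path(4, 6))
```

For the case described here, the helper returns the two steps of the tutorial: first Solr 5, then Solr 6.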

Continue reading →

Tutorial – Deploying Solrcloud 7 on Amazon EC2

Posted on 2 January 2018 by admin

UPDATE: This tutorial is based on Solr 7. If you want to use Solr 8, we strongly recommend using our recent blog entry to set up SolrCloud 8 on Amazon EC2.

In this tutorial, we will be setting up a SolrCloud cluster on Amazon EC2.
We’ll be using Solr 7.1 and ZooKeeper 3.4.10 on Debian 9 instances.
This tutorial explains step by step how to reach this objective.

We will be installing a set of 3 machines, with 3 shards per server, which gives us a total of 9 shard replicas (cores). The replication factor is 3.
We will also be installing a ZooKeeper ensemble of 3 machines.

This architecture will be flexible enough to allow for a fail-over of one or two machines, depending on whether we are at the indexing phase or at the querying phase:

Indexing: one machine can fail without impacting the cluster (the ZooKeeper ensemble of 3 machines tolerates one machine down). The updates are still successfully broadcast to the machines that remain running.
Querying: two machines can fail without impacting the cluster. Since each machine hosts 3 shards, a search query can still be processed without problems, the only constraint being a slower response time due to the higher load on the remaining machine.
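The arithmetic behind these two cases can be checked with a few lines of Python. This sketch assumes that ZooKeeper needs a strict majority of its ensemble to stay up, and that each machine holds one replica of every shard (3 logical shards, replication factor 3); the function names are hypothetical.

```python
def zookeeper_tolerated_failures(ensemble_size):
    """ZooKeeper keeps working as long as a strict majority is alive."""
    quorum = ensemble_size // 2 + 1
    return ensemble_size - quorum


def total_cores(num_shards, replication_factor):
    """Number of shard replicas (cores) spread over the cluster."""
    return num_shards * replication_factor


def query_tolerated_failures(replication_factor):
    """Assumes the replicas of a shard live on distinct machines, so one
    surviving machine still holds a copy of every shard."""
    return replication_factor - 1


print(zookeeper_tolerated_failures(3))  # indexing case
print(total_cores(3, 3))                # cores in the cluster
print(query_tolerated_failures(3))      # querying case
```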

Continue reading →

Tutorial – Deploying Solrcloud 6 on Amazon EC2

Posted on 12 June 2017 by admin

UPDATE: This tutorial is based on Solr 6. If you want to use Solr 8, we strongly recommend using our recent blog entry to set up SolrCloud 8 on Amazon EC2.

In this tutorial, we will be setting up a SolrCloud cluster on Amazon EC2.
We’ll be using Solr 6.6.0 and ZooKeeper 3.4.6 on Debian 8 instances.
This tutorial explains step by step how to reach this objective.

We will be installing a set of 3 machines, with 3 shards per server, which gives us a total of 9 shard replicas (cores). The replication factor is 3.
We will also be installing a ZooKeeper ensemble of 3 machines.

This architecture will be flexible enough to allow for a fail-over of one or two machines, depending on whether we are at the indexing phase or at the querying phase:

Indexing: one machine can fail without impacting the cluster (the ZooKeeper ensemble of 3 machines tolerates one machine down). The updates are still successfully broadcast to the machines that remain running.
Querying: two machines can fail without impacting the cluster. Since each machine hosts 3 shards, a search query can still be processed without problems, the only constraint being a slower response time due to the higher load on the remaining machine.
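As an illustration, a layout like this one is typically requested through the Solr Collections API. The sketch below only builds the URL; it reads the architecture above as 3 logical shards, each replicated 3 times, which is an assumption, and the host and collection name are placeholders.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards,
                          replication_factor, max_shards_per_node):
    """Build a Collections API CREATE URL (the request is not sent)."""
    params = urlencode({
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        # Needed when several replicas end up on the same node.
        "maxShardsPerNode": max_shards_per_node,
    })
    return f"{base_url}/admin/collections?{params}"


# Placeholder host and collection name, to adapt to your own cluster.
url = create_collection_url("http://localhost:8983/solr", "my_collection", 3, 3, 3)
print(url)
```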

Continue reading →

Enterprise Search Europe in London – Open source focus

Posted on 1 July 2015 by admin

NOTE: this post has a French version at the bottom of this page.

Enterprise Search Europe is the largest European event dedicated to Enterprise Search. Looking at this year’s agenda, I have the feeling that open source will get particular attention. As in recent years, several case studies are dedicated to open source, but in addition, the keynote will focus on it. Charlie Hull, CEO and cofounder of Flax and an expert in open source enterprise search, will be sharing his thoughts on the future of search and the link between search and big data. Other open source tracks include a migration from Exalead to Apache Solr (the talk will be given by France Labs, yeeepieeeee), and a round table on open source implementation. You can find more details on the ESEU 2015 programme page.

Continue reading →

Mailing list Solr FR

Posted on 23 February 2015 by admin

NOTE: The original post is in French, with an English version further down the page.

We have created a French-speaking Solr mailing list, so that developers who feel more comfortable in French than in English can discuss Solr in the language of Molière. Come and join us on the French Solr mailing list!

Continue reading →

Schemaless Solr

Posted on 28 September 2014 by admin

NOTE: French version at the bottom of this page.

We often read on the web that Elasticsearch is really cool because it is schemaless, and that Solr is not. Although Elasticsearch is cool for many reasons, we want to remind you that Solr has also been schemaless since July 2013 (Solr 4.4).

As a reminder of what schemaless means: without any manual editing of the Solr schema, Solr can automatically recognize some data types when it receives data to be indexed. Those types are: Boolean, Integer, Long, Float, Double, and Date.
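To give an intuition of this kind of type guessing, here is a small Python sketch that tries the candidate types in turn and falls back to a string. It is only an illustration, not Solr's implementation: the date format, the 32-bit boundary between Integer and Long, and the choice to report every decimal number as Double are assumptions made for the example, and Python's own parsers are more lenient than a real schema guesser would be.

```python
from datetime import datetime


def guess_field_type(value):
    """Guess a field type from a raw string value, trying the most
    specific types first and falling back to String."""
    if value.lower() in ("true", "false"):
        return "Boolean"
    try:
        number = int(value)
        # Assumption: values fitting in 32 bits are reported as Integer.
        return "Integer" if -2**31 <= number < 2**31 else "Long"
    except ValueError:
        pass
    try:
        float(value)
        # Assumption: Float and Double are not distinguished here.
        return "Double"
    except ValueError:
        pass
    try:
        # Assumption: only this single ISO 8601 style format is recognized.
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
        return "Date"
    except ValueError:
        pass
    return "String"


for raw in ("true", "42", "3000000000", "3.14", "2014-09-28T10:00:00Z", "hello"):
    print(raw, "->", guess_field_type(raw))
```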

That’s pretty convenient for quick prototyping. Still, as for Elasticsearch, …

Continue reading →

France Labs Enterprise Search Blog

blog on Enterprise Search, Solr, Datafari, ManifoldCF

Category Archives: Solr
