How Enterprise Search can help you with GDPR compliance

Datafari, as an Enterprise Search solution, has overall visibility over all the knowledge bases of an organization. As such, it is a good entry point for checking where PII (Personally Identifiable Information) is stored.

Indeed, as part of the GDPR requirements, any organization must maintain a record of where PII is stored. But as soon as the knowledge base grows too large, it becomes impossible to maintain such a record manually. Distributing this task across the different departments of the organization is a good start, but it has its limits, for instance because colleagues may misinterpret what counts as PII.

Continue reading

Configure a CAS server and CAS management webapp with Docker

Setting up a CAS server on Docker is not a smooth process, and the official documentation is not very explicit about it.

We decided to write a complete tutorial on the subject, to help others quickly configure a CAS server.

Warning: this post is about deploying a TESTING CAS server. This configuration is not meant for production, in particular because it authorizes any application!

First, we would like to mention the following articles, which served as a very good basis for our work:

https://fawnoos.com/2022/05/31/cas65x-docker-deployment/
https://fawnoos.com/2021/02/04/cas63-management-webapp/

We were previously using the demo CAS server available here: https://casserver.herokuapp.com/cas but for some time now it has refused unauthorized applications, so it can no longer be used with arbitrary applications. That is why we needed our own CAS server.
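To give you an idea of what a client application does once such a server is running, here is a minimal Python sketch of the CAS 3.0 ticket validation step. The base URL, service URL and the self-signed certificate workaround are assumptions matching a local testing setup, not our actual configuration.

```python
import requests
import xml.etree.ElementTree as ET

# Hypothetical endpoints for a local TESTING CAS server.
CAS_BASE = "https://localhost:8443/cas"
SERVICE = "https://myapp.example.com/login"

def validate_ticket(ticket):
    """Validate a service ticket against the CAS 3.0 serviceValidate
    endpoint and return the authenticated user name, or None."""
    resp = requests.get(
        f"{CAS_BASE}/p3/serviceValidate",
        params={"service": SERVICE, "ticket": ticket},
        verify=False,  # testing only: the server uses a self-signed certificate
    )
    ns = {"cas": "http://www.yale.edu/tp/cas"}
    root = ET.fromstring(resp.text)
    user = root.find(".//cas:user", ns)
    return user.text if user is not None else None
```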

Continue reading

Bye bye DIH – Hello Datafari

Easily replacing DIH with ManifoldCF and Datafari

So you were using DIH with your Solr installation, and you are worried that it is no longer actively maintained? And you are having trouble finding a replacement or an alternative? We propose here a replacement that relies on Apache ManifoldCF and Datafari, two projects that have been actively maintained and updated for several years now.

Datafari is an open source Enterprise Search solution that – among other things – embeds Apache ManifoldCF and Apache Solr. As such, by installing it you are just a few scripts away from having a fully functional DB crawler that fetches the data and sends it to Apache Solr, which is exactly what DIH was doing! As a bonus, ManifoldCF can do much more, as it offers plenty of connectors for different sources, and graphical capabilities to configure your crawling (SLAs, time windows, data processing…).
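To make the comparison concrete, here is a rough Python sketch of what DIH used to do for you, and what a ManifoldCF job automates: read rows from a database and push them to Solr's update endpoint. The database, table and collection names are illustrative.

```python
import sqlite3
import requests

# Illustrative stand-in for a DIH data-config: read rows from a database
# and push them to Solr's JSON update endpoint. Names are hypothetical.
SOLR_UPDATE = "http://localhost:8983/solr/mycollection/update?commit=true"

conn = sqlite3.connect("products.db")
rows = conn.execute("SELECT id, name, description FROM products")

docs = [{"id": r[0], "name": r[1], "description": r[2]} for r in rows]
resp = requests.post(SOLR_UPDATE, json=docs)  # sends the docs as a JSON array
resp.raise_for_status()
```

A ManifoldCF job replaces this kind of script with a configurable, schedulable crawl, with no code to maintain.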

So hop in, and take a look at our DIH replacement tutorial on the Datafari wiki.

Using Datafari to extract text for academic research on NLU and NLP

Extracting raw text for Natural Language Understanding (NLU) or Natural Language Processing (NLP) is often a tedious and time-consuming task. Any student or researcher who has had to prepare a pipeline for it knows what we are talking about: first, assess the available open source technologies (very often Apache Tika), then understand how they work, put documents in a folder, and make it all work by trial and error, probably through a Python script.

This is what we had in mind when preparing documentation on how to use Datafari Community Edition for just that. After all, Datafari is an enterprise search solution, which means it encompasses these tasks as part of its overall mission to index documents and let users search through them.

With the documentation we provide, researchers will be able to set up a fully operational pipeline that watches a specific shared folder, extracts the text (via Apache Tika), and outputs it into a dedicated folder. And with a bit more motivation, researchers can go further and use connectors other than the fileshare one, as the pipeline can work with any data source.
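For comparison, here is roughly what that extraction step looks like when scripted by hand against a standalone Apache Tika server (the folder names and the default Tika port are assumptions); Datafari wires the equivalent for you.

```python
import pathlib
import requests

# Assumed setup: a Tika server on its default port and illustrative folders.
TIKA_URL = "http://localhost:9998/tika"
IN_DIR = pathlib.Path("shared_folder")
OUT_DIR = pathlib.Path("extracted_text")
OUT_DIR.mkdir(exist_ok=True)

for doc in IN_DIR.iterdir():
    if not doc.is_file():
        continue
    # PUT the raw bytes and ask Tika for plain text back.
    resp = requests.put(
        TIKA_URL,
        data=doc.read_bytes(),
        headers={"Accept": "text/plain"},
    )
    resp.raise_for_status()
    (OUT_DIR / (doc.stem + ".txt")).write_text(resp.text, encoding="utf-8")
```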

Discover now how to extract text from any document thanks to Datafari.

Tutorial – Deploying Solrcloud 8 on Amazon EC2

In this tutorial, we will be setting up a Solrcloud cluster on Amazon EC2.
We’ll be using Solr 8.6.2 and Zookeeper 3.5.7 on Debian 10 instances.
This tutorial explains step by step how to reach this objective.

We will be installing a set of 3 machines hosting a collection of 3 shards with a replication factor of 3, i.e. 9 cores in total, 3 per server; the corresponding Collections API call is sketched below.
We will also be installing a Zookeeper ensemble of 3 machines.

This architecture will be resilient enough to tolerate the failure of one or two machines, depending on whether we are in the indexing phase or the querying phase:

  • Indexing: one machine can fail without impacting the cluster (the Zookeeper ensemble of 3 machines tolerates one machine being down). The updates are successfully broadcast to the machines still running.
  • Querying: two machines can fail without impacting the cluster. Since each machine hosts a replica of every shard, a search query can still be processed; the only constraint is a slower response time due to the higher load on the remaining machine.
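Once the 3 Solr nodes and the Zookeeper ensemble are up, the collection layout described above can be created through the Collections API. Here is a minimal Python sketch; the host and collection names are placeholders.

```python
import requests

# Create a 3-shard collection with replication factor 3 on a 3-node cluster.
# maxShardsPerNode allows each node to host a replica of every shard (Solr 8).
params = {
    "action": "CREATE",
    "name": "mycollection",
    "numShards": 3,
    "replicationFactor": 3,
    "maxShardsPerNode": 3,
}
resp = requests.get("http://ec2-node1:8983/solr/admin/collections", params=params)
resp.raise_for_status()
print(resp.json())
```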
Continue reading

Entity Extraction Using the Tagger Handler (aka SolrTextTagger)

With release 7.4, the Solr team integrated SolrTextTagger into the core of Solr. This tool, which had been maintained separately for years (https://github.com/OpenSextant/SolrTextTagger), is now packaged into Solr and ready to use through a dedicated handler. In this blog post, we will first step you through the configuration steps to set it up. Those are presented in Solr’s documentation (https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html), but we will repeat them here for the sake of completeness. We will then present ideas on how to use it in your indexing and search pipeline so as to enhance the search experience of your users.

How does the tagger work?

The tagger handler relies on a dedicated collection in which it stores the entities to be extracted. In this collection, one field stores the text used to recognize each entity, and you may create as many other fields as you want to hold additional information about your entities.
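As a preview of what the handler does, here is a minimal Python call to the tagger endpoint, assuming a collection named entities configured as per the Solr documentation linked above; the fields listed in fl are illustrative.

```python
import requests

# Send raw text to the tagger handler of the (hypothetical) "entities"
# collection; Solr returns the tag offsets plus the matching entity docs.
text = "Call John Smith at our Paris office next Monday."
resp = requests.post(
    "http://localhost:8983/solr/entities/tag",
    params={"overlaps": "NO_SUB", "fl": "id,name,type", "wt": "json"},
    data=text.encode("utf-8"),
    headers={"Content-Type": "text/plain; charset=UTF-8"},
)
resp.raise_for_status()
print(resp.json())
```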

Continue reading

Entity Extraction in Datafari

In this tutorial, we will demonstrate how to do basic entity extraction in Datafari Community. This post is inspired by https://lucidworks.com/2013/06/27/poor-mans-entity-extraction-with-solr/

Note that in Datafari Enterprise, all of this configuration is already done: you just need to add your custom rules in a dedicated UI. For more advanced functionalities, Datafari Enterprise also lets you benefit from SolrTextTagger and third-party semantic entity extractors.

We want to extract 3 types of entities from our dataset (files from the Enron dataset in this example) – a sketch of the approach follows the list:

  • Persons
  • Phone numbers
  • Whether the document is a resume
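As a taste before you dive in, here is a minimal Python sketch of the regex idea for the phone-number case; the pattern is an illustration, not Datafari's actual configuration.

```python
import re

# "Poor man's" entity extraction: a plain regular expression for
# US-style phone numbers such as those found in the Enron corpus.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b")

sample = "Please call Jeff at (713) 853-5660 or 281-555-0199."
print(PHONE_RE.findall(sample))  # ['(713) 853-5660', '281-555-0199']
```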

Continue reading

How to upgrade a SolrCloud cluster – Tutorial

Let’s say that we have a SolrCloud cluster running Solr 4.x. Now we want to upgrade our cluster to a modern Solr version such as Solr 6.x. How can we do it?

Well, there are many ways to do it. The cleanest one is to install the new version of Solr directly, adapt the configuration files and reindex all the data. But in production, that is often not acceptable.

In this tutorial, we will upgrade in two steps: from Solr 4 to Solr 5, and then from Solr 5 to Solr 6. It is not possible to upgrade directly from Solr 4 to Solr 6, i.e. to skip a major version, because the index format changes and Solr can only read the index format of the previous major version.
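Each hop relies on Lucene's IndexUpgrader tool, which rewrites an existing index in the current format. As an illustration, here is what one step may look like when driven from Python; the jar versions and index path are assumptions to adapt to your installation.

```python
import subprocess

# Rewrite a Solr 4 index in the Solr/Lucene 5 format. Run the equivalent
# command again with the Lucene 6 jars for the second hop (5 -> 6).
subprocess.run([
    "java",
    "-cp", "lucene-core-5.5.5.jar:lucene-backward-codecs-5.5.5.jar",
    "org.apache.lucene.index.IndexUpgrader",
    "-delete-prior-commits",
    "/var/solr/data/mycollection/data/index",  # hypothetical index path
], check=True)
```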

Continue reading

Tutorial – Deploying Solrcloud 6 on Amazon EC2

UPDATE: This tutorial is based on Solr 6. If you want to use Solr 8, we strongly recommend following our more recent blog entry on setting up Solrcloud 8 on Amazon EC2.

In this tutorial, we will be setting up a Solrcloud cluster on Amazon EC2.
We’ll be using Solr 6.6.0 and Zookeeper 3.4.6 on Debian 8 instances.
This tutorial explains step by step how to reach this objective.

We will be installing a set of 3 machines hosting a collection of 3 shards with a replication factor of 3, i.e. 9 cores in total, 3 per server.
We will also be installing a Zookeeper ensemble of 3 machines.

This architecture will be resilient enough to tolerate the failure of one or two machines, depending on whether we are in the indexing phase or the querying phase:

  • Indexing: one machine can fail without impacting the cluster (the Zookeeper ensemble of 3 machines tolerates one machine being down). The updates are successfully broadcast to the machines still running.
  • Querying: two machines can fail without impacting the cluster. Since each machine hosts a replica of every shard, a search query can still be processed; the only constraint is a slower response time due to the higher load on the remaining machine (see the sketch below for a way to check replica states).
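To verify how the cluster behaves when machines go down, the Collections API can report the state of every replica. A small Python sketch, with illustrative host and collection names:

```python
import requests

# Query CLUSTERSTATUS and print the state of each replica per shard.
resp = requests.get(
    "http://ec2-node1:8983/solr/admin/collections",
    params={"action": "CLUSTERSTATUS", "collection": "mycollection"},
)
resp.raise_for_status()
shards = resp.json()["cluster"]["collections"]["mycollection"]["shards"]
for shard, info in shards.items():
    states = [r["state"] for r in info["replicas"].values()]
    print(shard, states)  # e.g. shard1 ['active', 'active', 'down']
```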

Continue reading

Generating big data sets for search engines

NOTE: This is the English version. You will find the French version further down in this article.

When we offer our search expertise, we are often asked to run performance evaluations on large data sets, for instance in Proofs of Concept. For a recent customer request, in order to save time and avoid using sensitive customer data, we used log-synth, a random data generator developed by Ted Dunning. We describe here how to use log-synth to generate a 100,000-line data set.

The first step, which we do not document here, consists of downloading log-synth, unzipping it and building it with Maven.
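As a preview, here is a sketch of the generation step itself, driven from Python. The schema uses samplers documented in the log-synth README, and the command-line flags follow that README; adapt the binary path to your build.

```python
import json
import subprocess

# A small illustrative schema: each entry pairs a field name with a
# log-synth sampler class (see the log-synth README for the full list).
schema = [
    {"name": "id", "class": "id"},
    {"name": "user", "class": "name", "type": "first_last"},
    {"name": "joined", "class": "date", "format": "yyyy-MM-dd"},
]
with open("schema.json", "w") as f:
    json.dump(schema, f)

# Generate 100,000 lines of CSV into the "generated" directory.
subprocess.run([
    "target/log-synth",          # path produced by the maven build
    "-count", "100000",
    "-schema", "schema.json",
    "-format", "CSV",
    "-output", "generated",
], check=True)
```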

Continue reading