How Enterprise Search can help you with GDPR compliance

Datafari, as an Enterprise Search solution, has overall visibility into all of an organization's knowledge bases. As such, it is a good entry point for checking where PII (Personally Identifiable Information) is stored.

Indeed, as part of the GDPR requirements, any organization must maintain a list of where PII is stored. But once the knowledge base grows beyond a certain size, maintaining such a list manually becomes impossible. Distributing this task across the different departments of the organization is a good start, but it has its limits, for instance because colleagues may misinterpret what counts as PII.


Bye bye DIH – Hello Datafari

Easily replacing DIH with ManifoldCF using Datafari

So you were using DIH with your Solr, and you are worried that it may no longer be actively maintained? And you are struggling to find a replacement or an alternative? We propose here a replacement that relies on Apache ManifoldCF and Datafari, two projects that have been actively maintained and updated for several years now.

Datafari is an open source Enterprise Search solution that, among other things, embeds Apache ManifoldCF and Apache Solr. By installing it, you are just a few scripts away from having a fully functional DB crawler that fetches the data and sends it to Apache Solr, which is exactly what DIH was doing! As a bonus, ManifoldCF can do much more: it offers plenty of connectors for different sources, as well as a graphical interface to configure your crawling (SLAs, time windows, data processing…).

So hop in, and take a look at our DIH replacement tutorial on the Datafari wiki.
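To make the "fetch from a database, send to Solr" idea concrete, here is a minimal conceptual sketch in Python. It is not how ManifoldCF or Datafari implement crawling (there, the job is configured rather than coded); it only illustrates the kind of work DIH was doing. The table, the column names, the `id` unique key and the helper functions are all assumptions made up for the example, and an in-memory SQLite database stands in for the real source.

```python
import json
import sqlite3


def rows_to_solr_docs(connection, query, id_column):
    """Run a SQL query and turn each row into a Solr-style document (a dict)."""
    connection.row_factory = sqlite3.Row
    docs = []
    for row in connection.execute(query):
        doc = dict(row)
        # Assumption: the Solr schema uses "id" as its unique key field.
        doc["id"] = str(doc.pop(id_column))
        docs.append(doc)
    return docs


def build_update_payload(docs):
    """Serialize the documents as a JSON array, a format Solr's /update handler accepts."""
    return json.dumps(docs)


# Demo with an in-memory database standing in for the real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (article_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [(1, "Bye bye DIH"), (2, "Hello Datafari")],
)
docs = rows_to_solr_docs(
    conn,
    "SELECT article_id, title FROM articles ORDER BY article_id",
    "article_id",
)
payload = build_update_payload(docs)
print(payload)
```

The payload would then be posted to the update endpoint of your collection. A real crawler also has to deal with scheduling, incremental updates and deletions, which is precisely what a dedicated tool handles for you.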

Entity Extraction Using the Tagger Handler (aka SolrTextTagger)

With its release 7.4, the Solr team integrated SolrTextTagger into the core of Solr. This tool, which had been maintained separately for years (https://github.com/OpenSextant/SolrTextTagger), is now packaged with Solr and ready to use through a dedicated handler. In this post we will first walk you through the configuration steps needed to set it up. These are presented in Solr’s documentation (https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html), but we repeat them here for the sake of completeness. We will then present ideas on how to use it in your indexing and search pipeline to enhance the users' search experience.

How does the tagger work?

The tagger handler relies on a dedicated collection in which it stores the entities to be extracted. In this collection, one field is used to store the texts used to recognize each entity, and you may create as many other fields as you want to store other useful information about your entities.
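To illustrate the idea, here is a toy dictionary tagger in plain Python. It is only a sketch of the concept: it does not call Solr and does not reflect how the tagger handler is implemented internally. The entity records, the `name` and `type` field names and the `tag` function are assumptions chosen for the example. The `name` key plays the role of the field holding the text used to recognize an entity, while the other keys stand for the extra information you may store.

```python
import re

# Hypothetical entity records: "name" is the text to recognize,
# the other keys are extra information stored alongside each entity.
ENTITIES = [
    {"id": "1", "name": "New York", "type": "city"},
    {"id": "2", "name": "New York City", "type": "city"},
    {"id": "3", "name": "Paris", "type": "city"},
]


def tag(text, entities):
    """Return non-overlapping matches, preferring the earliest, then the longest."""
    candidates = []
    for entity in entities:
        pattern = r"\b" + re.escape(entity["name"]) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            candidates.append((match.start(), match.end(), entity))
    # Sort by start offset, then by decreasing length.
    candidates.sort(key=lambda c: (c[0], -(c[1] - c[0])))
    tags, last_end = [], 0
    for start, end, entity in candidates:
        # Skip any candidate overlapping a match that was already kept.
        if start >= last_end:
            tags.append({"startOffset": start, "endOffset": end, "id": entity["id"]})
            last_end = end
    return tags


result = tag("Hello New York City, greetings from Paris", ENTITIES)
print(result)
```

Here "New York City" wins over the shorter "New York" because both start at the same position and the longer match is preferred. The returned ids can then be used to look up the other fields of the matched entities.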


Entity Extraction in Datafari

In this tutorial, we will demonstrate how to do basic entity extraction in Datafari Community. This post is inspired by https://lucidworks.com/2013/06/27/poor-mans-entity-extraction-with-solr/

Note that for Datafari Enterprise, all the configuration is already done. You just need to add your custom rules in a dedicated UI. For more advanced functionality, Datafari Enterprise lets you benefit from SolrTextTagger and 3rd party semantic entity extractors.

We want to extract 3 entities from our dataset (files from the Enron dataset in this example):

  • Persons
  • Phone numbers
  • Whether the document is a resume
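As a rough idea of what rule-based extraction of these three entities can look like, here is a small Python sketch. It is not the Datafari configuration itself, only an illustration of the principle. All the patterns, the list of first names, the resume keywords and the `extract` function are assumptions made up for the example, and real data would need more careful rules.

```python
import re

# Assumed rules for illustration only.
FIRST_NAMES = ("Jane", "John", "Jeff")  # hypothetical gazetteer of first names
PERSON_RE = re.compile(r"\b(?:%s) [A-Z][a-z]+\b" % "|".join(FIRST_NAMES))
# US-style numbers such as (713) 555-0142 or 713-555-0199.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")
RESUME_RE = re.compile(
    r"\b(?:resume|curriculum vitae|work experience)\b", re.IGNORECASE
)


def extract(text):
    """Return the persons, phone numbers and resume flag found in a text."""
    return {
        "persons": PERSON_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "is_resume": bool(RESUME_RE.search(text)),
    }


resume = extract("RESUME\nJane Doe\nPhone: (713) 555-0142\nWork experience: trading")
email = extract("Please call John Smith at 713-555-0199.")
print(resume)
print(email)
```

Each extracted value would typically end up in its own field of the indexed document, so that it can be used for faceting or filtering at search time.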


Datafari on Docker

NOTE: This is the English version. For the French version, please scroll down.

UPDATE 08/08/16: post updated for Datafari v3

UPDATE 01/04/16: beware of a bug in Docker Toolbox 1.9.1 that affects Cassandra (which is a component of Datafari). Update your Docker to 1.10+.
https://github.com/docker/docker/issues/18180

This time, we’ll talk about the release of Datafari on Docker.

If you don’t know it yet, Docker is a containerization mechanism that relies on low-level features of the Linux kernel, which makes it faster than widespread virtualization technologies such as VMware. As its name suggests, you can “dock” applications in an isolated manner, and each one runs like a standalone system on your OS.

Although we recommend installing Datafari on a dedicated system when it is used in a production environment, using Datafari on Docker allows you to quickly install it without impacting the configuration and packages already in place on your system. Just download the Docker image, and Docker takes care of the rest.
