Using Datafari to extract text for academic research on NLU and NLP

Extracting raw text to do Natural Language Understanding (NLU) or Natural Language Processing (NLP) is often a boring and time consuming task. Any student or researcher that has already had to prepare a pipeline for that knows what we are talking about. First, assess available open source technologies (very often Apache Tika), then understand how it works, put documents in a folder and make it work with trial and errors, probably through a python script.

This is what we had in mind when preparing a documentation on how to use Datafari Community Edition just for that. After all, Datafari is an enterprise search solution, which means it encompasses these tasks as part of its overall mission to index documents and allow to search through them.

With the documentation we provide, researchers will be able to have a fully operational pipeline that will look in a specific shared folder, extract the text (via Apache Tika), and ouput it in a dedicated folder. And with a bit more motivation, researchers can go beyond and use other connectors than the fileshare, as the pipeline can work with any data source.

Discover now how to extract text from any document thanks to Datafari.

Datafari on Docker

NOTE: This is the English version. For the French version, please scroll down.

UPDATE 08/08/16 : update of the post for Datafari v3

UPDATE 01/04/16 : beware that there is a bug with Docker toolbox 1.9.1 for the use of Cassandra (which is a component of Datafari). Update your Docker to 1.10+
https://github.com/docker/docker/issues/18180

This time, we’ll talk about the release of Datafari on Docker.

If you don’t know it yet, Docker is an emulation mechanism that works at a low level of the Linux kernel, hence making it faster than widespread technologies of virtualisation such as VMWare. As its name suggests, you can “dock” applications in an isolated manner, and it will work as a standalone system on your OS.

Although we recommend installing Datafari alone on systems when used in a productive environment, using Datafari on Docker allows you to quickly install Datafari without impacting the configuration and packages in place in your system. Just download the docker image, and the remainder is being taken care of by Docker.

Continue reading

Integrating Solr with SPIP

Note: French version available at the second half of this blog entry.

Note Note: don’t hesitate to test our new open source package solution Datafari, which combines Apache ManifoldCF, Apache Solr and AjaxFranceLabs 🙂

SPIP is a well known open source platform. We wanted to share with you how to integrate graphically a Solr server with a SPIP server. The scenario is the following: you already have SPIP based web site, and you want to have a nice search functionnality based on the lastet Solr, to benefit from all its cool functionalities. You have set up a Solr, you have crawled your SPIP content, but now you want to have your Solr search in your SPIP website. This is what we present in this tutorial.

Continue reading