In a full search solution, the search engine is only one part: data retrieval is the very first step. It is conceptually close to ETL or MDM. Its aim is to fetch the data, whether it comes from multiple sources or in multiple formats, hand it over to the indexing engine, manage access security, and keep track of the data already indexed.
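
To make these four responsibilities concrete, here is a minimal sketch in Python. It is not how Apache ManifoldCF or Datafari is implemented: the `Document` and `Crawler` names, the in-memory dictionary standing in for the indexing engine, and the use of a content hash to detect changes are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    content: str
    allowed_groups: list  # hypothetical ACL: groups allowed to read the document


@dataclass
class Crawler:
    # doc_id -> content hash, used to keep track of already indexed data
    seen: dict = field(default_factory=dict)
    # in-memory stand-in for the indexing engine (an assumption of this sketch)
    index: dict = field(default_factory=dict)

    def crawl(self, source):
        """Send new or changed documents to the index; return their ids."""
        sent = []
        for doc in source:  # fetch step: iterate over whatever the source yields
            digest = hashlib.sha256(doc.content.encode("utf-8")).hexdigest()
            if self.seen.get(doc.doc_id) == digest:
                continue  # unchanged since the last crawl, skip it
            # hand the document over together with its access rights
            self.index[doc.doc_id] = {
                "content": doc.content,
                "acl": list(doc.allowed_groups),
            }
            self.seen[doc.doc_id] = digest
            sent.append(doc.doc_id)
        return sent

    def search(self, term, user_groups):
        """Return ids of matching documents the user is allowed to see."""
        return sorted(
            doc_id
            for doc_id, entry in self.index.items()
            if term in entry["content"] and set(entry["acl"]) & set(user_groups)
        )
```

A second call to `crawl` with the same documents sends nothing to the index, and `search` only returns documents whose ACL shares at least one group with the user.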

This need is strong enough to have pushed the open source community to collaborate on dedicated frameworks, supported by communities of users and developers. Among these frameworks, we can cite Aperture, Google Connector Framework and Apache ManifoldCF. Datafari uses Apache ManifoldCF to crawl data and to handle document security.

As its name indicates, Apache ManifoldCF is an Apache Software Foundation project. Originally created solely to send data to Lucene/Solr, it then evolved to become a standalone project.


You can get more information on the Apache ManifoldCF website. Apache ManifoldCF retrieves data from different information systems, provides a framework that allows new connectors to be added, and manages the retrieval of ACLs and the connection to LDAP/AD.
It is released under the Apache License 2.0, an open source licence. France Labs offers its expertise to install, configure, extend and maintain Apache ManifoldCF on your systems.

Among its off-the-shelf connectors, Apache ManifoldCF offers SharePoint, databases, file shares, and email. It provides a GUI to configure the connectors, the traversal time windows, the number of documents fetched per crawl cycle, and the regular expressions used to filter documents.
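
The three crawl settings mentioned above (time window, documents per cycle, regex filter) can be pictured with a short Python sketch. This only illustrates the concepts and does not reflect the ManifoldCF GUI, its configuration format, or its API; the function names `in_window` and `select_documents` and their parameters are hypothetical.

```python
import re
from datetime import time


def in_window(now, start, end):
    """Return True if `now` falls inside the crawl window [start, end).

    Windows that cross midnight (for example 22:00 to 06:00) are supported.
    """
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def select_documents(paths, include_pattern, max_per_cycle):
    """Keep paths matching the regex, capped at max_per_cycle per crawl cycle."""
    regex = re.compile(include_pattern)
    selected = [p for p in paths if regex.search(p)]
    return selected[:max_per_cycle]
```

For instance, `select_documents(["a.pdf", "b.tmp", "c.pdf"], r"\.pdf$", 10)` keeps only the two PDF paths, and `in_window(time(23, 0), time(22, 0), time(6, 0))` is true for a night-time window.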

When should you use Apache ManifoldCF?

Apache ManifoldCF is recommended when you need to index data from several heterogeneous systems, or when you anticipate changes to your information system. The framework is also well documented, which avoids depending on engineers who have built their own in-house framework without documenting it.

