Tutorial for combining ManifoldCF and Solr for file search

NOTE: If you are interested in using ManifoldCF with Solr, you may want to look at our Datafari software, which combines Apache ManifoldCF with Solr and eases this kind of integration. The code is available on GitHub: https://github.com/francelabs/datafari

With the arrival of ManifoldCF 1.0 (now already at v2.5), the open source community is looking for tutorials on combining it with Solr 4. That is the intent of this tutorial, which walks you through the steps required to make it work.

First, we'll recap the installation process of ManifoldCF (called MCF from here on) and of Solr. Second, we'll configure both tools so that they can interact with each other. Third, we'll configure MCF to crawl a Windows file share. In this tutorial, when I give an installation directory such as solr-4.1.0, you have to prefix it with the absolute path of that directory on your machine.

Installing Solr:

This one is rather simple. Download the latest version of Solr (tested with Solr 4.1) and unzip it. You also need a servlet container to deploy Solr and MCF: download the latest version of Tomcat (tested with apache-tomcat-7.0.37) and unzip it.

Some libraries are missing in Solr 4.1: because of the missing jars, a ClassNotFoundException can occur while parsing some files (such as signed files). Because these are not Tika exceptions, they lead to an HTTP error response code that MCF does not handle correctly. To fix this problem, create a directory named lib in

solr-4.1.0\example\solr

Edit the

solr-4.1.0\example\solr\solr.xml

file and add sharedLib="lib" to the solr tag:

<solr persistent="true" sharedLib="lib">

Download asm-3.1.jar and aspectjrt-1.7.1.jar and copy them into

solr-4.1.0\example\solr\lib

Edit the

solr-4.1.0\example\solr\collection1\conf\solrconfig.xml

file and set ignoreTikaException to true in the /update/extract request handler:

<requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <bool name="ignoreTikaException">true</bool>
       …
    </lst>
</requestHandler>

In the directory

apache-tomcat-7.0.37\conf\Catalina\localhost

(create Catalina and localhost if they don't exist), create a file called solr.xml with the following content:

<?xml version="1.0" encoding="utf-8"?>
<Context docBase=" solr-4.1.0\dist\solr-4.1.0.war" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value=" solr-4.1.0\example\solr" override="true"/>
</Context>

You can now start Solr from the command line with

apache-tomcat-7.0.37\bin\startup.bat

Have a look at the logs to make sure that everything went well, and test Solr with your browser:
http://localhost:8080/solr/#/

Installing ManifoldCF:

We will install and deploy ManifoldCF on Tomcat with a PostgreSQL database, which is the recommended database for ManifoldCF deployments in a production environment.

– First, download the latest released version of MCF (tested with Apache ManifoldCF 1.1.1) and unzip it.

– You also have to download the jcifs library to crawl Samba shares (tested with jcifs-1.3.17). Download jcifs-1.3.17.jar and copy it into:

apache-manifoldcf-1.1.1\connector-lib-proprietary

– Download version 9.1 of PostgreSQL. Install it with the installer.

In

apache-tomcat-7.0.37\conf\Catalina\localhost

create three XML files: mcf-api-service.xml, mcf-authority-service.xml and mcf-crawler-ui.xml.

In mcf-api-service.xml, add the following code:

<?xml version="1.0" encoding="utf-8"?>
<Context docBase=" apache-manifoldcf-1.1.1\web\war\mcf-api-service.war" crossContext="true" >
</Context>

Do the same for mcf-authority-service.xml and mcf-crawler-ui.xml, pointing docBase at mcf-authority-service.war and mcf-crawler-ui.war respectively.
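
Since the three context files differ only in the WAR they point to, they can also be generated. The following is a minimal Python sketch, not part of the MCF distribution: the function names are hypothetical, and it assumes the WAR files sit in web\war under the MCF directory, as in the example above.

```python
from pathlib import Path
from xml.sax.saxutils import quoteattr

# The three MCF web applications deployed in this tutorial.
WEBAPPS = ["mcf-api-service", "mcf-authority-service", "mcf-crawler-ui"]


def context_xml(war_path):
    """Return the content of a Tomcat context file pointing at war_path."""
    return (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        # quoteattr adds the surrounding quotes and escapes special characters.
        f'<Context docBase={quoteattr(str(war_path))} crossContext="true">\n'
        '</Context>\n'
    )


def write_context_files(mcf_home, tomcat_home):
    """Write one context file per MCF web application; return the paths written."""
    target_dir = Path(tomcat_home) / "conf" / "Catalina" / "localhost"
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name in WEBAPPS:
        war = Path(mcf_home) / "web" / "war" / f"{name}.war"
        target = target_dir / f"{name}.xml"
        target.write_text(context_xml(war), encoding="utf-8")
        written.append(target)
    return written
```

Call write_context_files with the absolute paths of your MCF and Tomcat installation directories, then check the generated files before starting Tomcat.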

Edit

apache-manifoldcf-1.1.1\connectors.xml

In this file, uncomment the following line:

<repositoryconnector name="Windows shares" class="org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector" />

Edit

apache-manifoldcf-1.1.1\multiprocess-example\properties.xml

The first five properties can be commented out because they are only useful for a Jetty server.

Modify the database implementation class in order to use PostgreSQL instead of Derby:

<property name="org.apache.manifoldcf.databaseimplementationclass" value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>

Add a line to specify the database name. This database should not exist yet on the PostgreSQL server:

<property name="org.apache.manifoldcf.database.name" value="manifoldcf"/>

Set the PostgreSQL superuser name and password:

<property name="org.apache.manifoldcf.dbsuperusername" value="user"/>
<property name="org.apache.manifoldcf.dbsuperuserpassword" value="password"/>
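
Before running the initialization script, it can help to check that the edited file really contains the four properties above. The following Python sketch is only an illustration under stated assumptions: check_properties is a hypothetical helper, and the only thing it assumes about properties.xml is that settings are stored as property elements with name and value attributes, as in the lines above. Properties inside XML comments are ignored by the parser, so commented-out entries do not count.

```python
import xml.etree.ElementTree as ET

# Properties set in this tutorial. A value of None means "any non-empty value".
EXPECTED = {
    "org.apache.manifoldcf.databaseimplementationclass":
        "org.apache.manifoldcf.core.database.DBInterfacePostgreSQL",
    "org.apache.manifoldcf.database.name": None,
    "org.apache.manifoldcf.dbsuperusername": None,
    "org.apache.manifoldcf.dbsuperuserpassword": None,
}


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the XML text."""
    root = ET.fromstring(xml_text)
    return {p.get("name"): p.get("value") for p in root.iter("property")}


def check_properties(xml_text):
    """Return a list of human-readable problems; an empty list means OK."""
    props = read_properties(xml_text)
    problems = []
    for name, expected in EXPECTED.items():
        value = props.get(name)
        if not value:
            problems.append(f"{name} is missing or empty")
        elif expected is not None and value != expected:
            problems.append(f"{name} is {value!r}, expected {expected!r}")
    return problems
```

To use it, read properties.xml as bytes (for example with pathlib.Path(...).read_bytes()), pass the content to check_properties and print the returned list.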

You can now initialize your database with the following script. This script will automatically create the database and tables that will be used by the crawler:

apache-manifoldcf-1.1.1\multiprocess-example\initialize.bat

Now set the classpath values with the following script:

apache-manifoldcf-1.1.1\multiprocess-example\setclasspath.bat

Start the agents that are in charge of crawling the data:

apache-manifoldcf-1.1.1\multiprocess-example\start-agents.bat

Edit

apache-tomcat-7.0.37\bin\startup.bat

Add the following line at the beginning of the file so that Tomcat points to the MCF configuration file:

set "CATALINA_OPTS=-Dorg.apache.manifoldcf.configfile= apache-manifoldcf-1.1.1\multiprocess-example\properties.xml"

Configuring ManifoldCF and crawling:

Start Tomcat and go to the admin interface: http://localhost:8080/mcf-crawler-ui/

In Output -> List Output Connections, select Add a new output connection.

Type a name, "Solr", then select Solr as the type and continue.

In the Server tab, enter collection1 as the Core/Collection name and change the port to 8080.

Click on Save. You should see "Connection status: Connection working".

In Repositories -> List Repository Connections, select Add a new connection.

Type a name, "FileShare", then select Windows shares as the type and continue.

In the Server tab, enter the information for your Windows server (hostname, domain, username and password).

Click on Save. You should see "Connection status: Connection working".

In Jobs -> List all Jobs, select Add a new job.

Enter a name, "CrawlJob". In the Connection tab, select Solr for the output connection and FileShare for the repository connection.

Click on Continue and go to the Paths tab.

Select your folder and click on +, then click on Add.

Click on Save and go to Jobs -> Status and Job Management.

Start your job. When the job is finished, ManifoldCF automatically performs a commit on Solr.

You can now perform a search with Solr:

http://localhost:8080/solr/collection1/select?q=*%3A*&fl=id&wt=xml&indent=true
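
The same check can be scripted. The sketch below is a minimal Python example, assuming Solr runs on localhost:8080 with the collection1 core as configured above. It asks for wt=json instead of wt=xml so the response is easier to parse, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def select_url(base_url, core, query="*:*", fields="id"):
    """Build a /select URL that asks Solr for a JSON response."""
    params = urlencode({"q": query, "fl": fields, "wt": "json", "indent": "true"})
    return f"{base_url}/{core}/select?{params}"


def extract_ids(payload):
    """Return (numFound, ids) from a decoded Solr JSON select response."""
    response = payload["response"]
    return response["numFound"], [doc.get("id") for doc in response["docs"]]


def list_indexed_ids(base_url="http://localhost:8080/solr", core="collection1"):
    """Query a running Solr and return (numFound, ids on the first result page)."""
    with urlopen(select_url(base_url, core), timeout=10) as resp:
        return extract_ids(json.load(resp))
```

Calling list_indexed_ids() once the job has finished should return the number of indexed documents and the ids on the first result page. If the count is 0, look at the job status in MCF and at the Solr logs.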
