{"id":69,"date":"2013-02-26T14:15:35","date_gmt":"2013-02-26T13:15:35","guid":{"rendered":"http:\/\/www.francelabs.com\/blog\/?p=69"},"modified":"2016-09-21T14:37:00","modified_gmt":"2016-09-21T13:37:00","slug":"tutorial-for-combining-manifoldcf-and-solr-for-files-search","status":"publish","type":"post","link":"https:\/\/www.francelabs.com\/blog\/tutorial-for-combining-manifoldcf-and-solr-for-files-search\/","title":{"rendered":"Tutorial for combining ManifoldCF and Solr for files search"},"content":{"rendered":"<p>NOTE: If you are interested in using ManifoldCF with Solr, you may want to look at our <a title=\"Datafari website\" href=\"http:\/\/www.datafari.com\/en\" target=\"_blank\">Datafari software<\/a>, which combines Apache ManifoldCF with Solr and so eases this kind of integration. The code is available on GitHub: <a href=\"https:\/\/github.com\/francelabs\/datafari\">https:\/\/github.com\/francelabs\/datafari<\/a><\/p>\n<p>With the arrival of Manifold CF 1.0 (now already in v2.5), the open source community is looking for tutorials to combine it with Solr 4. That&#8217;s the intent of this tutorial, which will walk you through the steps required to make it work.<\/p>\n<p>First, we&#8217;ll recap the installation process of Manifold CF (we&#8217;ll call it MCF later on), and of Solr. Second, we&#8217;ll configure both tools so that they can interact with each other. Third, we&#8217;ll configure MCF so that it crawls a Windows file share. In this tutorial, when I specify an installation directory such as solr-4.1.0, you have to replace it with the absolute path of that installation directory.<!--more--><\/p>\n<p><strong>Installing Solr:<\/strong><\/p>\n<p>That&#8217;s a rather simple one. Download the latest version of Solr (tested with Solr 4.1) and unzip it. You also have to download a servlet container in order to deploy Solr and MCF. 
Download the latest version of Tomcat (tested with apache-tomcat-7.0.37) and unzip it.<\/p>\n<p>Some libraries are missing in Solr 4.1: ClassNotFoundException errors can occur because of the missing jars while parsing some files (such as signed files). Because these are not Tika exceptions, they lead to an HTTP response error code that is not handled correctly by MCF. To fix this problem, create a directory named lib in<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">solr-4.1.0\\example\\solr<\/pre>\n<p>Edit the<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">solr-4.1.0\\example\\solr\\solr.xml<\/pre>\n<p>file and add sharedLib=&quot;lib&quot; to the solr tag:<\/p>\n<pre class=\"brush: xml; title: ; notranslate\" title=\"\">&lt;solr persistent=&quot;true&quot; sharedLib=&quot;lib&quot;&gt;<\/pre>\n<p>Download asm-3.1.jar and aspectjrt-1.7.1.jar and copy them into<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">solr-4.1.0\\example\\solr\\lib<\/pre>\n<p>Edit the<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">solr-4.1.0\\example\\solr\\collection1\\conf\\solrconfig.xml<\/pre>\n<p>file and set ignoreTikaException to true in the \/update\/extract request handler:<\/p>\n<pre class=\"brush: xml; title: ; notranslate\" title=\"\">&lt;requestHandler name=&quot;\/update\/extract&quot;\r\n                  startup=&quot;lazy&quot;\r\n                  class=&quot;solr.extraction.ExtractingRequestHandler&quot; &gt;\r\n    &lt;lst name=&quot;defaults&quot;&gt;\r\n      &lt;bool name=&quot;ignoreTikaException&quot;&gt;true&lt;\/bool&gt;\r\n       \u2026\r\n    &lt;\/lst&gt;\r\n&lt;\/requestHandler&gt;<\/pre>\n<p>In the directory<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-tomcat-7.0.37\\conf\\Catalina\\localhost<\/pre>\n<p>(create Catalina and localhost if they don\u2019t exist), create a file called solr.xml with the following content:<\/p>\n<pre class=\"brush: 
xml; title: ; notranslate\" title=\"\">&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;\r\n&lt;Context docBase=&quot;solr-4.1.0\\dist\\solr-4.1.0.war&quot; crossContext=&quot;true&quot;&gt;\r\n  &lt;Environment name=&quot;solr\/home&quot; type=&quot;java.lang.String&quot; value=&quot;solr-4.1.0\\example\\solr&quot; override=&quot;true&quot;\/&gt;\r\n&lt;\/Context&gt;<\/pre>\n<p>You can now start Solr from the command line with<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-tomcat-7.0.37\\bin\\startup.bat<\/pre>\n<p>Have a look at the logs to be sure that everything went well, and test Solr in your browser:<br \/>\n<a title=\"local Solr url\" href=\"http:\/\/localhost:8080\/solr\/#\/\">http:\/\/localhost:8080\/solr\/#\/<\/a><\/p>\n<p><strong>Installing ManifoldCF:<\/strong><\/p>\n<p>We will install and deploy Manifold CF on Tomcat with a PostgreSQL database, which is the recommended database for Manifold CF deployments in a production environment.<\/p>\n<p>&#8211; First download the latest released version of MCF (tested with Apache Manifold CF 1.1.1) and unzip it.<\/p>\n<p>&#8211; You also have to download the jcifs library to crawl Samba shares (tested with jcifs-1.3.17). Download jcifs-1.3.17.jar and copy it into:<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-manifoldcf-1.1.1\\connector-lib-proprietary<\/pre>\n<p>&#8211; Download version 9.1 of PostgreSQL. 
Install it with the installer.<\/p>\n<p>In<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-tomcat-7.0.37\\conf\\Catalina\\localhost<\/pre>\n<p>create three XML files: mcf-api-service.xml, mcf-authority-service.xml and mcf-crawler-ui.xml.<\/p>\n<p>In mcf-api-service.xml, add the following code:<\/p>\n<pre class=\"brush: xml; title: ; notranslate\" title=\"\">&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;\r\n&lt;Context docBase=&quot;apache-manifoldcf-1.1.1\\web\\war\\mcf-api-service.war&quot; crossContext=&quot;true&quot; &gt;\r\n&lt;\/Context&gt;<\/pre>\n<p>Do the same for mcf-authority-service.xml and mcf-crawler-ui.xml, each pointing to its corresponding .war file.<\/p>\n<p>Edit<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-manifoldcf-1.1.1\\connectors.xml<\/pre>\n<p>In this file, uncomment the following line:<\/p>\n<pre class=\"brush: xml; title: ; notranslate\" title=\"\">&lt;repositoryconnector name=&quot;Windows shares&quot; class=&quot;org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector&quot; \/&gt;<\/pre>\n<p>Edit<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-manifoldcf-1.1.1\\multiprocess-example\\properties.xml<\/pre>\n<p>The first five properties can be commented out because they are only useful for a Jetty server.<\/p>\n<p>Modify the database implementation class in order to use PostgreSQL instead of Derby:<\/p>\n<pre class=\"brush: xml; title: ; notranslate\" title=\"\">&lt;property name=&quot;org.apache.manifoldcf.databaseimplementationclass&quot; value=&quot;org.apache.manifoldcf.core.database.DBInterfacePostgreSQL&quot;\/&gt;<\/pre>\n<p>Add a line to specify the database name. This database must not exist yet on the PostgreSQL server.<\/p>\n<pre class=\"brush: xml; title: ; notranslate\" title=\"\">&lt;property name=&quot;org.apache.manifoldcf.database.name&quot; value=&quot;manifoldcf&quot;\/&gt;<\/pre>\n<p>Modify the username and 
password:<\/p>\n<pre class=\"brush: xml; title: ; notranslate\" title=\"\">&lt;property name=&quot;org.apache.manifoldcf.dbsuperusername&quot; value=&quot;user&quot;\/&gt;\r\n&lt;property name=&quot;org.apache.manifoldcf.dbsuperuserpassword&quot; value=&quot;password&quot;\/&gt;<\/pre>\n<p>You can now initialize your database with the following script, which automatically creates the database and the tables that will be used by the crawler:<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-manifoldcf-1.1.1\\multiprocess-example\\initialize.bat<\/pre>\n<p>Now set the classpath values with the following script:<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-manifoldcf-1.1.1\\multiprocess-example\\setclasspath.bat<\/pre>\n<p>Start the agents that are in charge of crawling the data:<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-manifoldcf-1.1.1\\multiprocess-example\\start-agents.bat<\/pre>\n<p>Edit<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-tomcat-7.0.37\\bin\\startup.bat<\/pre>\n<p>Add the following line at the beginning of the file so that it points to the MCF configuration file:<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">set &quot;CATALINA_OPTS=-Dorg.apache.manifoldcf.configfile=apache-manifoldcf-1.1.1\\multiprocess-example\\properties.xml&quot;<\/pre>\n<p><strong>Configuring Manifold CF and crawling:<\/strong><\/p>\n<p>Start Tomcat. 
Go to the admin interface: <a title=\"Local MCF admin UI\" href=\"http:\/\/localhost:8080\/mcf-crawler-ui\/\">http:\/\/localhost:8080\/mcf-crawler-ui\/<\/a><\/p>\n<p><a href=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/adminUI.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-107\" src=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/adminUI.png\" alt=\"adminUI\" width=\"1215\" height=\"594\" srcset=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/adminUI.png 1215w, https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/adminUI-300x146.png 300w, https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/adminUI-1024x500.png 1024w, https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/adminUI-500x244.png 500w\" sizes=\"auto, (max-width: 1215px) 100vw, 1215px\" \/><\/a><\/p>\n<p>In Output -&gt; List Output Connections, select Add a new output connection.<\/p>\n<p>Enter a name, \u201cSolr\u201d, then select Solr as the type and continue.<\/p>\n<p>In the Server tab, enter collection1 as the Core\/Collection name and change the port to 8080:<\/p>\n<p><a href=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/configSolr.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-90\" src=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/configSolr.png\" alt=\"configSolr\" width=\"927\" height=\"380\" srcset=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/configSolr.png 927w, https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/configSolr-300x122.png 300w, https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/configSolr-500x204.png 500w\" sizes=\"auto, (max-width: 927px) 100vw, 927px\" \/><\/a><\/p>\n<p>Click on Save. You should see Connection status: Connection working.<\/p>\n<p>In Repositories -&gt; List Repository Connections, select Add a 
new connection.<\/p>\n<p>Enter a name, \u201cFileShare\u201d, then select Windows shares as the type and continue.<\/p>\n<p>In the Server tab, enter the details of your Windows server (hostname, domain, username and password).<\/p>\n<p>Click on Save. You should see Connection status: Connection working.<\/p>\n<p>In Jobs -&gt; List all Jobs, select Add a new job.<\/p>\n<p>Enter a name, \u201cCrawlJob\u201d. In the Connection tab, select Solr for the output connection and FileShare for the repository connection.<\/p>\n<p>Click on Continue and go to Paths.<\/p>\n<p>Select your folder and click on +. Then click on Add. You should have:<\/p>\n<p><a href=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/crawlJob.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-91\" src=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/crawlJob.png\" alt=\"crawlJob\" width=\"1350\" height=\"311\" srcset=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/crawlJob.png 1350w, https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/crawlJob-300x69.png 300w, https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/crawlJob-1024x235.png 1024w, https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/crawlJob-500x115.png 500w\" sizes=\"auto, (max-width: 1350px) 100vw, 1350px\" \/><\/a><\/p>\n<p>Click on Save and go to Jobs -&gt; Status and Job Management.<\/p>\n<p>Start your job. 
When the job is finished, Manifold CF automatically performs a commit on Solr.<\/p>\n<p><a href=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/jobfinished.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-92\" src=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/jobfinished.png\" alt=\"jobfinished\" width=\"1314\" height=\"171\" srcset=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/jobfinished.png 1314w, https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/jobfinished-300x39.png 300w, https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/jobfinished-1024x133.png 1024w, https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/jobfinished-500x65.png 500w\" sizes=\"auto, (max-width: 1314px) 100vw, 1314px\" \/><\/a><\/p>\n<p>You can now perform a search with Solr:<\/p>\n<p><a title=\"http:\/\/localhost:8080\/solr\/collection1\/select?q=*%3A*&amp;fl=id&amp;wt=xml&amp;indent=true\" href=\"http:\/\/localhost:8080\/solr\/collection1\/select?q=*%3A*&amp;fl=id&amp;wt=xml&amp;indent=true\">http:\/\/localhost:8080\/solr\/collection1\/select?q=*%3A*&amp;fl=id&amp;wt=xml&amp;indent=true<\/a><\/p>\n<p><a href=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/solrResult.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-93\" src=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/solrResult.png\" alt=\"solrResult\" width=\"1034\" height=\"592\" srcset=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/solrResult.png 1034w, https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/solrResult-300x171.png 300w, https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/solrResult-1024x586.png 1024w, https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2013\/02\/solrResult-500x286.png 500w\" sizes=\"auto, (max-width: 1034px) 100vw, 1034px\" 
\/><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>NOTE: If you are interested in using ManifoldCF with Solr, you may want to look at our Datafari software, which combines Apache ManifoldCF with Solr and so eases this kind of integration. The code is available on GitHub: https:\/\/github.com\/francelabs\/datafari &hellip; <a href=\"https:\/\/www.francelabs.com\/blog\/tutorial-for-combining-manifoldcf-and-solr-for-files-search\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[19,1,12],"tags":[20,22,60,21,59],"class_list":["post-69","post","type-post","status-publish","format-standard","hentry","category-manifoldcf","category-search","category-solr","tag-files","tag-manifold-cf","tag-manifoldcf","tag-share","tag-solr"],"_links":{"self":[{"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/posts\/69","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/comments?post=69"}],"version-history":[{"count":55,"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/posts\/69\/revisions"}],"predecessor-version":[{"id":395,"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/posts\/69\/revisions\/395"}],"wp:attachment":[{"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/media?parent=69"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/categories?post=69"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.francelab
s.com\/blog\/wp-json\/wp\/v2\/tags?post=69"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}