{"id":242,"date":"2014-06-25T14:56:52","date_gmt":"2014-06-25T13:56:52","guid":{"rendered":"http:\/\/www.francelabs.com\/blog\/?p=242"},"modified":"2014-06-25T15:23:27","modified_gmt":"2014-06-25T14:23:27","slug":"tutorial-for-combining-manifoldcf-and-elasticsearch-for-files-search","status":"publish","type":"post","link":"https:\/\/www.francelabs.com\/blog\/tutorial-for-combining-manifoldcf-and-elasticsearch-for-files-search\/","title":{"rendered":"Tutorial for combining ManifoldCF and Elasticsearch for files search"},"content":{"rendered":"<p>With the arrival of Manifold CF 1.0 (now already in v1.6.1), the open source community is looking for tutorials to combine it with Elasticsearch. That\u2019s the intent of this tutorial, which will walk you through the different steps required to make it work.<\/p>\n<p>First, we\u2019ll recap the installation process of Manifold CF (we\u2019ll call it MCF later on). Second, we will install ElasticSearch with the attachment plugin so that it handles rich document indexing. Third, we\u2019ll configure MCF so that it crawls a Windows file share and indexes documents in ElasticSearch. In this tutorial, when I refer to an installation directory such as apache-manifoldcf-1.6.1, replace it with the absolute path of that directory.<br \/>\n<!--more--><\/p>\n<p><strong>Installing ManifoldCF:<\/strong><\/p>\n<p>We will install and deploy Manifold CF on Tomcat with a PostgreSQL database, which is the recommended database for Manifold CF deployments in production environments.<\/p>\n<ul>\n<li>First download the latest released version of MCF (tested with Apache Manifold CF 1.6.1) and unzip it.<\/li>\n<li>You also have to download the jcifs library to crawl Samba shares (tested with jcifs-1.3.17). Download jcifs-1.3.17.jar and copy it into:\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-manifoldcf-1.6.1\\connector-lib-proprietary<\/pre>\n<\/li>\n<li>Download version 9.3 of PostgreSQL. 
Install it with the installer.<\/li>\n<\/ul>\n<p>In<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-tomcat-8.0.8\\conf\\Catalina\\localhost<\/pre>\n<p>create three XML files: mcf-api-service.xml, mcf-authority-service.xml and mcf-crawler-ui.xml (you will probably have to create the Catalina and localhost directories).<\/p>\n<p>In mcf-api-service.xml, add the following code:<\/p>\n<pre class=\"brush: xml; title: ; notranslate\" title=\"\">&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;\r\n&lt;Context docBase=&quot;apache-manifoldcf-1.6.1\\web\\war\\mcf-api-service.war&quot; crossContext=&quot;true&quot;&gt;\r\n&lt;\/Context&gt;<\/pre>\n<p>Do the same for mcf-authority-service.xml and mcf-crawler-ui.xml, pointing docBase to the corresponding .war file in each case.<\/p>\n<p>Edit<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-manifoldcf-1.6.1\\connectors.xml<\/pre>\n<p>In this file, uncomment the following line:<\/p>\n<pre class=\"brush: xml; title: ; notranslate\" title=\"\">&lt;repositoryconnector name=&quot;Windows shares&quot; class=&quot;org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector&quot; \/&gt;<\/pre>\n<p>Edit<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-manifoldcf-1.6.1\\multiprocess-file-example\\properties.xml<\/pre>\n<p>The first five properties can be commented out because they are only used by the Jetty server.<\/p>\n<p>Modify the database implementation class in order to use PostgreSQL instead of Derby:<\/p>\n<pre class=\"brush: xml; title: ; notranslate\" title=\"\">&lt;property name=&quot;org.apache.manifoldcf.databaseimplementationclass&quot; value=&quot;org.apache.manifoldcf.core.database.DBInterfacePostgreSQL&quot;\/&gt;<\/pre>\n<p>Add a line to specify the database name. This database should not exist yet on the PostgreSQL server.<\/p>\n<pre class=\"brush: xml; title: ; notranslate\" 
title=\"\">&lt;property name=&quot;org.apache.manifoldcf.database.name&quot; value=&quot;manifoldcf&quot;\/&gt;<\/pre>\n<p>Set the PostgreSQL superuser name and password:<\/p>\n<pre class=\"brush: xml; title: ; notranslate\" title=\"\">&lt;property name=&quot;org.apache.manifoldcf.dbsuperusername&quot; value=&quot;postgres&quot;\/&gt;<\/pre>\n<pre class=\"brush: xml; title: ; notranslate\" title=\"\">&lt;property name=&quot;org.apache.manifoldcf.dbsuperuserpassword&quot; value=&quot;password&quot;\/&gt;<\/pre>\n<p>You can now initialize your database with the following script. This script will automatically create the database and tables that will be used by the crawler:<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-manifoldcf-1.6.1\\multiprocess-file-example\\initialize.bat<\/pre>\n<p>Now set the classpath values with the following script:<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-manifoldcf-1.6.1\\multiprocess-file-example\\setclasspath.bat<\/pre>\n<p>Edit<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-tomcat-8.0.8\\bin\\startup.bat<\/pre>\n<p>Add the following line at the beginning of the file to point to the MCF configuration file:<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">set &quot;CATALINA_OPTS=-Dorg.apache.manifoldcf.configfile=apache-manifoldcf-1.6.1\\multiprocess-file-example\\properties.xml&quot;<\/pre>\n<p><strong>Installing ElasticSearch with the attachment plugin<\/strong><\/p>\n<p>Unzip elasticsearch-1.2.1.zip.<\/p>\n<p>Run<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">elasticsearch-1.2.1\/bin\/plugin.bat -install elasticsearch\/elasticsearch-mapper-attachments\/2.0.0<\/pre>\n<p>to download the attachment plugin. 
It will be automatically installed during the next startup of Elasticsearch.<\/p>\n<p>Start Elasticsearch:<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">elasticsearch-1.2.1\/bin\/elasticsearch.bat<\/pre>\n<p>Create the new index \u201cfileshare\u201d. You can use curl for Windows or any client that can send REST requests:<\/p>\n<pre class=\"brush: xml; title: ; notranslate\" title=\"\">curl -X PUT &quot;localhost:9200\/fileshare&quot; -d '{\r\n  &quot;settings&quot; : { &quot;index&quot; : { &quot;number_of_shards&quot; : 1, &quot;number_of_replicas&quot; : 0 }}\r\n}'\r\n<\/pre>\n<p>Next, add a mapping that creates the document type \u201cfile\u201d, which uses the attachment plugin to handle rich documents:<\/p>\n<pre class=\"brush: xml; title: ; notranslate\" title=\"\">curl -X PUT &quot;localhost:9200\/fileshare\/file\/_mapping&quot; -d '{\r\n  &quot;file&quot; : {\r\n    &quot;properties&quot; : {\r\n      &quot;file&quot; : {\r\n        &quot;type&quot; : &quot;attachment&quot;\r\n      }\r\n    }\r\n  }\r\n}'<\/pre>\n<p><strong>Configuring Manifold CF and crawling:<\/strong><\/p>\n<p>Start the agents that are in charge of crawling the data:<\/p>\n<pre class=\"brush: powershell; title: ; notranslate\" title=\"\">apache-manifoldcf-1.6.1\\multiprocess-file-example\\start-agents.bat<\/pre>\n<p>Start Tomcat. 
Go to the admin interface:\u00a0<a href=\"http:\/\/localhost:8080\/mcf-crawler-ui\/\">http:\/\/localhost:8080\/mcf-crawler-ui\/<\/a><\/p>\n<p>By default, the credentials are admin\/admin.<\/p>\n<p><a href=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2014\/06\/blogpost_MCF_ES_1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-thumbnail wp-image-243\" src=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2014\/06\/blogpost_MCF_ES_1-150x150.png\" alt=\"Document Ingestion\" width=\"150\" height=\"150\" \/><\/a><\/p>\n<p>In Output -&gt; List Output Connections, select Add a new output connection.<\/p>\n<p>Type a name, \u201cElasticSearch\u201d, then select ElasticSearch for type and continue.<\/p>\n<p>In the Parameters tab, change index name to fileshare and index type to file. <a href=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2014\/06\/blogpost_MCF_ES_2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-thumbnail wp-image-244\" src=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2014\/06\/blogpost_MCF_ES_2-150x150.png\" alt=\"Document ingestion 2\" width=\"150\" height=\"150\" \/><\/a><\/p>\n<p>Click on Save. The connection status should read \u201cConnection working\u201d.<\/p>\n<p>In Repositories -&gt; List Repository Connections, select Add a new connection.<\/p>\n<p>Type a name, \u201cFileShare\u201d, then select Windows shares for type and continue.<\/p>\n<p>In the Server tab, enter the information of your Windows server (hostname, domain, username and password).<\/p>\n<p><a href=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2014\/06\/blogpost_MCF_ES_3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-thumbnail wp-image-245\" src=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2014\/06\/blogpost_MCF_ES_3-150x150.png\" alt=\"Document Ingestion 3\" width=\"150\" height=\"150\" \/><\/a><\/p>\n<p>Click on Save. The connection status should read 
\u201cConnection working\u201d.<\/p>\n<p>In Jobs -&gt; List all Jobs, select Add a new job.<\/p>\n<p>Enter a name, \u201cCrawlFile\u201d. In the Connection tab, select ElasticSearch for the output connection and FileShare for the repository connection.<\/p>\n<p>Click on Continue and go to the Paths tab.<\/p>\n<p>Select your folder and click on +. Then click on Add. You should have:<\/p>\n<p><a href=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2014\/06\/blogpost_MCF_ES_4.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-thumbnail wp-image-246\" src=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2014\/06\/blogpost_MCF_ES_4-150x150.png\" alt=\"Document Ingestion 4\" width=\"150\" height=\"150\" \/><\/a><\/p>\n<p>Click on Save and go to Jobs -&gt; Status and Job Management.<\/p>\n<p>Start your job and wait until it ends:<\/p>\n<p><a href=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2014\/06\/blogpost_MCF_ES_5.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-thumbnail wp-image-247\" src=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2014\/06\/blogpost_MCF_ES_5-150x150.png\" alt=\"Status of MCF Jobs\" width=\"150\" height=\"150\" \/><\/a><\/p>\n<p>You can now perform a search with ElasticSearch. 
For instance, we can now search for (and find) the documents in our file share that contain the word \u201csearch\u201d:<\/p>\n<p><a href=\"http:\/\/localhost:9200\/fileshare\/_search?q=search&amp;pretty=true\">http:\/\/localhost:9200\/fileshare\/_search?q=search&amp;pretty=true<\/a><\/p>\n<p><a href=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2014\/06\/blogpost_MCF_ES_6.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-thumbnail wp-image-248\" src=\"https:\/\/www.francelabs.com\/blog\/wp-content\/uploads\/2014\/06\/blogpost_MCF_ES_6-150x150.png\" alt=\"Search results\" width=\"150\" height=\"150\" \/><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>With the arrival of Manifold CF 1.0 (now already in v1.6.1), the open source community is looking for tutorials to combine it with Elasticsearch. That\u2019s the intent of this tutorial, which will walk you through the different steps required to &hellip; <a href=\"https:\/\/www.francelabs.com\/blog\/tutorial-for-combining-manifoldcf-and-elasticsearch-for-files-search\/\">Continue reading <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[39,19,1],"tags":[37,38,60],"class_list":["post-242","post","type-post","status-publish","format-standard","hentry","category-elasticsearch-2","category-manifoldcf","category-search","tag-elasticsearch","tag-file-search","tag-manifoldcf"],"_links":{"self":[{"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/posts\/242","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/comments?post=242"}],"version-history":[{"count":7,"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/posts\/242\/revisions"}],"predecessor-version":[{"id":255,"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/posts\/242\/revisions\/255"}],"wp:attachment":[{"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/media?parent=242"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/categories?post=242"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.francelabs.com\/blog\/wp-json\/wp\/v2\/tags?post=242"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}