NOTE: This is the English version. You will find the French version further down in this article.
When presenting our search expertise, we are often asked to run performance evaluations on large datasets, for instance in Proofs of Concept. For a recent customer request, in order to save time and avoid using sensitive customer data, we used log-synth, a random data generator developed by Ted Dunning. Here we describe how to use log-synth to generate a 100,000-line data set.
The first step, which we don’t document here, is to download log-synth, unzip it and build it with Maven.
NOTE: There is a French version of this tutorial, which you’ll find in the second half of this blog entry.
In this tutorial, we’ll be setting up a SolrCloud cluster on Amazon EC2.
We’ll be using Solr 5.1, the embedded Jetty, Zookeeper 3.4.6 on Debian 7 instances.
This tutorial explains step by step how to reach this objective.
We’ll be installing a set of 3 machines, with 3 shards and 2 replicas per shard, which gives us a total of 9 shard copies (each shard has a leader plus 2 replicas).
We’ll also be installing a Zookeeper ensemble of 3 machines.
This architecture will be flexible enough to allow for a fail-over of one or two machines, depending on whether we’re at the indexing phase or at the querying phase:
- Indexing: one machine can fail without impacting the cluster (the Zookeeper ensemble of 3 machines tolerates one machine down). The updates are successfully broadcast to the machines still running.
- Querying: two machines can fail without impacting the cluster. Since each machine hosts 3 shards, a search query can still be processed, the only constraint being a slower response time due to the higher load on the remaining machine.
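The fail-over figures above can be checked with a little arithmetic. The following is a minimal sketch, not part of the tutorial itself: the host names are hypothetical, and it assumes that each machine hosts exactly one copy of each of the 3 shards, which is how we read the layout described above.

```python
from itertools import combinations

# Hypothetical host names, used for illustration only.
MACHINES = ["solr1", "solr2", "solr3"]
SHARDS = ["shard1", "shard2", "shard3"]

# Assumed layout: every machine hosts one copy of every shard
# (3 machines x 3 shards = 9 shard copies in total).
layout = {machine: set(SHARDS) for machine in MACHINES}


def zookeeper_tolerated_failures(ensemble_size):
    """Nodes that may fail while a strict majority (quorum) stays up."""
    quorum = ensemble_size // 2 + 1
    return ensemble_size - quorum


def can_serve_queries(layout, failed):
    """True if every shard still has at least one copy on a live machine."""
    alive = [shards for machine, shards in layout.items() if machine not in failed]
    return set().union(*alive) == set(SHARDS)


def query_tolerated_failures(layout):
    """Largest k such that any k machines can fail and queries still work."""
    machines = list(layout)
    tolerated = 0
    for k in range(1, len(machines) + 1):
        if all(can_serve_queries(layout, set(c)) for c in combinations(machines, k)):
            tolerated = k
        else:
            break
    return tolerated


print(zookeeper_tolerated_failures(3))   # indexing side: 1 machine
print(query_tolerated_failures(layout))  # querying side: 2 machines
```

With a different layout (for instance, shard copies not spread evenly over the machines), the querying figure would change, so the sketch is only valid under the stated assumption.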
NOTE: For English version, please look further down.
We have created a French-speaking Solr mailing list, so that developers who feel more comfortable in French than in English can discuss Solr in the language of Molière. Come and join us on the French Solr mailing list!
NOTE: This is the English version. For the French version, please scroll down.
UPDATE 08/08/16: this post has been updated for Datafari v3.
UPDATE 01/04/16: beware that Docker Toolbox 1.9.1 has a bug that affects Cassandra (which is a component of Datafari). Update your Docker to 1.10+.
This time, we’ll talk about the release of Datafari on Docker.
If you don’t know it yet, Docker is a containerization mechanism that relies on low-level features of the Linux kernel, which makes it faster than widespread virtualisation technologies such as VMware. As its name suggests, you can “dock” applications in an isolated manner, and each one works as a standalone system on your OS.
Although we recommend installing Datafari alone on a system when it is used in a production environment, using Datafari on Docker allows you to install it quickly without impacting the configuration and packages already in place on your system. Just download the Docker image, and Docker takes care of the rest.
NOTE: This is the English version. For the French version, please scroll down.
For those of you who use or keep an eye on ManifoldCF (a connector framework from the Apache Software Foundation), its team just released (26th Dec. 2014) ManifoldCF 1.8 and 2.0. Yes, that’s two releases at the same time.
NOTE: French version at the bottom of this page.
We often read on the web that Elasticsearch is really cool because it is schemaless, whereas Solr is not. Although Elasticsearch is cool for many reasons, we want to remind you that Solr has also been schemaless since July 2013 (Solr 4.4).
To remind you what schemaless means: without any manual editing of the Solr schema, Solr can recognize some data types automatically when it receives data to be indexed. Those types are: Boolean, Integer, Long, Float, Double, and Date.
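To give an idea of what this kind of type guessing looks like, here is a simplified sketch. It is not Solr’s actual implementation: the function name, the order of the checks, the single date format, and the fallback to "String" are all assumptions made for illustration, and the sketch does not try to tell Float from Double.

```python
from datetime import datetime


def guess_field_type(value):
    """Guess a field type from a raw string value (illustrative only)."""
    text = value.strip()

    # Booleans are checked first, since "true"/"false" are not numbers.
    if text.lower() in ("true", "false"):
        return "Boolean"

    # Whole numbers: Integer if it fits in 32 bits, Long otherwise.
    try:
        number = int(text)
        return "Integer" if -2**31 <= number < 2**31 else "Long"
    except ValueError:
        pass

    # Decimal numbers: reported as Double (Float is not distinguished here).
    try:
        float(text)
        return "Double"
    except ValueError:
        pass

    # Dates: only one ISO-8601 style format is tried (an assumption).
    try:
        datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
        return "Date"
    except ValueError:
        pass

    # Assumed fallback when nothing else matches.
    return "String"


for sample in ["true", "42", "3000000000", "3.14", "2013-07-23T00:00:00Z", "hello"]:
    print(sample, "->", guess_field_type(sample))
```

The order of the checks matters: a value such as "42" would also parse as a decimal number, so the narrower types are tried first.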
That’s pretty convenient for quick prototyping. Still, as for Elasticsearch, ...
UPDATE: This tutorial is based on Solr 4. If you want to use Solr 5, we strongly recommend following our more recent blog entry on setting up SolrCloud 5 on Amazon EC2.
NOTE: There is a French version of this tutorial, which you’ll find in the second half of this blog entry.
In this tutorial, we’ll be installing a SolrCloud cluster on Amazon EC2.
We’ll be using Solr 4.9, Tomcat 7 and Zookeeper 3.4.6 on Debian 7 instances.
This tutorial will explain how to achieve this result.
We’ll be installing a set of 3 machines with 3 shards and 2 replicas per shard, thus creating a set of 9 shard copies.
We’ll also be installing a Zookeeper ensemble of 3 machines.
With the arrival of ManifoldCF 1.0 (now already in v1.6.1), the open source community is looking for tutorials on combining it with Elasticsearch. That’s the intent of this tutorial, which will walk you through the different steps required to make it work.
First, we’ll recap the installation process of ManifoldCF (we’ll call it MCF from now on). Second, we’ll install Elasticsearch with the attachment plugin so that it handles rich document indexing. Third, we’ll configure MCF so that it crawls a Windows file share and indexes documents in Elasticsearch. In this tutorial, when I specify an installation directory such as apache-manifoldcf-1.6.1, you have to prefix it with the absolute path of the installation directory.
NOTE: English version on top, French version below.
We have noticed that, with Solr 4, there is a problem in the SolrMeter UI related to the evaluation of the cache hit ratio. Digging a bit, we found that the problem is due to a type change between Solr 3 and Solr 4: SolrMeter expects a string, whereas Solr 4 sends back a float. More precisely, Solr 4 does that within its request handler mbean, in the cache subcategory.
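To illustrate the kind of type mismatch involved, here is a minimal sketch of a reader that accepts the hit ratio either as a string or as a float. It is not the SolrMeter patch (SolrMeter is written in Java), and the function name is hypothetical.

```python
def parse_hit_ratio(raw):
    """Return the cache hit ratio as a float, whatever type it arrives in."""
    # Solr 4 style: the value is already numeric.
    if isinstance(raw, (int, float)):
        return float(raw)
    # Solr 3 style: the value arrives as text such as "0.85".
    return float(str(raw).strip())


print(parse_hit_ratio("0.85"))
print(parse_hit_ratio(0.85))
```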
We’re now using a patch available for this bug, created by Javier Mendez; see his contribution on this Google group.
Still, there is no binary version of SolrMeter, hence this blog.
There are several MapReduce sample programs available for testing and learning Hadoop.
One of these samples is the inverted index, i.e. for each word we want to know which files it appears in. Thus the output file should look like this:
formation formation.txt test.txt
This example is mentioned on the Yahoo developer network, but it doesn’t work as is on version 0.20 of Hadoop.
We decided to rewrite parts of the code in order to make it compatible. This is what you will find in this blog article.
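Before looking at the Hadoop code, the logic of the inverted index described above can be sketched in a few lines of plain Python. This is only an in-memory simulation of the map and reduce phases, not the Hadoop code of this article; the function names and the content of the two sample files are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(filename, content):
    """Emit one (word, filename) pair per word found in the file."""
    for word in content.split():
        yield word.lower(), filename


def reduce_phase(pairs):
    """Group the file names by word, without duplicates, sorted by name."""
    index = defaultdict(set)
    for word, filename in pairs:
        index[word].add(filename)
    return {word: sorted(files) for word, files in index.items()}


def inverted_index(documents):
    """Run the map phase on every document, then reduce all emitted pairs."""
    pairs = []
    for filename, content in documents.items():
        pairs.extend(map_phase(filename, content))
    return reduce_phase(pairs)


# Hypothetical sample files and contents.
documents = {
    "formation.txt": "formation hadoop",
    "test.txt": "formation test",
}

index = inverted_index(documents)
for word in sorted(index):
    # One line per word, followed by the files it appears in.
    print(word, " ".join(index[word]))
```

In a real Hadoop job the grouping by word is done by the framework between the map and reduce steps; here a dictionary plays that role.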