Constellio 1.3 architecture part 2

While waiting for version 2.0 of Constellio, we decided to draw and explain the system architecture of Constellio 1.3.
We thought it could help readers better understand how things work. This entry does not cover how these components map to Java classes, servlets, files and databases, but it gives a good overview of how the system works.

This is the second entry of a series, as it would take too much time and space to explain all the components in one entry.

Before we explain the components, recall that search engines have three layers: the data retrieval layer (aka connectors), the indexing layer, and the searching layer. Wherever applicable, we'll use this terminology to describe the components.

The first entry already explained the data retrieval part.
In this entry, we focus on the indexing mechanism.

Constellio 1.3 Architecture

In the previous entry, we left off at the point where crawled data is pushed to the Constellio Feed Component. The feed component stores all the crawled data in an internal table (now in MySQL by default), and each entry is called a record. For now this table serves as a staging area for later indexing by Solr, and the same table also persists the records so that they can be used to enhance the Solr-indexed content (this will probably change in Constellio 2.0).
The records table is then read by the Indexing Manager. Its role is to process the records and push them to Solr so that Solr can index them. Note the optional plugins in the Indexing Manager. These plugins are part of Constellio's pre-indexing pipeline. If you want to change file names based on some business knowledge, change the encoding format, or enrich the records' metadata with extra data coming from elsewhere, plugins are the right place to do it. Doculibre created this mechanism so that you can easily customize document manipulation (at the pre-indexing stage) without the hassle of recompiling Constellio. You can find examples of such plugins here.
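
To make the pre-indexing pipeline more concrete, here is a minimal Java sketch of the idea: plugins are applied to each record in order, and only then is the record pushed on. All names below (`RecordSketch`, `PreIndexingPlugin`, `IndexingManagerSketch`) are hypothetical and chosen for illustration; they are not the actual Constellio plugin API, and the push to Solr is replaced by an in-memory list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record type: NOT the Constellio class, just enough fields for the sketch.
final class RecordSketch {
    final String url;
    final Map<String, String> metadata = new LinkedHashMap<>();
    String content;

    RecordSketch(String url, String content) {
        this.url = url;
        this.content = content;
    }
}

// Hypothetical plugin contract: a plugin may modify the record in place.
interface PreIndexingPlugin {
    void process(RecordSketch record);
}

// Hypothetical indexing manager: runs every plugin, then "pushes" the record.
final class IndexingManagerSketch {
    private final List<PreIndexingPlugin> plugins = new ArrayList<>();
    private final List<RecordSketch> pushed = new ArrayList<>();

    void register(PreIndexingPlugin plugin) {
        plugins.add(plugin);
    }

    void index(RecordSketch record) {
        for (PreIndexingPlugin plugin : plugins) {
            plugin.process(record); // plugins run in registration order
        }
        pushed.add(record); // stand-in for sending the document to Solr
    }

    List<RecordSketch> pushed() {
        return pushed;
    }
}

final class PipelineDemo {
    public static void main(String[] args) {
        IndexingManagerSketch manager = new IndexingManagerSketch();
        // Rename based on business knowledge: use the last path segment as the title.
        manager.register(r -> r.metadata.put("title", r.url.substring(r.url.lastIndexOf('/') + 1)));
        // Enrich with metadata coming from elsewhere (hard-coded here for the sketch).
        manager.register(r -> r.metadata.put("department", "finance"));

        RecordSketch record = new RecordSketch("http://intranet.example/docs/budget.pdf", "Budget 2012");
        manager.index(record);
        System.out.println(record.metadata); // prints {title=budget.pdf, department=finance}
    }
}
```

The point of the design is that each plugin only sees a record and never talks to Solr itself, which is what lets plugins be added or removed without touching the rest of the indexing code.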

Once all the plugins have been triggered, the Indexing Manager pushes the documents to Solr (version 3.4). The remainder of the indexing process on the Solr side is standard Solr behavior, so refer to the Solr documentation for further details. Constellio uses some default field names, which you can use as they are or change. The most useful ones are:

doc_defaultSearchField: the copyField destination of all the other fields. By default, Constellio searches on this field.
doc_uniqueKey: the unique identifier of the document.
doc_parsedContent: the parsed content of the document. By default, Constellio highlights on this field.
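
To illustrate what "searches on" and "highlights on" mean in Solr terms, here is a small Java sketch that builds a Solr select URL using the three field names above together with the standard Solr parameters `q`, `df`, `fl`, `hl` and `hl.fl`. The class name `SolrQuerySketch`, the base URL and the choice of parameters are assumptions made for illustration; the requests Constellio actually sends to Solr are not shown in this entry.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: builds a query URL by hand, without any Solr client library.
final class SolrQuerySketch {

    static String selectUrl(String solrBaseUrl, String userQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", userQuery);
        params.put("df", "doc_defaultSearchField"); // default field to search on
        params.put("fl", "doc_uniqueKey");          // return only the document identifier
        params.put("hl", "true");                   // turn highlighting on
        params.put("hl.fl", "doc_parsedContent");   // field to build highlights from

        StringJoiner queryString = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            queryString.add(encode(e.getKey()) + "=" + encode(e.getValue()));
        }
        return solrBaseUrl + "/select?" + queryString;
    }

    private static String encode(String value) {
        // URLEncoder.encode(String, Charset) requires Java 10 or later.
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The base URL is an assumption; adjust it to your own Solr instance.
        System.out.println(selectUrl("http://localhost:8983/solr", "annual report"));
    }
}
```

If you rename the default fields in your schema, the values passed to `df` and `hl.fl` in such a query would have to change accordingly.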

The records stored in the Constellio database are persisted there, which makes it possible to change the schema configuration and reindex without having to recrawl the content. The local DB is also used to generate the search results, but we'll talk about that in part 3, which will address the search phase.
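
The benefit of persisting records can be sketched in a few lines of Java: once records are stored, reindexing is just a replay of the stored table through a (possibly changed) field mapping, with no new crawl. The names `StoredRecord` and `RecordStoreSketch` are hypothetical, the table is an in-memory map rather than a database, and the "schema" is reduced to a mapping function; this is an illustration of the idea, not Constellio's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical persisted record: what the crawl produced, kept for later reuse.
final class StoredRecord {
    final String uniqueKey;
    final String parsedContent;

    StoredRecord(String uniqueKey, String parsedContent) {
        this.uniqueKey = uniqueKey;
        this.parsedContent = parsedContent;
    }
}

// Hypothetical stand-in for the records table.
final class RecordStoreSketch {
    private final Map<String, StoredRecord> table = new LinkedHashMap<>();
    private int feedCount = 0;

    // Called when crawled data arrives; this is the only place a crawl is involved.
    void feed(StoredRecord record) {
        feedCount++;
        table.put(record.uniqueKey, record);
    }

    // Rebuilds index documents from the stored records using the given mapping.
    List<Map<String, String>> reindex(Function<StoredRecord, Map<String, String>> schemaMapping) {
        List<Map<String, String>> documents = new ArrayList<>();
        for (StoredRecord record : table.values()) {
            documents.add(schemaMapping.apply(record));
        }
        return documents;
    }

    int feedCount() {
        return feedCount;
    }
}
```

Calling `reindex` a second time with a different mapping produces a new set of documents while `feedCount()` stays unchanged, which is the "reindex without recrawl" property in miniature.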