Configure a CAS server and CAS management webapp with Docker

Setting up a CAS server on Docker is not a smooth process, and the official documentation is not very explicit about it.

We decided to write this post as a complete tutorial, to help others quickly configure a CAS server.

Warning: this post covers deploying a TESTING CAS server. This configuration is not meant for production, in particular because it authorizes any application!

First, we would like to mention the very good articles on this site, which served as our starting point:

https://fawnoos.com/2022/05/31/cas65x-docker-deployment/
https://fawnoos.com/2021/02/04/cas63-management-webapp/

We were previously using the demo CAS server available here: https://casserver.herokuapp.com/cas but for some time now it has refused unauthorized applications, so it can no longer be used with arbitrary applications. That is why we needed our own CAS server.

For our tutorial, we used a vanilla DigitalOcean instance running Debian 12 with 16 GB of RAM.

  • Requirements :
    • Java 11
    • Docker
    • A real SSL certificate on the server. Without one, we could not get a functional environment (we used LetsEncrypt in this example)
    • jq installed
      See the annexes below for indications on installing these dependencies
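Before going further, a quick sanity check of the requirements can save time. This is a minimal sketch; adapt the list of tools to your setup:

```shell
# Report any required tool that is missing from the PATH.
for cmd in java docker jq openssl keytool; do
  command -v "$cmd" >/dev/null 2>&1 || echo "Missing: $cmd"
done
```

If nothing is printed, all the listed tools are available.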
  1. Installation of the CAS server
  • Create a keystore on the server with the SSL certificate generated

We assume that the certificate and the key were issued by LetsEncrypt and are located in /etc/letsencrypt/live/$DOMAIN_NAME

Replace $DOMAIN_NAME with the name of your domain; in our example it is castest.datafari.com

export DOMAIN_NAME=castest.datafari.com
openssl pkcs12 -export -in /etc/letsencrypt/live/$DOMAIN_NAME/fullchain.pem -inkey /etc/letsencrypt/live/$DOMAIN_NAME/privkey.pem -out letsencrypt.p12

When the script asks you for a password, enter ‘changeit’.

With the last command, we created a keystore in PKCS12 format. We now need to convert it to JKS format.

keytool -importkeystore -srckeystore letsencrypt.p12 -srcstoretype PKCS12 -destkeystore letsencrypt.jks -deststoretype JKS

When the script asks for the destination and source passwords, enter ‘changeit’ for both.
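Optionally, you can verify that the conversion worked by listing the entries of the JKS keystore (the password is the ‘changeit’ chosen above):

```shell
# List the certificate entries in the converted keystore.
keytool -list -keystore letsencrypt.jks -storepass changeit
```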

We can now run the CAS server with Docker.

Create a directory for CAS : here /var/work/cas

mkdir -p /var/work/cas

Copy the JKS keystore to this folder :

cp /root/letsencrypt.jks /var/work/cas

Rename it to ‘thekeystore’ and change its permissions (just in case) :

mv /var/work/cas/letsencrypt.jks /var/work/cas/thekeystore
chmod 777 /var/work/cas/thekeystore

Before launching the CAS server, we can adjust a few settings. See https://fawnoos.com/2022/05/31/cas65x-docker-deployment/#container-configuration for more information.

“Adjust the CAS root logging level to debug so we can get more details from the running CAS web application.
Rename the CAS SSO cookie to SSO_COOKIE.
Allow the service registry instance to initialize and bootstrap itself from the embedded JSON files that ship with CAS.
Enable the schedule for the service registry loader”

Basically, with this configuration we get more verbose logs and our CAS server will authorize all applications (note that we disable the scheduled reloading of the service registry).

Enter these commands :

properties='{
  "logging": {
    "level": {
      "org.apereo.cas": "debug"
    }
  },
  "cas": {
    "tgc": {
      "name": "SSO_COOKIE"
    },
    "service-registry": {
      "core": {
        "init-from-json": true
      },
      "schedule": {
        "enabled": false
      }
    }
  }
}'
properties=$(echo "$properties" | tr -d '[:space:]')
echo -e "***************************\nCAS properties\n***************************"
echo "${properties}" | jq

We can now pass these properties to the container through the SPRING_APPLICATION_JSON environment variable.
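As a side note, Spring Boot parses SPRING_APPLICATION_JSON exactly as if the equivalent flat properties had been set. If you want to see which flat keys a JSON snippet corresponds to, you can flatten it with jq (a small illustration, not needed for the installation):

```shell
# Flatten a JSON configuration snippet into Spring-style property keys.
snippet='{"cas":{"tgc":{"name":"SSO_COOKIE"}}}'
echo "$snippet" | jq -r 'paths(scalars) as $p | "\($p | join("."))=\(getpath($p))"'
# prints: cas.tgc.name=SSO_COOKIE
```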

We can now launch the CAS server. We add a bind mount with the keystore we just created:

export CAS_KEYSTORE=/var/work/cas/thekeystore
docker run --rm -d \
  --mount type=bind,source="${CAS_KEYSTORE}",target=/etc/cas/thekeystore \
  -e SPRING_APPLICATION_JSON="${properties}" \
  -p 8444:8443 --name casserver apereo/cas:6.5.

After some time, the CAS server can be reached at this URL :

https://$DOMAIN_NAME:8444/cas/login

so in our example it would be:

https://castest.datafari.com:8444/cas/login
[Screenshot: CAS login UI]

The default credentials are :

user: casuser
password: Mellon
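You can also check from the command line that the server is up (assuming DOMAIN_NAME is still exported from earlier); a 200 status code means the login page is being served:

```shell
# Probe the CAS login page; -k skips certificate verification, handy while testing.
curl -sk -o /dev/null -w '%{http_code}\n' "https://${DOMAIN_NAME}:8444/cas/login"
```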

We can now install the CAS management webapp.

2. Installation of the CAS management webapp

Clone the code from the GitHub project CAS Management Overlay

Here we clone it into /var/work/cas :

cd /var/work/cas
git clone https://github.com/apereo/cas-management-overlay.git

We want to check out the code of the 6.5 branch :

cd cas-management-overlay
git checkout 6.5

Copy the keystore into the project :

cp /var/work/cas/thekeystore /var/work/cas/cas-management-overlay/etc/cas/thekeystore

Edit the management.properties file located in etc/cas/config :

nano /var/work/cas/cas-management-overlay/etc/cas/config/management.properties

cas.server.name=https://$DOMAIN_NAME:8444
cas.server.prefix=${cas.server.name}/cas

mgmt.server-name=https://$DOMAIN_NAME:8443
mgmt.admin-roles[0]=ROLE_ADMIN
mgmt.user-properties-file=file:/etc/cas/config/users.json

logging.config=file:/etc/cas/config/log4j2-management.xml

Edit the cas.server.name and mgmt.server-name properties, replacing $DOMAIN_NAME with your own domain. Here is the file with our example domain :

cas.server.name=https://castest.datafari.com:8444
cas.server.prefix=${cas.server.name}/cas

mgmt.server-name=https://castest.datafari.com:8443
mgmt.admin-roles[0]=ROLE_ADMIN
mgmt.user-properties-file=file:/etc/cas/config/users.json

logging.config=file:/etc/cas/config/log4j2-management.xml
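Instead of editing by hand, you can also generate the whole file in one go with a heredoc (assuming DOMAIN_NAME is still exported from earlier; the backslash keeps the ${cas.server.name} placeholder literal for Spring):

```shell
# Write management.properties with the domain substituted in.
CONFIG=/var/work/cas/cas-management-overlay/etc/cas/config/management.properties
cat > "$CONFIG" <<EOF
cas.server.name=https://${DOMAIN_NAME}:8444
cas.server.prefix=\${cas.server.name}/cas

mgmt.server-name=https://${DOMAIN_NAME}:8443
mgmt.admin-roles[0]=ROLE_ADMIN
mgmt.user-properties-file=file:/etc/cas/config/users.json

logging.config=file:/etc/cas/config/log4j2-management.xml
EOF
```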

Build the project with Docker :

cd /var/work/cas/cas-management-overlay
chmod +x *.sh
./docker-build.sh

When the build is finished, you can launch the container :

./docker-run.sh

The CAS management page can be found at this URL :

https://$DOMAIN_NAME:8443/cas-management

In our example the URL is :

https://castest.datafari.com:8443/cas-management
[Screenshot: CAS management UI]

With this test configuration our CAS server will authorize all applications.

ANNEXES

  • Java installation
apt-get install -y wget apt-transport-https gnupg
wget -O - https://packages.adoptium.net/artifactory/api/gpg/key/public | apt-key add -
echo "deb https://packages.adoptium.net/artifactory/deb $(awk -F= '/^VERSION_CODENAME/{print$2}' /etc/os-release) main" | tee /etc/apt/sources.list.d/adoptium.list
apt-get update
apt-get install -y temurin-11-jdk
  • Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh ./get-docker.sh
  • jq
apt-get install -y jq

Bye bye DIH – Hello Datafari

Replacing DIH with ManifoldCF easily with Datafari

So you were using DIH with Solr, you are worried that it may no longer be actively maintained, and you are having trouble finding a replacement or an alternative? We propose here a replacement that relies on Apache ManifoldCF and Datafari, projects that have been actively maintained and updated for several years now.

Datafari is an open source Enterprise Search solution that, among other things, embeds Apache ManifoldCF and Apache Solr. As such, by installing it you are just a few scripts away from having a fully functional DB crawler that fetches the data and sends it to Apache Solr, which is exactly what DIH was doing! As a bonus, ManifoldCF can do much more, as it offers plenty of connectors for different sources, and graphical capabilities to configure your crawling (SLAs, time windows, data processing…).

So hop in, and take a look at our DIH replacement tutorial on the Datafari wiki.

Using Datafari to extract text for academic research on NLU and NLP

Extracting raw text for Natural Language Understanding (NLU) or Natural Language Processing (NLP) is often a boring and time-consuming task. Any student or researcher who has already had to prepare a pipeline for that knows what we are talking about: first, assess the available open source technologies (very often Apache Tika), then understand how they work, put documents in a folder, and make it all work by trial and error, probably through a Python script.

This is what we had in mind when preparing documentation on how to use Datafari Community Edition for just that. After all, Datafari is an enterprise search solution, which means it encompasses these tasks as part of its overall mission to index documents and let users search through them.

With the documentation we provide, researchers will be able to set up a fully operational pipeline that watches a specific shared folder, extracts the text (via Apache Tika), and outputs it to a dedicated folder. With a bit more motivation, researchers can go beyond the fileshare and use other connectors, as the pipeline can work with any data source.

Discover now how to extract text from any document thanks to Datafari.

Tutorial – Deploying Solrcloud 8 on Amazon EC2

In this tutorial, we will be setting up a Solrcloud cluster on Amazon EC2.
We’ll be using Solr 8.6.2, Zookeeper 3.5.7 on Debian 10 instances.
This tutorial explains step by step how to reach this objective.

We will be installing a set of 3 machines, with 3 shards per server, which gives us a total of 9 shards. The replication factor is 3.
We will also be installing a Zookeeper ensemble of 3 machines.

This architecture will be flexible enough to allow for a fail-over of one or two machines, depending on whether we are at the indexing phase or at the querying phase:

  • Indexing: a machine can fail without impacting the cluster (the zookeeper ensemble of 3 machines allows for one machine down). The updates are successfully broadcasted to the machines still running.
  • Querying: two machines can fail without impacting the cluster. Since each machine hosts 3 shards, a search query can be processed without problems, the only constraints being a slower response time due to the higher load on the remaining machine.
Continue reading

Entity Extraction Using the Tagger Handler (aka SolrTextTagger)

With its release 7.4, the Solr team integrated SolrTextTagger into the core of Solr. This tool, which had been maintained separately for years (https://github.com/OpenSextant/SolrTextTagger), is now packed into Solr and ready to use through a dedicated handler. In this blog we will first step you through the configuration needed to set it up. The steps are presented in Solr’s documentation (https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html), but we will repeat them here for the sake of completeness. We will then present ideas on how to use the tagger in your indexing and search pipeline to enhance the search experience of your users.

How does the tagger work?

The tagger handler relies on a dedicated collection in which it stores the entities to be extracted. In this collection, one field is used to store the texts used to recognize each entity, and you may create as many other fields as you want to store other useful information about your entities.
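As a purely illustrative sketch (the collection name "entities" and field "name" are assumptions; see the Solr documentation for the exact setup), tagging boils down to POSTing raw text to the handler and reading the matched entity documents back:

```shell
# Hypothetical example: POST raw text to the tagger endpoint of an "entities"
# collection and extract the names of the matched entities with jq.
curl -s -X POST \
  'http://localhost:8983/solr/entities/tag?overlaps=NO_SUB&fl=id,name&wt=json' \
  -H 'Content-Type: text/plain' \
  -d 'A meeting with Alice Smith in Paris' | jq -r '.response.docs[].name'
```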

Continue reading

Entity Extraction in Datafari

In this tutorial, we will demonstrate how to do basic entity extraction in Datafari Community. This post is inspired by https://lucidworks.com/2013/06/27/poor-mans-entity-extraction-with-solr/

Note that for Datafari Enterprise, all the configuration is already done. You just need to add your custom rules in a specific UI, and for more advanced functionality, Datafari Enterprise allows you to benefit from SolrTextTagger and 3rd-party semantic entity extractors.

We want to extract 3 kinds of entities from our dataset (files from the Enron dataset in this example) :

  • Persons
  • Phone numbers
  • Whether the document is a resume

Continue reading

How to upgrade a SolrCloud cluster – Tutorial

Let’s say that we have a SolrCloud cluster using Solr 4.X. Now we want to upgrade our cluster to a modern Solr version such as Solr 6.X. How can we do it?

Well, there are many ways to do it. The cleanest is to install the new version of Solr directly, adapt the configuration files, and reindex all the data. But in production, this is often not acceptable.

In this tutorial, we will upgrade in two steps: from Solr 4 to Solr 5, and then from Solr 5 to Solr 6. It is not possible to upgrade directly from Solr 4 to Solr 6, i.e. across two major versions, because the index format changes and Solr can only read an index created by the previous major version.

Continue reading

Tutorial – Deploying Solrcloud 6 on Amazon EC2

UPDATE: This tutorial is based on Solr 6. If you want to use Solr 8, we strongly recommend using our more recent blog entry to set up Solrcloud 8 on Amazon EC2

In this tutorial, we will be setting up a Solrcloud cluster on Amazon EC2.
We’ll be using Solr 6.6.0, Zookeeper 3.4.6 on Debian 8 instances.
This tutorial explains step by step how to reach this objective.

We will be installing a set of 3 machines, with 3 shards per server, which gives us a total of 9 shards. The replication factor is 3.
We will also be installing a Zookeeper ensemble of 3 machines.

This architecture will be flexible enough to allow for a fail-over of one or two machines, depending on whether we are at the indexing phase or at the querying phase:

  • Indexing: a machine can fail without impacting the cluster (the zookeeper ensemble of 3 machines allows for one machine down). The updates are successfully broadcasted to the machines still running.
  • Querying: two machines can fail without impacting the cluster. Since each machine hosts 3 shards, a search query can be processed without problems, the only constraints being a slower response time due to the higher load on the remaining machine.

Continue reading

Generating big data sets for search engines

NOTE: This is the English version. You will find the French version further down in this article.

When proposing our search expertise, we are often asked to run performance evaluations on large datasets, for instance in Proofs of Concept. For a recent customer request, in order to save time and avoid using sensitive customer data, we used log-synth, a random data generator developed by Ted Dunning. We describe here how to use log-synth to generate a 100,000-line data set.

The first step, which we don’t document here, is to download log-synth, unzip it, and build it with Maven.

Continue reading

Tutorial – Deploying Solrcloud 5 on Amazon EC2

UPDATE: This tutorial is based on Solr 5. If you want to use Solr 8, we strongly recommend using our more recent blog entry to set up Solrcloud 8 on Amazon EC2

NOTE: There is a French version to this tutorial, which you’ll find on the second half of this blog entry.

In this tutorial, we’ll be setting up a Solrcloud cluster on Amazon EC2.
We’ll be using Solr 5.1, the embedded Jetty, Zookeeper 3.4.6 on Debian 7 instances.
This tutorial explains step by step how to reach this objective.

We’ll be installing a set of 3 machines, with 3 shards and 2 replicas per shard, which gives us a total of 9 shards.
We’ll also be installing a Zookeeper ensemble of 3 machines.

This architecture will be flexible enough to allow for a fail-over of one or two machines, depending on whether we’re at the indexing phase or at the querying phase:

  • Indexing: a machine can fail without impacting the cluster (the zookeeper ensemble of 3 machines allows for one machine down). The updates are successfully broadcasted to the machines still running.
  • Querying: two machines can fail without impacting the cluster. Since each machine hosts 3 shards, a search query can be processed without problems, the only constraints being a slower response time due to the higher load on the remaining machine.

Continue reading