At work, I recently had a need to put in place a scalable logging solution based around the ELK stack.

Issues with multicast networking aside, Elasticsearch scales pretty well on its own without much additional overhead; however, discovering whether a node is online and connecting only to the nodes that are actually available can be tricky.

Scaling Logstash can be tricky, but it basically involves adding more Logstash servers to the mix and pointing them at your Elasticsearch cluster by defining multiple hosts in your Logstash configuration.

Kibana (like most web applications) can only have one Elasticsearch host defined in the config, so scaling out Kibana is more difficult.

The above raises the question: how do I know which Elasticsearch node to point my configuration at if I can’t be sure which nodes are actually up?

The answer came in the form of consul.io.  If you’ve not looked at Consul before then I can highly recommend it as a good way of providing rudimentary cluster status and service advertising with very little overhead.

The concept behind Consul is that you have a daemon running on each of your servers that talks (via a gossip protocol) to the other hosts on your network.  You have a Consul server that all the nodes are aware of, and once they have seen each other via this central server they can talk to each other directly over the network.
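To give a rough idea of what that looks like in practice, here’s a sketch using the standard Consul agent flags; the data directory and IP addresses are made up purely for illustration:

# On the Consul server
consul agent -server -bootstrap-expect 1 -data-dir /var/consul -bind 10.0.0.1

# On every other node, pointing at the server so it can join the cluster
consul agent -data-dir /var/consul -bind 10.0.0.5 -join 10.0.0.1

Once an agent has joined, the gossip protocol takes care of keeping everyone’s view of the cluster up to date.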

The Consul daemon itself acts as a local DNS resolver for the datacentre and both advertises and load-balances (via DNS Round Robin) services across the appropriate hosts.  As an example, let’s suppose we have three Elasticsearch Nodes already configured in an Elasticsearch Cluster.  We put the following configuration into /etc/consul.d/elasticsearch.service.json:

{
  "service" : {
    "name" : "elasticsearch",
    "tags" : [
      "es"
    ],
    "port" : 9200
  }
}

and we restart Consul. Within seconds, we can access our Elasticsearch servers via elasticsearch.service.consul, and Consul will set up a health check and serve a random IP address from the nodes that are available in the cluster every time we query that record.  Cool, huh?
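Out of the box that check is essentially “is the Consul agent on this node alive?”; if you want something more meaningful, Consul lets you attach a check stanza to the service definition. The following is just a sketch, using Consul’s HTTP check support against Elasticsearch’s standard cluster health endpoint, and the 10s interval is an arbitrary choice:

{
  "service" : {
    "name" : "elasticsearch",
    "tags" : [
      "es"
    ],
    "port" : 9200,
    "check" : {
      "http" : "http://localhost:9200/_cluster/health",
      "interval" : "10s"
    }
  }
}

You can also see exactly what Consul is serving at any point by querying its DNS interface directly, e.g. dig @127.0.0.1 -p 8600 elasticsearch.service.consul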

Now that we have a single DNS entry that we can point to, we need to configure our servers to talk to consul as well as the rest of the world.  Under Linux, this is very much a solved problem:

  • Install DNSMasq
  • Configure DNSMasq to talk to Consul by adding "server=/consul/127.0.0.1#8600" to your DNSMasq config
  • Add "127.0.0.1" as a valid nameserver and "consul" as a valid search domain to /etc/resolv.conf (see the sketch below)
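For reference, here’s a minimal sketch of what those two files end up looking like; the dnsmasq.d filename is arbitrary and assumes a Debian-style layout:

# /etc/dnsmasq.d/10-consul
server=/consul/127.0.0.1#8600

# /etc/resolv.conf (keep your existing nameserver lines after the local one)
nameserver 127.0.0.1
search consul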

I use Ansible to install and configure all of these parts of the infrastructure; however, there’s nothing to stop you from using Puppet/Chef/CFEngine/Whatever to achieve the same goal.
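For the curious, the Ansible side of it is nothing clever; a rough sketch of the sort of tasks involved looks like this (standard apt and copy modules, the filename is just the one used above, and a "restart dnsmasq" handler is assumed to exist elsewhere in the play):

# Install DNSMasq and point it at the local Consul agent
- name: Install DNSMasq
  apt:
    name: dnsmasq
    state: present

- name: Configure DNSMasq to forward *.consul to Consul
  copy:
    content: "server=/consul/127.0.0.1#8600\n"
    dest: /etc/dnsmasq.d/10-consul
  notify: restart dnsmasq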

We can now configure Logstash to point at our Elasticsearch cluster as easily as this:


output {
  elasticsearch {
    host => "elasticsearch.service.consul"
    protocol => "http"
    cluster => "<ELASTICSEARCH CLUSTER NAME>"
  }
}

Now Logstash will only connect to Elasticsearch nodes that are online and serving data.
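Before restarting Logstash it’s worth a quick sanity check that the DNS wiring is in place; assuming dig and curl are installed on the Logstash host, something like this will do:

# Should return the IP of one of the healthy Elasticsearch nodes
dig +short elasticsearch.service.consul

# Should return the cluster health from whichever node Consul hands out
curl "http://elasticsearch.service.consul:9200/_cluster/health?pretty"

If both of those come back with sensible answers, the round-robin is working and Logstash will be handed a healthy node each time it connects.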

Configuring Kibana is just as easy: simply set the relevant parts of kibana.yml to:

elasticsearch_url: "http://elasticsearch.service.consul:9200"
elasticsearch_preserve_host: true

So, that’s Logstash and Kibana feeding in to Elasticsearch via Consul, what’s next?

Well, as we’ve seen, Consul can advertise pretty much any service we want to, so let’s be clever about our Logging.

Over the years, I’ve spent months painfully designing and implementing centralised logging clusters, and the big problem is that you usually need a central logging server through which all traffic must flow.  This creates a Single Point Of Failure (SPOF) in your network, because clustering rsyslog and getting servers to only send messages to log servers that are online is hard!

Consul abstracts this issue away for us (and I’m sure by now that most of you have seen where this is going!).

On the servers that are running Logstash, we advertise this service to the other nodes in our datacentre via Consul:

{
  "service" : {
    "name" : "logs",
    "tags" : [
      "logging",
      "syslog"
    ],
    "port" : 1514
  }
}

Place the above into /etc/consul.d/logstash.service.json, restart Consul, and one of your available Logstash nodes will now be chosen at random for you whenever you query logs.service.consul.

Using rsyslog? Great! Just add a catch-all configuration to your rsyslog config as follows (the double @@ tells rsyslog to forward over TCP rather than UDP):

*.* @@logs.service.consul:1514

Now all of your system logs are going to a known-working Logstash server, being load-balanced via Consul’s rrDNS, and then on to a known-working Elasticsearch node, again via Consul’s rrDNS.

I’m sure many of you will have realised the massive advantage this now gives us when scaling: we add a new node of a given type to Consul and, as long as the health check passes, it is automatically available to all the other nodes in our network.

Consul also allows you to set a datacentre as part of the configuration, so if you were hosting across a mix of Telehouse, AWS and Rackspace, you could configure your servers to send logs to logs.service.<datacentre>.consul and then have a logstash forwarder in each DC send that data on to a centralised Elasticsearch Cluster.
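As a sketch of what I mean (the datacentre name below is made up), you set the datacenter field in each Consul agent’s configuration and then point rsyslog at that datacentre’s pool explicitly:

# Fragment of the Consul agent configuration in one DC
{
  "datacenter" : "aws-eu-west"
}

# rsyslog on hosts that should ship to that DC's Logstash pool
*.* @@logs.service.aws-eu-west.consul:1514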

I’m still working with these technologies and trying to get the best out of them, so please leave your thoughts in the comments below or ping me on Twitter (@proffalken) if you have any ideas on how to improve this model.

P.S. Although Consul works on Windows, I’ve not tried it yet.  That’s next on the list as we currently have a requirement to ship Windows logs to ELK and I’m starting to think that Consul is the best way to ensure we get a “known good” server when we try to send events.

NOTE: Someone asked on Twitter whether you could use Consul for Elasticsearch autodiscovery in place of multicast. At first I figured “sure, why not?”, but I now realise that this won’t work.

The problem lies in the fact that Consul returns a random IP address via DNS for every query, so every time Elasticsearch tried to find the other members of the cluster it would get a different IP address and see only one of the nodes it could pair with.  This would cause it to constantly hold elections as different nodes appeared and then disappeared with every request.

In short, it’s a nice idea but the more you think about it, the less it works! 🙂