Exploring logging strategies with the Elastic Stack

#devops #elasticsearch #logging #observability

Written by Guido Lena Cota
Originally published on June 9th 2021

Design options and trade-offs to consider when developing a log ingestion strategy with the Elastic stack

Imagine your team is responsible for business-critical IT services. Say you operate the shared DevOps platform that enables developers in your company to roll out new features and hotfixes. Think of the version control system, CI/CD tools, deployment environments, and so on. These services can run on multiple environments, and their number and type can change over time (e.g., by adding contract testing integration to your CI/CDs). Part of your daily job is to keep a close eye on these services, monitoring their performance, detecting anomalies, and doing a speedy root cause analysis when everything’s on fire. The Elastic Stack offers a unifying solution for all these activities. As illustrated below, Beats modules can collect logs and metrics on the hosts and ship them to Logstash where they are parsed, sanitized and enriched. Then, Logstash indexes (persists) the processed data in the Elasticsearch search and analytics engine, which your team can query using the Kibana UI.

So, that's it, right? What can go wrong? Plenty. In a previous series of blog posts, we have already talked about common issues - and possible solutions - when analyzing logs in Kibana. Issues like dealing with missing log entries, bad search performance and usability. This time, I'd like to step back in the ingestion process and address related, but distinct challenges, focusing on how we create and configure the Elasticsearch indices for our logs. We will get into the shoes of the DevOps team in our use case scenario and uncover the trade-offs between resource efficiency, reliability, and maintainability when choosing a log indexing strategy. For example, how can we enforce different retention policies for dozens of log sources without getting into configuration hell? How flexible is our strategy to accommodate new log sources with the least amount of effort? What number and size of indices will optimize ingestion throughput, disk utilisation and analysis performance?

This blog post will present two log indexing strategies and evaluate them based on maintainability and efficiency criteria. As we will see, neither of these strategies will be entirely satisfactory, but they will build the case for a third log indexing strategy… But I'll leave that as a cliffhanger for a second, concluding blog post.

Chasing the right log indexing strategy

We chose the Elastic Stack to centralize logging of a heterogeneous group of services running on multiple environments. We set up Filebeat shippers to send log streams over to a Logstash server, where we configured pipelines to normalize the diverse log formats into a more consistent structure that makes correlation easier (see the Elastic Common Schema for inspiration). In particular, every processed log event will have a field for the service name and another for the environment. Finally, the Logstash pipelines use the Elasticsearch output plugin to ship the processed logs as JSON documents to the specified indices in the Elasticsearch cluster.

In the context of this blog post, a log indexing strategy is defined by (i) an index name pattern, (ii) an alias for all the indices matching the same name pattern (also called backing indices), and (iii) a log rotation policy. A quality strategy is easy to implement and maintain, adapts to changes in the service landscape, and uses the Elasticsearch resources (disk, CPU, memory) efficiently.

S1: Group by environment - rotate by age and size

The core idea is to group services by environment and write their logs to the same index alias. Have a look at the diagram below. On the left, we see the JSON documents transformed by Logstash from Jenkins and Nginx production logs. Note that these documents have some common fields, notably, host.env, the name of the environment where the service is running. A representation of the target index alias (“logs-prod-2021”) and its backing indices is on the diagram's right. When you write to an alias, you write to the only backing index that allows writes, to avoid inconsistencies and duplication. Conversely, when you read from an alias, you read from all the backing indices.

In our proposal, the index name pattern for the backing indices has four parts separated by a dash (“-”): a prefix to describe the domain (“logs”), the name of the environment as set in the incoming documents, the current year to make the temporal dimension explicit, and, finally, the zero-padded incremental number that enacts the log rotation. The resulting names are, for example, logs-prod-2021-000001, logs-prod-2020-002035, logs-stg-2021-000048.
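To make the naming scheme concrete, here is a minimal Python sketch that assembles a backing index name from its four parts. The helper `backing_index_name` and its parameters are names I made up for illustration; they are not part of Elasticsearch or Logstash.

```python
def backing_index_name(env: str, year: int, sequence: int, prefix: str = "logs") -> str:
    """Build a name such as logs-prod-2021-000001 (hypothetical helper)."""
    if sequence < 1:
        raise ValueError("sequence numbers start at 1")
    # Four dash-separated parts: domain prefix, environment, year,
    # and the rotation counter zero-padded to six digits.
    return f"{prefix}-{env}-{year}-{sequence:06d}"


print(backing_index_name("prod", 2021, 1))   # logs-prod-2021-000001
print(backing_index_name("stg", 2021, 48))   # logs-stg-2021-000048
```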

The last piece of information required for a log indexing strategy is the rotation policy. Elasticsearch offers a feature called rollover that creates a new backing index for a target alias when specified conditions are met. The available rollover conditions are based on index age, size, number of documents, or a combination of these three. The name of the new backing index is a unitary increment of the last part of the name pattern - e.g., logs-prod-2021-000001 -> logs-prod-2021-000002. To automate rollover, we can create an Index Lifecycle Management (ILM) policy to tell Elasticsearch what actions to perform at every phase of the index lifecycle. Check out the Tutorial: Automate rollover with ILM on the Elastic documentation for more details.
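The naming side of rollover can be sketched in a few lines of Python. This only mimics the "increment the last part" rule described above; the function `next_backing_index` is a hypothetical helper, and Elasticsearch performs this step itself when a rollover happens.

```python
import re


def next_backing_index(name: str) -> str:
    """Return the name that would follow `name` after one rollover (illustrative only)."""
    match = re.fullmatch(r"(.*-)(\d+)", name)
    if match is None:
        raise ValueError(f"{name!r} does not end with a dash and a number")
    stem, counter = match.groups()
    # Increment the counter and keep the original zero-padded width.
    return f"{stem}{int(counter) + 1:0{len(counter)}d}"


print(next_backing_index("logs-prod-2021-000001"))  # logs-prod-2021-000002
```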

Coming back to our first log indexing strategy, we want to rotate log indices when they either:

become older than three months, because they are no longer actively searched and can be moved to less performant nodes, or
grow bigger than 50GB.
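The "either/or" logic of these two conditions can be sketched as follows. This is a simplified model, not how Elasticsearch evaluates rollover internally: I assume three months can be approximated as 90 days, and that 50GB is counted in binary units (1024³ bytes per GB). The names `MAX_AGE`, `MAX_SIZE_BYTES` and `should_rollover` are hypothetical.

```python
from datetime import timedelta

MAX_AGE = timedelta(days=90)       # assumption: "three months" taken as 90 days
MAX_SIZE_BYTES = 50 * 1024 ** 3    # assumption: 50GB in binary units


def should_rollover(age: timedelta, size_bytes: int) -> bool:
    """Rotate as soon as EITHER condition is reached, whichever comes first."""
    return age >= MAX_AGE or size_bytes >= MAX_SIZE_BYTES


# A young index that grew quickly rotates because of its size ...
print(should_rollover(timedelta(days=5), 60 * 1024 ** 3))
# ... and a small index rotates once it is old enough.
print(should_rollover(timedelta(days=120), 10 * 1024 ** 2))
```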

Implementation Details

The steps to apply the indexing strategy S1 are as follows:

Create an ILM policy (“ilm_s1”) in Elasticsearch to specify the desired rollover conditions.

PUT es_hostname:9200/_ilm/policy/ilm_s1 
{ 
  "policy": { 
    "phases": { 
      "hot": {                                 
        "actions": { 
          "rollover": { 
            "max_primary_shard_size": "50GB",  
            "max_age": "90d" 
          } 
        } 
      } 
    } 
  } 
}

Configure the Elasticsearch output plugin of the Logstash pipelines to (i) apply the ILM policy above, (ii) define the index alias and (iii) index name pattern. For example, for production logs:

elasticsearch { 
  hosts => ["es_hostname"] 
  ... 
  ilm_rollover_alias => "logs-prod" 
  ilm_pattern => "{now/y{yyyy}}-000001" 
  ilm_policy => "ilm_s1" 
}

Because ilm_rollover_alias does not support dynamic variable substitution (i.e., the ability to set the dynamic value of a field such as host.env into a string template), we must create one output configuration for each known environment and control their application with IF-ELSE statements.

output { 
  if [host.env] == "prod" {   
    elasticsearch { 
      ilm_rollover_alias => "logs-prod" 
      ... 
    } 
  } else if [host.env] == "stg" {   
    elasticsearch { 
      ilm_rollover_alias => "logs-stg" 
      ... 
    } 
  } else ... 
}

Not ideal, right?
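One way to soften the pain is to generate the repetitive output section from a list of environments instead of writing it by hand. The Python sketch below is only an illustration of that idea: `render_output_section` is a hypothetical helper, it emits just the three settings shown, and any other plugin options from your pipeline would still have to be added.

```python
def render_output_section(environments, es_host="es_hostname", policy="ilm_s1"):
    """Render a Logstash output section with one branch per environment."""
    if not environments:
        raise ValueError("at least one environment is required")
    branches = []
    for env in environments:
        # One elasticsearch block per environment, each with its own alias.
        branches.append(
            f'if [host.env] == "{env}" {{\n'
            f'    elasticsearch {{\n'
            f'      hosts => ["{es_host}"]\n'
            f'      ilm_rollover_alias => "logs-{env}"\n'
            f'      ilm_policy => "{policy}"\n'
            f'    }}\n'
            f'  }}'
        )
    # Chain the branches into a single if / else if sequence.
    return "output {\n  " + " else ".join(branches) + "\n}\n"


print(render_output_section(["prod", "stg"]))
```

This moves the duplication into a template, but it does not remove the underlying limitation: every new environment still requires regenerating and redeploying the pipeline configuration.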

Evaluation

Pros of the log indexing strategy S1:

Straightforward implementation.
Supports complex log rotation policies, thanks to the integration between the Elasticsearch output plugin and ILM.
Small number of indices and active shards (low pressure on the CPU).

Cons:

Risk of index mapping explosion because logs with different schemas are stored in the same index. This will lead to larger memory and disk footprints and slower queries.
Poor flexibility of the index mappings. For example, if you want to change the data type of one field of the Jenkins logs, you must reindex the documents of all the other services, as they are stored in the same environment-scoped indices.
One ILM policy and rollover strategy for all the services running in the same environment. Say you must keep Nginx authentication logs for 5 years for security compliance. Without ad-hoc clean-up routines to delete stale documents (not a best practice), you must also keep the logs of services with shorter retention periods on disk for the same 5 years.
Boilerplate and duplications in the Elasticsearch output plugin configuration, which require manual and tedious maintenance.
A spike in the number of logs collected from one service (e.g., during an incident, when enabling debug logging) may overburden the indexing capacity of the target index and cause data loss.

S2: Group by service and environment - rotate by age

Having introduced all the important concepts in the previous section, the presentation of the second log indexing strategy will be quicker. I promise. In strategy S2, we introduce dedicated indices per environment and service to overcome the drawbacks of having heterogeneous service logs in the same index. The simplest way to achieve this is to create environment-service indices every month to enforce temporal log rotation, and route the JSON documents to the pertinent indices. The resulting indices will be named something like logs-prod-jenkins-2021.03 or logs-dev-nginx-2020.12.

Implementation Details

To apply the indexing strategy S2, it suffices to specify the desired index pattern in the “index” property of the Elasticsearch output plugin, which will generate the target index name dynamically based on the values of the host.env and service.name fields.

elasticsearch { 
  hosts => ["es_hostname"] 
  index => "logs-%{host.env}-%{service.name}-%{+yyyy.MM}" 
  ... 
}

Note that Logstash uses Joda formats to extract temporal values from the event's @timestamp field (e.g., yyyy to extract the year, MM for months, ww for weeks, …). The monthly rotation policy is entirely enforced by the index pattern above - no Elasticsearch rollover feature required.
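To see how a document ends up in a given monthly index, here is a small Python approximation of the index pattern. It is a sketch under two assumptions: the event is modelled as a flat dict with the keys "host.env", "service.name" and "@timestamp", and the date part is taken from the event's @timestamp expressed in UTC. The function `s2_index_name` is hypothetical and is not Logstash code.

```python
from datetime import datetime, timedelta, timezone


def s2_index_name(event: dict) -> str:
    """Approximate logs-%{host.env}-%{service.name}-%{+yyyy.MM} for one event."""
    # Assumption: the date is derived from the event timestamp in UTC.
    ts = event["@timestamp"].astimezone(timezone.utc)
    return f'logs-{event["host.env"]}-{event["service.name"]}-{ts:%Y.%m}'


event = {
    "host.env": "prod",
    "service.name": "jenkins",
    "@timestamp": datetime(2021, 3, 15, 9, 30, tzinfo=timezone.utc),
}
print(s2_index_name(event))  # logs-prod-jenkins-2021.03
```

One consequence of working in UTC is that an event logged shortly after midnight on the first of the month in a timezone ahead of UTC still lands in the previous month's index.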

Suppose you want to apply an ILM policy to specific environments or services. In that case, you have to create the ILM policy in Elasticsearch, then add that policy to an index template that will match the correct index name pattern.

PUT es_hostname:9200/_ilm/policy/ilm_s2_dev_logs 
{ 
  "policy": { ... } 
}

PUT es_hostname:9200/_ilm/policy/ilm_s2_prod_jenkins_logs 
{ 
  "policy": { ... } 
}

PUT es_hostname:9200/_index_template/dev_logs 
{ 
  "index_patterns": ["logs-dev-*"], 
  "template": { 
    "settings": { 
      "index.lifecycle.name": "ilm_s2_dev_logs" 
    }, 
    ... 
  }, 
  "priority": 201 
}

PUT es_hostname:9200/_index_template/jenkins_prod_logs 
{ 
  "index_patterns": ["logs-prod-jenkins-*"], 
  "template": { 
    "settings": { 
      "index.lifecycle.name": "ilm_s2_prod_jenkins_logs" 
    }, 
    ... 
  }, 
  "priority": 201 
}

The index template’s priority field allows managing overlapping patterns: if an index matches more than one index template, the one with the highest priority is used. Elasticsearch ships with built-in templates for the pattern logs-*-*, which our index names also match. The 7.x documentation lists these built-in templates with priority 100 and recommends a priority of 200 for custom templates that overlap with them, so setting the priority of our custom templates to 201 keeps them safely above both values.
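The selection rule is easy to model. The Python sketch below picks the winning template for an index name, using `fnmatchcase` as a stand-in for Elasticsearch's wildcard matching (an approximation that is fine for patterns using only `*`). The `broad_logs` template and its priority are invented for the example; the other two mirror the templates defined above.

```python
from fnmatch import fnmatchcase

# Simplified view of index templates: name, patterns, priority.
TEMPLATES = [
    {"name": "broad_logs", "index_patterns": ["logs-*"], "priority": 50},  # hypothetical
    {"name": "dev_logs", "index_patterns": ["logs-dev-*"], "priority": 201},
    {"name": "jenkins_prod_logs", "index_patterns": ["logs-prod-jenkins-*"], "priority": 201},
]


def pick_template(index_name, templates=TEMPLATES):
    """Return the name of the highest-priority template matching index_name, or None."""
    matching = [
        t for t in templates
        if any(fnmatchcase(index_name, pattern) for pattern in t["index_patterns"])
    ]
    if not matching:
        return None
    return max(matching, key=lambda t: t["priority"])["name"]


print(pick_template("logs-dev-nginx-2020.12"))     # dev_logs
print(pick_template("logs-prod-jenkins-2021.03"))  # jenkins_prod_logs
print(pick_template("logs-prod-nginx-2021.03"))    # broad_logs
```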

Evaluation

Pros of the log indexing strategy S2:

Less boilerplate and duplicated code in Logstash thanks to dynamic variable substitution.
Service-specific indices have fewer fields, leading to faster queries, improved space efficiency, and a lower risk of mapping conflicts.
More granular control over the assignment of ILM policies.
Data corruption of one shard will affect only one service’s logs (small blast radius).

Cons:

Risk of index explosion: 12 new indices every year for each combination of service and environment.
Supports only age-based rotation, which may result in indices that are too big or too small, depending on the related service's indexing rate. This is not an efficient use of the resources, and there’s no way to control it in the S2 setup.
The index property of the Elasticsearch output plugin can only enforce quite basic age conditions: daily, weekly, monthly or yearly rotations, but not - say - a three-month rotation like strategy S1.
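A quick back-of-the-envelope calculation shows how fast the index count grows with monthly rotation. The numbers below (20 services, 3 environments) are invented for illustration and do not come from a real deployment.

```python
MONTHS_PER_YEAR = 12


def monthly_indices_per_year(services: int, environments: int) -> int:
    """Indices created per year when every service/environment pair rotates monthly."""
    return services * environments * MONTHS_PER_YEAR


# Hypothetical landscape: 20 services deployed on 3 environments.
print(monthly_indices_per_year(services=20, environments=3))  # 720
```

Each of those indices carries at least one primary shard, regardless of how few documents it holds, which is where the inefficiency for low-volume services comes from.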

A case for S3

How can we combine the advantages of the log indexing strategies S1 (sophisticated log rotation policies, no risk of index explosion) and S2 (granular control over service-specific settings and behaviours, improved operations, efficient use of the cluster resources) while minimizing their drawbacks? That’s a great question, but for another blog post (bummer!). If you already have some ideas and can’t wait to share them, you are welcome to reach out to us at elastic@kreuzwerker.de.

Image by Ag Ku from Pixabay