Preslav Mihaylov • Originally published at pmihaylov.com

Getting The Most Out of Your Logs with ELK

When you start developing your application, you typically instrument it with some logging to be able to debug problems later.

Some skip it during development, but once the application hits production, having some logging becomes crucial.

After all, once users complain that something isn't working, how else would you find the root cause?

And although logging proves to be useful, many companies don't really capitalise on its potential as they're still clinging to the classic way of writing freestyle logs and grep-ing them on their prod machines afterwards.

However, there is so much more potential that logging holds for monitoring our production systems. In this article, I will show you how to get the maximum value from your logs using the ELK stack.

The Classic Way of Logging

What I refer to as the classic way to log looks like this:

INFO - 2020-01-08 12:34:03 - Received database error with user 53...

The amount of debug detail present in the logs varies and depends on what the developer decides to include. Sometimes, the logs include other metadata, such as the user's country, the endpoint, the HTTP method, a stack trace, etc.

This all depends on the developer who prints out the log line.
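For reference, a line like the one above could come from something as simple as the standard library logger. Here's a minimal Go sketch (the message and user ID are made up for illustration):

package main

import "log"

func main() {
    userID := 53
    // A free-form, human-readable message: quick to write,
    // but its exact contents depend entirely on the developer.
    log.Printf("INFO - Received database error with user %d...", userID)
}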

So how do you really debug an issue once it occurs?

Typically, you ssh into the production machine and fetch the application's logs locally. Then, using good old grep, you extract the key log lines you need and start tracing what happened, with the IDE open on the side to see where each log line came from.

This approach can get you quite far and works well for small to medium applications. But as your application evolves, this workflow starts getting in the way.

The Problems with the Classic Approach

The more complex your system becomes and the more logs it produces, the harder it gets to identify the problems within it.

Just to start looking for the problem, you have to log in to your prod machine, download the logs, filter out all the lines you don't need via some advanced grep usage, open the IDE, grab a sheet of paper to "trace the log history", etc.

This can work well when you have enough time and patience, but imagine being woken up in the middle of the night and having to go through this entire drill just to start looking for the problem.

That disturbance and overhead wear developers down. This is why, for these occasions, you should have proper tools that filter out the noise and help you identify the problem fast.

For larger companies especially, an hour of downtime can translate into a huge financial loss.

This gets even worse once your application starts to scale and you have a ton of microservices set up. That brings even more noise to your logs, and identifying the problem becomes even harder.

So, now that your team is starting to experience these growing pains, what options do you have?

Introducing Structured Logging

Structured logging is a simple concept: it makes your logs parseable and analysable by a monitoring tool.

To put it simply, a structured log is one that has a schema, typically represented in JSON. Here's an example:

{"level":"info","ts":1587291719.4168398,"caller":"my-application/main.go:76","msg":"Inbound request succeeded","endpoint":"/payments/execute","method":"GET","countryISO2":"BG","userID":"42","paymentMethod":"VISA","userType":"business"}

This time, the log is structured as JSON with a set of keys and values. Those key-value pairs hold metadata such as the user's country, payment method, user type, etc.

This is just an example of course. The keys can be domain-specific, based on your application's business logic.
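For illustration, here's a minimal Go sketch of how such a line could be produced. The article doesn't name a logging library, so this sketch assumes Uber's zap (whose production output the example above closely resembles); any library that emits key-value pairs as JSON works the same way.

package main

import "go.uber.org/zap"

func main() {
    // zap.NewProduction() emits JSON logs with the level, ts and caller fields.
    logger, err := zap.NewProduction()
    if err != nil {
        panic(err)
    }
    defer logger.Sync()

    // Domain-specific metadata is attached as typed key-value pairs,
    // producing a line much like the example above.
    logger.Info("Inbound request succeeded",
        zap.String("endpoint", "/payments/execute"),
        zap.String("method", "GET"),
        zap.String("countryISO2", "BG"),
        zap.String("userID", "42"),
        zap.String("paymentMethod", "VISA"),
        zap.String("userType", "business"),
    )
}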

Now, if developers are diligent, they can enrich their logs with the same amount of metadata using the classic logging approach as well.

But the great benefit of structured logging is that the logs can now be processed by other analytical applications. At the very least, you'll be able to grep your logs more easily, since you can filter out the lines and fields you don't need based on your log's schema, as in the sketch below.
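Here's a small, hypothetical Go sketch that reads JSON log lines from stdin and keeps only the entries matching a couple of fields from the schema above (the field names mirror the earlier example; in practice a tool like jq does the same job from the command line):

package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "os"
)

// logLine mirrors a subset of the structured log schema shown earlier.
type logLine struct {
    Level    string `json:"level"`
    Msg      string `json:"msg"`
    Endpoint string `json:"endpoint"`
    UserID   string `json:"userID"`
}

func main() {
    scanner := bufio.NewScanner(os.Stdin)
    for scanner.Scan() {
        var entry logLine
        // Skip lines that aren't valid JSON (e.g. leftover plaintext output).
        if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
            continue
        }
        // Keep only the entries we care about: errors on a specific endpoint.
        if entry.Level == "error" && entry.Endpoint == "/payments/execute" {
            fmt.Printf("%s %s user=%s\n", entry.Level, entry.Msg, entry.UserID)
        }
    }
    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, "reading input:", err)
    }
}

You'd run it as something like ./logfilter < application.log.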

However, if you use structured logging just to make grep-ing easier, then that's like using a windmill for milling corn.

In order to get the most out of structured logging, you need the right tool for the job.

Analysing your Structured Logs using ELK

ELK stands for Elasticsearch, Logstash & Kibana. These are a bundle of free tools, all maintained by the same company, Elastic. They are typically used together to provide you with a monitoring system based on your logs.

Learning the details of how these components work together is not as important as understanding the value they bring. You can study the details of how they work and how to set them up later. We have to get you sold first.

The end result is a monitoring dashboard which looks like this:

Entire dashboard view in ELK

The dashboard is fully customisable and has support for all kinds of views - graphs, histograms, vertical bars, pie charts, etc.

The example I've provided above is of an application which resembles a payment provider and has a couple of endpoints, some of which have errors in them.

And the cool thing is that this dashboard is 100% based on those structured logs I showed you previously. ELK simply parses that input & visualises it in this pleasant way.

With a dashboard like this set up, a single glance lets you:

  • Evaluate the success-to-error ratio
  • See which endpoints have the most errors
  • See what kinds of errors your application is throwing
  • See which input parameters (e.g. country, user type, payment method) cause the highest percentage of errors

But there's more to it than simply an overview of your system's status.

Zooming in on your data

With a single click you can "zoom in" on any input value and see what the data shows for that input.

For example, if I want to see the success/error details of the endpoint /payments/authhold, all I need to do is filter that value:

filter by value example

The dashboard will then refresh and show only the details of the chosen endpoint.

zoomed view on specific endpoint

But an aggregated view of your logs is not the only thing you can set up.

You can also easily add a detailed view of all your log lines, which you can inspect the same way you would plain logs with grep.

But still, a bit more convenient:

Detailed logs view with ELK

Now, which approach do you prefer for finding the root cause of an outage? Especially when you get woken up in the middle of the night.

Conclusion

If you've been writing & grep-ing plaintext logs until now, this article should feel like a mini industrial revolution for your root-cause analysis workflow.

Integrating ELK into your application can make your life and the lives of your teammates much easier, and even pleasant, when it comes to debugging issues in production.

However, this article only explains how structured logging and ELK work conceptually. If you're hooked, then you should spend some time afterwards to read more about how ELK works and how to integrate it with your application.

Are you interested in a follow-up article, walking you through this?
Then let me know in the comments section below.
