The Journey of a Data Platform (2 Part Series)
I'm writing a series on how to build a Data Platform from scratch and it's time for Part 2! In Part 1 I explained how to start building your data platform. But when your infrastructure grows, making sure that everything is working as expected becomes a challenge. And as one of my dearest colleagues tells me all the time: "monitoring and logging is an art!"
In Part 2, I want to tell you how to set up a production-grade monitoring system for your infrastructure. Concepts and caveats are more valuable than some "copy-paste" piece of code, for the simple reason that it is important to understand why certain choices are made.
Previously, I mentioned that we are using a bunch of tools. When I started listing all of them I thought: wow, so many things for something that looks relatively simple in concept. I mean, all we want is to be able to see logs and act if something happens. Simple, right? Well, not quite. To refer back to my colleague: monitoring and logging is definitely an art :)
Let's start by dividing the problem into sub-problems. In school they taught us that when a problem is too big, we have to divide it and conquer it (aka divide et impera, if you are into Latin).
Since we want to understand what is going on in our infrastructure and applications, we should start by tackling logging first. Generally speaking, an application writes its logs to STDOUT. Since we are using Kubernetes, that means we are able to read the logs from the logging console that comes with Kubernetes itself. Because we are able to read logs, we can also collect them. How can we collect them? As I mentioned in the previous article, we use Elasticsearch for indexing the logs. In order to collect them we use Fluent Bit.
Fluent Bit collects information from different sources, buffers it and dispatches it to different outputs. The Helm chart we use runs it as a daemon set in Kubernetes (the Kubernetes documentation has more info about daemon sets). This guarantees that there will be an instance of Fluent Bit per machine in the Kubernetes cluster. The process collects the standard output of each pod in Kubernetes and redirects it towards another system. In our case we chose Kafka.
Now our logs are securely sent to a topic in Kafka and are ready to be consumed by something else.
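To make the collect, buffer and dispatch idea more concrete, here is a minimal Python sketch. It is not Fluent Bit's API and not our real setup: the LogForwarder class, the pod names and the plain list that stands in for a Kafka producer are all made up for illustration.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


class LogForwarder:
    """Toy collect -> buffer -> dispatch loop (illustrative, NOT Fluent Bit's API)."""

    def __init__(self, output: Callable[[List[Record]], None], batch_size: int = 3) -> None:
        self.output = output          # hypothetical stand-in for a Kafka producer
        self.batch_size = batch_size  # flush once this many records are buffered
        self.buffer: List[Record] = []

    def collect(self, pod: str, line: str) -> None:
        """Tag a stdout line with its pod name and buffer it."""
        self.buffer.append({"pod": pod, "log": line.rstrip("\n")})
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Dispatch whatever is buffered as one batch, then clear the buffer."""
        if self.buffer:
            self.output(list(self.buffer))
            self.buffer.clear()


# Hypothetical pods and log lines; the list plays the role of the Kafka topic.
sent_batches: List[List[Record]] = []
forwarder = LogForwarder(output=sent_batches.append, batch_size=2)
forwarder.collect("api-7d9f", "GET /health 200\n")
forwarder.collect("api-7d9f", "GET /users 500\n")
forwarder.collect("db-0", "checkpoint complete\n")
forwarder.flush()  # push the remaining record before shutting down

for batch in sent_batches:
    print(batch)
```

The point of the buffer is that the output is called once per batch instead of once per line, which is roughly why a collector sits between the pods and Kafka in the first place.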
Applications write logs to the standard output, but machines don't write logs, right? So, how do I know how a machine is behaving? How do I know if the CPU is sky-rocketing or if the I/O operations on the disk are the bottleneck of my application? Well, that's where Node Exporter comes into play.
Node Exporter collects metrics from the underlying operating system. This is powerful because now we are able to collect the system information we need. Once again, there is a Helm chart that comes to our rescue.
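As a small illustration of what you can do with such metrics: Node Exporter exposes CPU time as ever-growing counters (for example node_cpu_seconds_total with mode="idle"), so a "sky-rocketing" CPU is something you derive from the difference between two readings. The sketch below assumes a single core and two hypothetical readings taken 60 seconds apart; the function name is mine, not part of any tool.

```python
def cpu_utilization(idle_before: float, idle_after: float, elapsed_seconds: float) -> float:
    """Fraction of time one core was busy between two idle-counter readings."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    idle_fraction = (idle_after - idle_before) / elapsed_seconds
    return 1.0 - idle_fraction


# Hypothetical readings of the idle counter for one core, 60 seconds apart.
utilization = cpu_utilization(idle_before=1000.0, idle_after=1045.0, elapsed_seconds=60.0)
print(f"core busy for {utilization:.0%} of the interval")
```

In practice you would let Prometheus do this kind of calculation over the scraped counters rather than computing it by hand.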
Cool, but what if an application is able to give me more information than simple logging? For example, what if my database is able to give me its current system information, such as memory consumption or average query latency? That's hard. These are neither logs nor metrics coming from a machine. Still, they are available for us to use. That's when Prometheus enters the arena.
Prometheus is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true. BINGO. This sounds like the tool that will do a lot of things for us. But where does it stand in the big picture? Let's take a look at the image below.
I took this picture from this great article, and it describes clearly what Prometheus does. Basically, its role is to pull metrics from its targets and push alerts out. But what is relevant to know is that Prometheus standardized the way this information should be exposed, so that it can be parsed and, at a later stage, queried.
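To give you a feeling for that standardized format, here is a small Python sketch that parses a few lines written in the Prometheus text exposition format. It is a simplified parser, assuming no escaped characters in label values and no trailing timestamps, and the sample text (including the HELP lines) is made up by me and was not scraped from a real machine.

```python
import re
from typing import Dict, List, NamedTuple


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    value: float


# metric_name{label="value",...} 123.4   (the label block is optional)
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text: str) -> List[Sample]:
    """Parse a simplified subset of the Prometheus text exposition format."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, HELP and TYPE lines
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append(Sample(name, labels, float(value)))
    return samples


# Illustrative payload, similar in shape to what an exporter serves on /metrics.
EXPOSITION = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 234.5
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 8.123456e+09
"""

samples = parse_metrics(EXPOSITION)
for sample in samples:
    print(sample.name, sample.labels, sample.value)
```

Every sample is just a name, a set of labels and a number, and that uniformity is what makes the metrics easy to query later on.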
Because our data platform has multiple Kubernetes clusters (remember controlplane, production, development, etc.), Prometheus needs to be installed in all of them. Thanks to the awesome community of developers, there is a Helm chart that we can use. This operator also allows us to use Prometheus in federation mode, which is very important in this context. The federation concept allows the Prometheus in controlplane to scrape the information from the other Prometheus services, so that we can centralize all the metrics in one single point.
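Under the hood, federation means that the central Prometheus scrapes the /federate endpoint of the other Prometheus servers, passing one or more match[] selectors to say which series it wants. The Python sketch below only builds such URLs so you can see their shape; the host names and the job selector are hypothetical and are not our real ones.

```python
from typing import List
from urllib.parse import urlencode


def federate_url(base_url: str, selectors: List[str]) -> str:
    """Build the URL a central Prometheus would scrape on a downstream one."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical downstream Prometheus servers, one per environment.
downstream = {
    "production": "http://prometheus.production.example:9090",
    "development": "http://prometheus.development.example:9090/",
}
selectors = ['{job="node-exporter"}', '{__name__=~"job:.*"}']

urls = {env: federate_url(base, selectors) for env, base in downstream.items()}
for env, url in urls.items():
    print(env, url)
```

In a real setup you would not build these URLs yourself: they come out of the scrape configuration of the Prometheus in controlplane.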
We decided to create controlplane to centralize the information about the other environments and to have an overview of what is going on in our platform.
Since we pushed our logs into Kafka, we now need to consume them and store them in a format that is readable to humans.
There is a famous acronym called ELK, which stands for Elasticsearch, Logstash, Kibana. So far we have mentioned the E and the K but never the L. Well, that time has just arrived.
Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favourite "stash." It is part of the Elastic suite, and it is a fundamental piece for making sure that everything that comes in ends up with the same type of logging.
Our input is the Kafka topic we mentioned before and our output is Elasticsearch, where the data is then indexed and "stashed". The Helm chart helps you install the application, and by modifying this part of the values.yml you can easily read from Kafka. The major issue we found was in the @timestamp field. In fact, we had to adapt the values.yml a little to avoid issues in reading the timestamp.
The following snippet of code will help you to solve such an issue
You have to modify the timezone accordingly, but that was the major reason why our data could not be ingested into Elasticsearch correctly.
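To show what goes wrong with timestamps, here is the underlying idea sketched in Python. This is not Logstash configuration and not the content of our values.yml: it only illustrates, under my own assumptions (a log line with a local timestamp without timezone, and a fixed UTC offset of one hour), why the timezone has to be stated explicitly before the value ends up in @timestamp, which is kept in UTC.

```python
from datetime import datetime, timedelta, timezone


def to_utc_timestamp(raw: str, utc_offset_hours: int) -> str:
    """Interpret a naive local timestamp in the given offset and render it in UTC."""
    local = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S").replace(
        tzinfo=timezone(timedelta(hours=utc_offset_hours))
    )
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")


# Hypothetical log timestamp written by an application running at UTC+1.
raw_timestamp = "2020-03-01 09:30:00"

correct = to_utc_timestamp(raw_timestamp, utc_offset_hours=1)
wrong = to_utc_timestamp(raw_timestamp, utc_offset_hours=0)  # timezone ignored

print("with timezone:   ", correct)
print("without timezone:", wrong)
```

If the timezone is ignored, every event is shifted by the offset, so the logs show up in the wrong place on the timeline when you search for them.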
We just finished covering the logging part, but how do we visualize everything? There are two main applications: Kibana and Grafana. We use Kibana to explore all the logs that are coming in from all the applications. Without Kibana, it would be extremely hard to debug your application, because searching for what is going on is very hard with a kubectl logs -f <pod-name> | grep whatever-error command :)
Grafana helps in visualizing all the metrics coming in from Prometheus. There are a ton of pre-made dashboards that you can just install and use. The only thing you need to do is to set up the Prometheus installed in controlplane as your data source in Grafana and that's it. All the metrics will automagically be available for you.
Alerting is the toughest part of the process. Once again, the concept is simple, but deciding the thresholds at which you want to receive an alarm is difficult, and they need to be tuned along the way. I would recommend starting from this awesome website and building from there the rules that are important for your data platform.
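One thing that helps with the tuning is to alarm only when a value stays above the threshold for a while, instead of on every single spike. Prometheus alerting rules have a "for" clause for this. The Python sketch below mimics the idea in a very simplified way, counting consecutive samples instead of measuring time; the function name, the threshold and the sample values are assumptions of mine.

```python
from typing import Iterable


def should_fire(samples: Iterable[float], threshold: float, for_samples: int) -> bool:
    """Fire only if the value stays above the threshold for N consecutive samples."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= for_samples:
            return True
    return False


# Hypothetical CPU utilization readings, one per evaluation interval.
short_spike = [0.50, 0.95, 0.60, 0.70]
sustained_load = [0.50, 0.95, 0.97, 0.60]

print("short spike fires:   ", should_fire(short_spike, threshold=0.9, for_samples=2))
print("sustained load fires:", should_fire(sustained_load, threshold=0.9, for_samples=2))
```

Both the threshold and the duration are knobs you will end up adjusting after the first few noisy nights.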
I can't help you with building the rules that you need for your data platform, but I can give you advice on selecting the right tool to notify you and your team when an alarm is triggered. My suggestion is to send the notification to Slack for the alarms that you consider "minor". I leave the definition of minor up to you. We only send a Slack notification for the development environment and for those applications that are not public in
For production systems we use PagerDuty to create a rotation calendar for taking care of the systems among team members and to make sure that everything is always up and running. There is a great integration with Prometheus that I highly recommend setting up.
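The routing described above boils down to a small decision: minor alarms and everything from development go to Slack, the rest of production pages whoever is on call. Here is that decision as a Python sketch. The label names (environment, severity), their values and the channel names are assumptions for illustration; in a real setup this logic lives in the alerting configuration and not in application code.

```python
from typing import Dict


def pick_channel(labels: Dict[str, str]) -> str:
    """Decide where an alarm goes based on its (hypothetical) labels."""
    environment = labels.get("environment", "development")
    severity = labels.get("severity", "minor")
    if environment == "production" and severity != "minor":
        return "pagerduty"  # wake up whoever is on call
    return "slack"          # good to know, but it can wait


# Made-up alarms to show the three interesting cases.
alerts = [
    {"alertname": "DiskAlmostFull", "environment": "production", "severity": "critical"},
    {"alertname": "DiskAlmostFull", "environment": "development", "severity": "critical"},
    {"alertname": "SlowQueries", "environment": "production", "severity": "minor"},
]

for alert in alerts:
    print(alert["alertname"], alert["environment"], "->", pick_channel(alert))
```

Defaulting unknown alarms to Slack is a deliberate choice in this sketch: an unlabeled alarm should not page anybody at 3 AM until someone has decided that it deserves to.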
Grafana also helps with alerting but we haven't used it yet. It looks awesome though. If you've been using Grafana, it would be great if you can share your experience with me in the comments below :)
In this long blog post I gave you an overview of the tools that my team and I are using for our data platform. I hope this gave you more ideas on how to start! You will encounter problems along the way because "Rome wasn't built in a day". I do hope that you now have all the initial information you need to collect, visualize and receive alarms in your data platform. But remember my colleague's motto: logging and monitoring is an art :)