DEV Community


Thanos - Scale Your Prometheus Monitoring

Kunal Kushwaha
Founder Community Classroom. CNCF Ambassador. Kubernetes Release Team. GitHub Campus Expert. MLH Coach & Fellowship Team. Student PM DoK. Dev Advocate Civo. GSoC Mentor. Gold MLSA. YouTuber.
・6 min read

What is Monitoring?

Applications grow complex and need to be managed at scale to keep your infrastructure operational. You should have a way of knowing how your applications are running, how resources are being utilized, and how usage grows over time. Typically you have, let's say, multiple servers running containers. As user traffic grows, it makes sense to split these services out individually, which gets us to a microservice architecture. Now, if services want to connect with each other, there should be some way for them to be interconnected, and monitoring gives you visibility into all of these moving parts.

What is Prometheus?


Prometheus is an open-source monitoring & alerting tool. It was originally built by SoundCloud and now it is 100% open-source as a Cloud Native Computing Foundation graduated project. It has become highly popular in monitoring container & microservice environments.

Check out my blog on Prometheus to learn more about it.

Limitations of Prometheus

  • High Availability: Imagine a scenario where a number of microservices are being scraped by a single Prometheus server. If that server goes down, we lose monitoring entirely. One way to solve this is to run two Prometheus servers scraping the same targets, but that doubles resource consumption and leaves us with two overlapping copies of the data.

  • Storage Limitation: Prometheus retains local TSDB data for 15 days by default, and that data lives on a single local disk. We also need long-term storage for our metrics in case the disk gets corrupted and we lose our data.

  • Scaling Prometheus: We tend to scale Prometheus using federation, where a main Prometheus server scrapes aggregated data from the leaf clusters, and we go down to the leaf clusters for fine-grained details. If we add HA on top, the main server now scrapes double the amount of data from the leaf clusters. Hence, we end up using a lot of bandwidth.
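For context, federation works by having the main Prometheus server scrape the `/federate` endpoint of each leaf. A minimal sketch of that scrape config (the job name, match selector, and target hostnames are illustrative):

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'   # which series to pull from each leaf
    static_configs:
      - targets:
          - 'leaf-prometheus-1:9090'
          - 'leaf-prometheus-2:9090'
```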

What is Thanos?

The use case of Thanos is to deal with these limitations of Prometheus at a global scale. Thanos is a highly available metrics system built from a set of composable components that bring HA, scalability, and effectively unlimited storage capacity on top of existing Prometheus setups.

Aim of Thanos:

  • Prometheus Compatible
  • Global Query View of Metrics - to see metrics across different clusters from one place.
  • Unlimited Retention - to keep long-term data, down to individual samples.
  • Downsampling and Compaction of data to keep queries efficient.

Fixing Scalability Issues #1

In the previous example we were ingesting data from the leaves, but what if, instead of doing that, we query the data at runtime? To do this we add a Query component alongside the main cluster, and our service becomes scalable. One limitation remains: every time we get a request, we have to proxy it to every single cluster. Say the data we need lives only in cluster 2 - we would still query all the other clusters, which is inefficient.
What if we add a component (the Sidecar) that sits alongside each Prometheus server and announces which metrics live in its local TSDB? Now the Query component can know ahead of time which cluster to look into for the data.

Querier/Query:

  • The Query component is stateless and horizontally scalable and can be deployed with any number of replicas.
  • It automatically detects which Prometheus servers need to be contacted for a given PromQL query.
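A sketch of running the Querier pointed at two Sidecar gRPC endpoints (the addresses and the replica label name are assumptions about your setup):

```shell
thanos query \
  --http-address 0.0.0.0:10904 \
  --grpc-address 0.0.0.0:10903 \
  --store sidecar-1.example:10901 \
  --store sidecar-2.example:10901 \
  --query.replica-label replica   # deduplicate series from HA Prometheus pairs
```

Grafana can then use `http://<querier>:10904` as a regular Prometheus data source.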

Sidecar:

  • Thanos integrates with existing Prometheus servers through a Sidecar process, which runs on the same machine or in the same pod as the Prometheus server.
  • The purpose of the Sidecar is to backup Prometheus data into an Object Storage bucket, and give other Thanos components access to the Prometheus metrics via a gRPC API.
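A sketch of running the Sidecar next to a Prometheus server, with a minimal S3 bucket config (the bucket name, endpoint, and credentials are placeholders):

```shell
thanos sidecar \
  --tsdb.path /var/prometheus \        # same data directory as Prometheus
  --prometheus.url http://localhost:9090 \
  --objstore.config-file bucket.yml \  # optional: enables block uploads
  --grpc-address 0.0.0.0:10901 \
  --http-address 0.0.0.0:10902
```

```yaml
# bucket.yml
type: S3
config:
  bucket: "thanos-metrics"
  endpoint: "s3.us-east-1.amazonaws.com"
  access_key: "<ACCESS_KEY>"
  secret_key: "<SECRET_KEY>"
```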

Store API:

All Thanos components that serve metric data implement this gRPC API. It is what allows the Querier to fetch the metric data stored in Prometheus, and later in object storage.

Fixing High Availability Issues

The above solution also solves the HA issue for the federated instances. Grafana can point at two Query components, so if one goes down, the other can serve traffic. The same can be done at the leaf clusters.
But with two Prometheus replicas, we now get the same series from both Sidecars. The Querier solves this too: it merges the results in memory, deduplicates the overlapping series based on a replica label, and sends a single clean result back to Grafana.

Fixing Storage Limitation Issues

One way to solve this is to use object storage, as it is cost-efficient, flexible, and scalable. We have the Sidecars upload TSDB blocks to object storage, but we still need a way to expose that long-term data through the query path. The Store Gateway component does exactly that.

Store Gateway:

The Store Gateway exposes the StoreAPI and needs to be discovered by the Thanos Querier. Because data is backed up into the object storage of your choice, you can decrease Prometheus retention and store less locally. The Store Gateway makes all that historical data queryable again by implementing the same gRPC StoreAPI as the Sidecars, backed by the data it finds in your object storage bucket.
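A sketch of running the Store Gateway against the same bucket (paths and addresses are assumptions):

```shell
thanos store \
  --data-dir /var/thanos/store \   # local cache for index data
  --objstore.config-file bucket.yml \
  --grpc-address 0.0.0.0:10905     # register this address as a store on the Querier
```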

Compactor:

Imagine querying data over a really long time range: reading every raw sample uses a ton of bandwidth and a lot of disk reads. Instead, we can serve such queries from downsampled data, i.e. one sample every few minutes or hours rather than every scrape. The Compactor component creates these downsampled copies and also compacts the blocks in object storage.
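A sketch of running the Compactor against the bucket (run exactly one Compactor per bucket; `--wait` keeps it running continuously instead of exiting after one pass):

```shell
thanos compact \
  --data-dir /var/thanos/compact \   # scratch space for compaction work
  --objstore.config-file bucket.yml \
  --wait
```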


Fixing Scalability Issues #2

Instead of exposing a port on every Prometheus server and managing query endpoints, you can remote-write your data to a new component, the Receiver. The Receiver exposes the StoreAPI, and when it sees a remote-write request it stores those metrics in a local TSDB that it runs itself. Receivers are arranged into a scalable hashring, and each uploads its data to object storage.


Receiver:

It receives data from Prometheus's remote write, exposes it via the StoreAPI, and uploads TSDB blocks to object storage.
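A sketch of a Receiver and the matching Prometheus `remote_write` config pointing at it (hostnames and the replica label are illustrative):

```shell
thanos receive \
  --tsdb.path /var/thanos/receive \
  --objstore.config-file bucket.yml \
  --remote-write.address 0.0.0.0:19291 \   # Prometheus writes here
  --grpc-address 0.0.0.0:10907 \
  --label 'replica="receive-0"'
```

```yaml
# In prometheus.yml on each scraping Prometheus:
remote_write:
  - url: "http://thanos-receive.example:19291/api/v1/receive"
```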

Ruler/Rule:

Thanos Ruler lets you evaluate alerting or recording rules that require a global view. It does this on top of a given Thanos Querier.
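A sketch of running the Ruler against a Querier (the rule file contents, Alertmanager address, and hostnames are assumptions):

```shell
thanos rule \
  --data-dir /var/thanos/rule \
  --rule-file /etc/thanos/rules.yml \   # regular Prometheus-format rule file
  --query http://thanos-query.example:10904 \
  --alertmanagers.url http://alertmanager.example:9093 \
  --objstore.config-file bucket.yml     # upload the resulting blocks
```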

Final Picture!


Running it locally

Dependencies Required:

  • One or more Prometheus v2.2.1+ installations with persistent disk.
  • Optional object storage: Thanos is able to use many different storage providers, with the ability to add more providers as necessary.

If you want to build Thanos from source, you need a working installation of the Go 1.15+ toolchain (GOPATH, PATH=${GOPATH}/bin:${PATH}).
To use the Thanos components, you need the thanos binary, which you can get by running: go get github.com/thanos-io/thanos/cmd/thanos

Check out the Thanos Docker Compose repository to get started quickly.

Thanks for reading!

Connect with me
