As applications grow more complex, they need to be managed at scale to keep your infrastructure operational. You need a way of knowing how your applications are running, how resources are being utilized, and how usage is growing over time. Typically you have, let's say, multiple servers running containers. As user traffic grows, it makes sense to split these services up and run them independently, which leads to a microservice infrastructure. Now, if services need to talk to each other, there must be some way for them to be interconnected.
Let's say your application stops working. You don't know what went wrong or which component of your application caused the failure. Or let's say your application is responding very slowly because all the traffic is being directed to only a few servers. That is a place no one wants to be in, as debugging this manually is going to be very time-consuming.
So how do you ensure that your application is properly maintained and runs with no downtime? We need some sort of automated tool that constantly monitors our application and alerts us when something goes wrong (or right, depending on the use case). In our previous example, we would be notified when a service fails, so we could act before the whole application goes down.
Prometheus is an open-source monitoring & alerting tool. It was originally built by SoundCloud and now it is 100% open-source as a Cloud Native Computing Foundation graduated project. It has become highly popular in monitoring container & microservice environments.
- Target - It is what Prometheus monitors. It can be your applications, servers, etc.
- Metric - The particular thing we want to monitor on a target. For example, if we have a server (target), we might want to monitor the number of errors on the HTTP endpoints it exposes (metric).
- Time Series Database (TSDB) - Stores the metrics data. It ingests the data (append-only), compacts it, and allows it to be queried efficiently.
- Scrape Engine - Pulls the metrics (description above) from our target resources and sends them to the TSDB. (Prometheus pulls are called scrapes).
- Server - Used to make queries for the data stored in TSDB. This is also used to display the metrics in a dashboard using Grafana/Prometheus UI.
The metrics are defined with HELP and TYPE attributes to increase readability.
HELP - It provides a description of the metric.
TYPE - Even though Prometheus offers only 4 core metric types to keep things simple, it allows us to attach labels (tags) to metrics of those types for more specific use cases. The 4 core metric types are:
- Counter - As the name suggests, it keeps a running count of something, such as the number of requests or errors. Note: Do not use this type if the value of your metric can decrease.
- Gauge - It is best suited for metrics that can go up and down, like CPU usage.
- Histogram - A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
- Summary - Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.
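To make the histogram description above more concrete, here is a minimal Python sketch of how observations could be counted into buckets. It is a toy model, not the Prometheus implementation: the `Histogram` class, the bucket bounds, and the sample durations are all made up for illustration. The one detail taken from Prometheus itself is that buckets are exposed cumulatively, so each bucket also includes every observation from the smaller buckets.

```python
import bisect
import math


class Histogram:
    """Toy histogram: per-bucket counts plus a running sum and total count."""

    def __init__(self, upper_bounds):
        # Upper bounds are inclusive; an infinite bucket catches everything else.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Find the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        # Each returned bucket includes the observations of all smaller buckets.
        running = 0
        result = []
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            running += n
            result.append((bound, running))
        return result


# Hypothetical request durations in seconds.
request_seconds = Histogram([0.1, 0.5, 1.0])
for duration in (0.05, 0.3, 0.3, 0.7, 2.0):
    request_seconds.observe(duration)

for bound, n in request_seconds.cumulative():
    print(f"<= {bound}: {n}")
```

A summary would track the same sum and count, but report quantiles instead of bucket counts.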
The Data Retrieval Worker pulls the data from the HTTP endpoints of the targets on the path /metrics. Here we notice 2 things:
- The endpoints should expose the /metrics path
- The data provided by the endpoint should be in the correct format that Prometheus understands.
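As a rough illustration of the "correct format" mentioned above, the following Python sketch renders one metric in the Prometheus text exposition format, including the HELP and TYPE lines. The `render_metric` helper and the metric name are hypothetical, and the sketch skips details such as escaping special characters in label values.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            # Labels are rendered as key="value" pairs inside braces.
            label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


text = render_metric(
    "http_requests_total",
    "Total number of HTTP requests.",
    "counter",
    [({"method": "get", "code": "200"}, 1027), ({"method": "post", "code": "500"}, 3)],
)
print(text)
```

Each sample line is the metric name, an optional set of labels, and the current value; this plain-text page is what a scrape of /metrics returns.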
Q. How do we make sure that the target services expose /metrics and that the data is in the correct format?
A. Some of them expose the endpoint by default. Ones that do not, need a component to do so. This component is known as an Exporter. An Exporter does the following:
- Fetch data from the target
- Convert data into a format that Prometheus understands
- Expose the /metrics endpoint (This can now be retrieved by the Data Retrieval Worker)
For different types of services, like APIs, Databases, Storage, HTTP, etc., Prometheus has a list of Exporters you can use.
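The three Exporter steps above can be sketched with nothing but the Python standard library. This is only an illustration of the idea, not how any official Exporter is written: the `demo_queue_depth` metric, the hard-coded value, and the function names are assumptions made for the example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def fetch_queue_depth():
    # Step 1: fetch data from the target. Hard-coded here; a real exporter
    # would query the service it sits in front of.
    return 7


def to_exposition(depth):
    # Step 2: convert the raw value into the text format Prometheus reads.
    return (
        "# HELP demo_queue_depth Number of jobs waiting in the queue.\n"
        "# TYPE demo_queue_depth gauge\n"
        f"demo_queue_depth {depth}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Step 3: expose the result on /metrics; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = to_exposition(fetch_queue_depth()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


# Port 0 asks the OS for any free port, so the demo does not clash with anything.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Play the role of the scraper: a plain HTTP GET on /metrics.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
conn.request("GET", "/metrics")
scraped = conn.getresponse().read().decode("utf-8")
conn.close()
print(scraped)

server.shutdown()
server.server_close()
```

The last few lines stand in for Prometheus itself: from the scraper's point of view, an Exporter is just an HTTP server that answers GET /metrics with text.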
Let's say you want to monitor an application you have written in Java. You can use Client Libraries for that. They let you expose application metrics via an HTTP endpoint /metrics on your application's instance, which the Prometheus server can then scrape. In the official documentation, a list of various libraries has been provided, with information on how to create your own.
As mentioned above, Prometheus uses a pull mechanism to get data from targets. Most other monitoring systems, however, use a push mechanism (we'll see what that is in a bit). How is this different, and what makes Prometheus so special?
Q. What do you mean by push mechanism?
A. Instead of the server of the monitoring tool making requests to get the data, the servers of the application push the data to a database instead.
Q. Why is Prometheus better?
A. Multiple Prometheus instances can each get the data simply by requesting it from the target's endpoint. Also note that this way Prometheus can tell whether an application is responsive or not, rather than waiting for the target to push data.
(Check out the official comparison documentation)
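The responsiveness point can be shown with a small Python sketch. Because the monitoring side initiates each request, a failed request is itself a signal. Prometheus records this as an `up` value of 1 or 0 for every scrape; the `scrape` function and the two fake targets below are hypothetical stand-ins used only to show the idea.

```python
def scrape(fetch):
    """Try one scrape; return (up, body). `fetch` is any callable returning text."""
    try:
        return 1, fetch()
    except OSError:
        # Connection refused, timeout, DNS failure: the target counts as down.
        return 0, ""


def healthy_target():
    return "demo_requests_total 42\n"


def dead_target():
    raise ConnectionRefusedError("target is not listening")


for name, fetch in (("api", healthy_target), ("worker", dead_target)):
    up, body = scrape(fetch)
    print(f'up{{job="{name}"}} {up}')
```

With a push model, a silent target could be either idle or dead; here the failed pull makes the difference visible immediately.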
NOTE: But what happens if a target does not live long enough to be scraped, as with a short-lived batch job? For this, Prometheus offers the Pushgateway. Such services push their metrics to the Pushgateway, and Prometheus then scrapes the Pushgateway like any other target. This way, you get the best of both approaches!
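Conceptually, the Pushgateway is a small cache that sits between short-lived jobs and Prometheus: jobs push to it, and Prometheus scrapes it. The Python sketch below models only that idea in memory. The real Pushgateway is a separate HTTP service, and the `ToyPushgateway` class, job name, and metric name here are invented for illustration.

```python
class ToyPushgateway:
    """Holds the last pushed value per (job, metric) until it is scraped."""

    def __init__(self):
        self._latest = {}

    def push(self, job, metric, value):
        # A short-lived job calls this just before it exits.
        self._latest[(job, metric)] = value

    def render(self):
        # What a scrape of the gateway would return.
        return "".join(
            f'{metric}{{job="{job}"}} {value}\n'
            for (job, metric), value in sorted(self._latest.items())
        )


gateway = ToyPushgateway()
gateway.push("nightly_backup", "backup_duration_seconds", 10.0)
gateway.push("nightly_backup", "backup_duration_seconds", 12.5)  # a later run overwrites
print(gateway.render())
```

The job can exit right after pushing; the value stays available for the next scrape.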
Now that we know how Prometheus works, let's take a look at how we actually use it. We mentioned targets, metrics, and all sorts of other things. Where do we define those? In a config (YAML) file.
Q. When you define in the file which targets you want to collect data from, how does Prometheus find these targets?
A. Using the Service Discovery. It also discovers services automatically based on the application running.
(Check the official documentation for configuration)

    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    rule_files:
      # - "first.rules"
      # - "second.rules"

    scrape_configs:
      - job_name: prometheus
        static_configs:
          - targets: ['localhost:9090']

scrape_interval defines how often Prometheus is going to collect data from the targets mentioned in the file. This can of course be overridden.
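The override works per job: a job under scrape_configs may set its own scrape_interval, and jobs that do not set one use the global value. The Python sketch below mirrors that lookup on a config written as a plain dictionary. The second job (`batch`), its target, and the helper functions are hypothetical, and the duration parser is deliberately simplified to single-unit values such as "15s" or "1m".

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(duration):
    # Handles only single-unit values such as "15s" or "1m".
    return int(duration[:-1]) * UNIT_SECONDS[duration[-1]]


def effective_scrape_interval(config, job_name):
    """Return the interval in seconds that applies to `job_name`."""
    for job in config["scrape_configs"]:
        if job["job_name"] == job_name:
            # A job-level scrape_interval wins; otherwise fall back to global.
            interval = job.get("scrape_interval", config["global"]["scrape_interval"])
            return to_seconds(interval)
    raise KeyError(job_name)


config = {
    "global": {"scrape_interval": "15s", "evaluation_interval": "15s"},
    "scrape_configs": [
        {"job_name": "prometheus", "static_configs": [{"targets": ["localhost:9090"]}]},
        # Hypothetical second job with its own, slower interval.
        {"job_name": "batch", "scrape_interval": "1m",
         "static_configs": [{"targets": ["localhost:9100"]}]},
    ],
}

print(effective_scrape_interval(config, "prometheus"))
print(effective_scrape_interval(config, "batch"))
```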
rule_files - This allows us to set rules for metrics & alerts. These files can be reloaded at runtime by sending SIGHUP to the Prometheus process. The evaluation_interval defines how often these rules are evaluated. Prometheus supports 2 types of such rules:
- Recording Rules - If you run some expressions frequently, they can be precomputed and saved as a new set of time series. This makes the monitoring system a bit faster.
- Alerting Rules - These let you define conditions that, when met, send alerts to external services.
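To show how the two rule types relate, here is a conceptual Python sketch of a single evaluation pass. In Prometheus, rules are written as expressions in rule files rather than as Python functions, so everything below is an analogy: the series names, the rule names, and the 0.1 threshold are all made up.

```python
def evaluate_rules(series, recording_rules, alerting_rules):
    """One evaluation pass: run recording rules first, then alerting rules."""
    for new_name, compute in recording_rules.items():
        # A recording rule stores its result as a new series under a new name.
        series[new_name] = compute(series)
    # An alerting rule fires when its condition holds on the current data.
    return [name for name, condition in alerting_rules.items() if condition(series)]


# Hypothetical current values of two counters.
series = {"http_requests_total": 200, "http_errors_total": 30}

recording_rules = {
    "job:http_error_ratio": lambda s: s["http_errors_total"] / s["http_requests_total"],
}
alerting_rules = {
    # Reuses the precomputed series instead of recomputing the ratio.
    "HighErrorRatio": lambda s: s["job:http_error_ratio"] > 0.1,
}

firing = evaluate_rules(series, recording_rules, alerting_rules)
print(firing)
```

A pass like this would run once per evaluation_interval; the alerting rule reading the precomputed ratio is the kind of saving that recording rules are meant to provide.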
scrape_configs - Here we define the services/targets that we need Prometheus to monitor. In this example file, the job_name is prometheus, meaning that the target being monitored is the Prometheus server itself. In short, it will get data from the /metrics endpoint exposed by the Prometheus server. Here, the target is localhost:9090, the default address of the Prometheus server, so that is where Prometheus expects to find the metrics, on the /metrics path.
Prometheus has an Alertmanager that can be used to send alerts to you via email, mailing lists, etc. As mentioned above, the Prometheus server evaluates the Alerting Rules and hands firing alerts to the Alertmanager, which delivers the notifications.
Prometheus stores the metrics data on disk; this can be local storage or a remote storage system. The data is kept in a time-series format, so you cannot write data to it directly.
Let's take the example of the configuration file (config.yml) above that monitors the Prometheus server running on our machine.
(Check out the README.md file for more information)

    $ mkdir -p $GOPATH/src/github.com/prometheus
    $ cd $GOPATH/src/github.com/prometheus
    $ git clone https://github.com/prometheus/prometheus.git
    $ cd prometheus
    $ make build
    $ ./prometheus --config.file=config.yml

In the next blog we'll be looking at a few more examples of using Prometheus to monitor your Kubernetes resources, and Thanos!