Let's assume we have been working on an application for months. The application grows more complex by the day and has to be managed at scale to keep its infrastructure operational. To ensure that your application is running well, you have to answer some questions:
- Is your application up or down?
- Are its resources being utilized well?
- How much does resource demand grow after each release?
So, it is important to have a centralized view of the system to pinpoint the source of a problem.
Typically you have, let's say, multiple servers running containers. As user traffic grows, it makes sense to split these services out individually, which gets us to a microservice infrastructure. Now, if services want to connect with each other, there should be some way for them to be interconnected.
Let's say you have completed your thousand-dollar project after days and nights of hard work, and one morning you wake up to find that your application has stopped working. Some of its components or microservices have failed, and there are so many errors that you cannot tell which component or service caused the failure. Or let's say your application is responding very slowly because all the traffic is being directed to just a few servers. That is a place no one would want to be in. Debugging this manually would be very time-consuming, and this is where monitoring and alerting play an important role.
So how do you ensure that your application is being maintained properly and is running with no downtime? We need some sort of automated tool that constantly monitors our application and alerts us when something goes wrong (or right, depending on the use case). Now, in our previous example, we would be notified when a service fails, and hence we can prevent our application from going down.
Prometheus is a free software application used for event monitoring and alerting. It records real-time metrics in a time series database (allowing for high dimensionality) built using an HTTP pull model, with flexible queries and real-time alerting. Prometheus was originally built at SoundCloud and is now an open-source project under the Cloud Native Computing Foundation (CNCF).
Target - What Prometheus monitors: it can be your microservice, your application, or your Docker containers.
Metric - For each target, we want to monitor some particular property. Let's say we have some Docker containers (targets) running and we want to monitor CONTAINER_MEMORY_USAGE (a metric) for every running container.
In the diagram above, we see the important components of the Prometheus server. It consists of three parts:
Time Series Database (TSDB) - Stores the metric data, ingests it (append-only), compacts it, and allows efficient querying. Time series data are simply measurements or events that are tracked, monitored, downsampled, and aggregated over time, such as server metrics or application performance measurements.
Scrape Engine - Pulls the metrics (described above) from our target resources and sends them to the TSDB. (Prometheus pulls are called scrapes.)
Server - Used to query the data stored in the TSDB with a very powerful query language called PromQL. The results can be displayed in a dashboard using Grafana or the Prometheus UI.
A More Detailed Querying Structure:
The metrics are defined with two major attributes, HELP and TYPE:
HELP: Shows the details of the metric with a description, increasing the readability of the output.
TYPE: Prometheus offers 4 core metric types so that different kinds of metrics can be classified easily; custom metrics for specific uses are also built on top of these existing types.
- Counter: A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.
- Gauge: A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. For example: CPU_MEMORY_USAGE.
- Histogram: A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
- Summary: Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.
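Putting HELP and TYPE together, a scraped /metrics payload might contain entries like the following (the metric names and values here are purely illustrative):

```
# HELP http_requests_total The total number of HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027

# HELP node_memory_usage_bytes Current memory usage in bytes.
# TYPE node_memory_usage_bytes gauge
node_memory_usage_bytes 873792512
```

Each sample is just a metric name, optional labels in braces, and a value, which Prometheus stamps with the scrape time.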
The Data Retrieval Worker pulls the data from the targets' HTTP endpoints at the /metrics path. Here we notice 2 things: the endpoint should be exposed by the target, and the data should be in a format Prometheus understands.
Q. How do we make sure that the target services expose /metrics and that the data is in the correct format?
A. Some of them expose the endpoint by default. Those that do not need a component to do so. This component is known as an Exporter. An Exporter does the following:
1. Fetches data from the target
2. Converts the data into a format that Prometheus understands
3. Exposes it at the /metrics endpoint (which can now be scraped by the Data Retrieval Worker)
For different types of services, like APIs, databases, storage, HTTP servers, etc., Prometheus has exporters you can use.
Let's say you have written your application in Python and you want to expose metrics at the HTTP endpoint /metrics on your application instance. For this you need a client library, which is essentially an exporter that can then serve data to the Data Retrieval Worker. In the official documentation of Prometheus clients, you can find the list of all client libraries.
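In practice you would use the official Python client library for this, but to show what is happening underneath, here is a minimal sketch using only the standard library: it serves a counter at /metrics in the Prometheus text exposition format. The metric name `http_requests_total` and port 8000 are just illustrative choices.

```python
# Minimal sketch of a /metrics endpoint in the Prometheus text exposition
# format, using only the Python standard library. The metric name and port
# are illustrative; real applications should use the official client library.
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # a simple counter, incremented on every request


def render_metrics() -> str:
    # HELP and TYPE lines annotate the metric, as described above.
    return (
        "# HELP http_requests_total Total HTTP requests served.\n"
        "# TYPE http_requests_total counter\n"
        f"http_requests_total {REQUEST_COUNT}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUEST_COUNT
        REQUEST_COUNT += 1  # count this request
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


# To serve, Prometheus would then scrape http://localhost:8000/metrics:
# HTTPServer(("localhost", 8000), MetricsHandler).serve_forever()
```

A Prometheus scrape of this endpoint would see one counter sample per request served, exactly as the Data Retrieval Worker expects.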
Metrics are one of the “go-to” standards for any monitoring system, and they come in a variety of types. At its core, a metric is essentially a measurement of a property of a portion of an application or system. Metrics make an observation by keeping track of the state of an object. These observations are a value or a series of related values combined with a timestamp that describes them; the output is commonly called time-series data.
Prometheus is a pull-based system that pulls data from configured sources at regular intervals.
As mentioned above, Prometheus uses a pull mechanism to get data from targets, whereas most other monitoring systems use a push mechanism. How is this different, and what makes Prometheus so special?
Q. What do you mean by push mechanism?
A. Instead of the server of the monitoring tool making requests to get the data, the servers of the application push the data to a database instead.
Q. Why is Prometheus better?
A. Multiple Prometheus instances can simply pull the data from the target's endpoint. Also note that this way Prometheus can detect whether an application is responsive or not, rather than waiting for the target to push data.
(Check out the official comparison documentation)
NOTE: But what happens if targets are too short-lived to be around when the pull request happens? For this, Prometheus provides the Pushgateway. Such services push their data to the Pushgateway, and the Data Retrieval Worker then scrapes the Pushgateway as usual. This way, you get the best of both approaches!
So far we have seen what Prometheus is and what the Prometheus architecture looks like. Now let's see how we can set up Prometheus locally.
Q. When you define in the config file which targets you want to collect data from, how does Prometheus find these targets?
A. Using service discovery. Prometheus can also discover targets automatically based on the applications running.
The most important file in Prometheus is its YAML config file, prometheus.yml, where we define the whole set of instructions that build the Prometheus server:
```yaml
global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]
  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]
  # How frequently to evaluate rules.
  [ evaluation_interval: <duration> | default = 1m ]
  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]
  # File to which PromQL queries are logged.
  # Reloading the configuration will reopen the file.
  [ query_log_file: <string> ]

# Rule files specify a list of globs. Rules and alerts are read from
# all matching files.
rule_files:
  [ - <filepath_glob> ... ]

# A list of scrape configurations.
scrape_configs:
  [ - <scrape_config> ... ]

# Alerting specifies settings related to the Alertmanager.
alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]

# Settings related to the remote write feature.
remote_write:
  [ - <remote_write> ... ]

# Settings related to the remote read feature.
remote_read:
  [ - <remote_read> ... ]
```
(Check the official documentation for configuration)
A More Simplified Prometheus.yml!!!
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
```
`scrape_interval` defines how often Prometheus is going to collect data from the targets mentioned in the file. This can of course be overridden.
`rule_files` - This allows us to set rules for metrics & alerts. These files can be reloaded at runtime by sending SIGHUP to the Prometheus process. The `evaluation_interval` defines how often these rules are evaluated. Prometheus supports 2 types of such rules:
- Recording Rules - If you are performing some frequent operations, they can be precomputed and saved as a new set of time series. This makes the monitoring system a bit faster.
- Alerting Rules - This lets you define conditions to send alerts to external services, for example, when a particular condition is triggered.
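As a sketch, a rules file combining one rule of each type might look like this (the file name, metric names, and thresholds are illustrative; the file would be referenced from prometheus.yml under rule_files):

```yaml
# example_rules.yml -- illustrative names; adapt to your own metrics.
groups:
  - name: example
    rules:
      # Recording rule: precompute the per-instance 5m request rate
      # and save it as a new time series.
      - record: instance:http_requests:rate5m
        expr: rate(http_requests_total[5m])

      # Alerting rule: fire if a target has been unreachable for 5 minutes.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
```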
`scrape_configs` - Here we define the services/targets that we need Prometheus to monitor. In this example file, the job_name is `prometheus`, meaning that Prometheus is monitoring itself as a target. In short, it will get data from the /metrics endpoint exposed by the Prometheus server. Here, the target by default is `localhost:9090`, which is where Prometheus will expect the metrics to be, at `localhost:9090/metrics`.
Prometheus has an Alertmanager that is used to manage alerts and send them via email, webhooks, Slack, and other channels. As mentioned above, the Prometheus server uses the alerting rules to trigger alerts.
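The Alertmanager has its own config file. A minimal sketch might look like the following (the receiver name and email address are placeholders, and a real setup would also need SMTP settings):

```yaml
# alertmanager.yml -- minimal sketch; names and addresses are placeholders.
route:
  receiver: team-email        # default receiver for all alerts
  group_by: ['alertname']     # batch alerts with the same name together
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: team-email
    email_configs:
      - to: 'oncall@example.com'
```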
Where is the data stored?
The data collected by the Data Retrieval Worker is stored in the TSDB and queried using PromQL. You can use a web UI to request data from the Prometheus server via PromQL.
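For a flavor of what such queries look like, here are a few typical PromQL expressions (the metric names are illustrative; `node_memory_Active_bytes` comes from node_exporter):

```
# Current value of a gauge, filtered by a label:
node_memory_Active_bytes{instance="localhost:9100"}

# Per-second rate of a counter over the last 5 minutes:
rate(http_requests_total[5m])

# 95th percentile request duration computed from a histogram:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```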
For demonstration purposes we will use Docker to make things easy and reproducible anywhere. Here is a simple Dockerfile which sets the stage for us:
- Starts off with the Ubuntu 18.04 (bionic) official image
- Installs some of the tools we will be using, like wget, screen, and vim
- Downloads the latest binary releases of Prometheus, node_exporter, and Alertmanager
- Downloads Grafana, which will be used later for visualization
- Exposes the default ports of the respective services
```dockerfile
FROM ubuntu:bionic
LABEL Name="Ritesh" Mail="email@example.com"

RUN apt-get update \
    && apt-get install -y wget \
    && apt-get install -y screen \
    && apt-get install -y vim

WORKDIR /root

RUN wget -nv https://github.com/prometheus/prometheus/releases/download/v2.22.2/prometheus-2.22.2.linux-amd64.tar.gz
RUN wget -nv https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
RUN wget -nv https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
RUN wget -nv https://dl.grafana.com/oss/release/grafana-7.3.3.linux-amd64.tar.gz

# node_exporter port
EXPOSE 9100
# Prometheus server port
EXPOSE 9090
# Grafana port
EXPOSE 3000
# Alertmanager port
EXPOSE 9093
```
Now we will create a simple docker-compose file to run our image. All it really does is give us a handy service name to work with and expose multiple ports to the host, so we can access the services from the host machine. I generally prefer docker-compose over writing long docker run commands with options.
```yaml
version: '3'
services:
  prometheus_demo:
    build: .
    ports:
      - "9100:9100"
      - "9090:9090"
      - "3000:3000"
      - "9093:9093"
```
Once these two files are in place, you can go to the folder containing them and run the service with docker-compose in interactive mode. Note that --service-ports is a very important option: it enables the port bindings (which are disabled by default for docker-compose run).
docker-compose run --service-ports prometheus_demo
Once you run the command, you will see a bunch of outputs corresponding to the build steps in Dockerfile.
After all the steps are done you will be dropped into an interactive shell inside the Docker container. If you want to check the running container built from the prometheus_demo image, you can use docker-compose ps or docker ps -a; you will see a single container that will run multiple services (Grafana, Prometheus, node_exporter, and Alertmanager) on the default ports or the ports mentioned in the Dockerfile.
Now let's extract all the packages which we downloaded in the Dockerfile.
tar xvf prometheus-2.22.2.linux-amd64.tar.gz
tar xvf node_exporter-0.18.1.linux-amd64.tar.gz
tar xvf grafana-7.3.3.linux-amd64.tar.gz
Similarly, you can extract other packages like alertmanager.
The next step is to move into the package directory. I am showing this with Prometheus; you can follow the same steps for the other packages.
cd prometheus-2.22.2.linux-amd64
Now you can run the Prometheus binary to get your Prometheus monitoring system running:
./prometheus --config.file=prometheus.yml
Now you will see a message in the logs that the server is up and ready to receive web requests. Your server is running at localhost:9090, where you'll get the following Prometheus UI dashboard that you can now configure:
In the above Prometheus UI Dashboard we're monitoring the Docker Containers.
Now you can connect your Prometheus server to node_exporter (an exporter that reports the health of a Linux server by exposing a /metrics endpoint to the Data Retrieval Worker) by adding another job_name to your prometheus.yml file, as discussed in the prometheus.yml section above. node_exporter will fetch you metrics like CPU_LOAD in your Prometheus dashboard, and you can also see node_exporter's /metrics HTTP endpoint at localhost:9100.
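Assuming node_exporter is running on its default port 9100, the extra job in prometheus.yml could look like this (the job name `node` is just a label of our choosing):

```yaml
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  # Additional job scraping node_exporter on its default port.
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
```

After reloading the config, the new target appears on the Prometheus targets page and its metrics become queryable.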
You can set up Grafana by following the steps below.
You will find your Grafana server running at localhost:3000.
Now you can fetch the metrics and see beautiful visualizations in the Grafana dashboard.
- Monitor Prometheus Health
- Monitor HOST Machine Health
In the next blog, we will discuss more advanced visualization of Prometheus data with Grafana and also monitor Kubernetes (K8s) using Prometheus!