
Aleksi Waldén for Polar Squad


Prometheus Observability Platform: Alerts

With Prometheus, we can use PromQL to write alert rules and evaluate them at a configured evaluation interval. Alert rules also have a pending period (the for field): if the alert condition stays active for that duration, the alert fires. Prometheus is usually bundled with a component called Alertmanager, which routes alerts to different receivers such as Slack and email. Once an alert fires, it is sent to Alertmanager, which uses a routing tree to decide whether the alert should be sent to a receiver, and how to route it.
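
For illustration, a minimal Alertmanager routing configuration could look like the sketch below; the receivers, addresses, and the Slack webhook URL are placeholders, not part of this setup:

global:
  smtp_smarthost: 'smtp.example.com:587'   # placeholder SMTP relay
  smtp_from: 'alertmanager@example.com'

route:
  receiver: team-email                      # fallback receiver for unmatched alerts
  group_by: ['alertname', 'namespace']
  routes:
    - matchers:
        - severity="warning"
      receiver: slack-alerts                # warnings are routed to Slack instead

receivers:
  - name: team-email
    email_configs:
      - to: 'team@example.com'
  - name: slack-alerts
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/ME'   # placeholder webhook
        channel: '#alerts'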

Prometheus alerts are evaluated against the local storage. With VictoriaMetrics, we can use the vmalert component to evaluate alert rules against the VictoriaMetrics long-term storage using the same PromQL syntax as with Prometheus. It is tempting to write all the alerting rules in VictoriaMetrics, but depending on the size of the infrastructure we might want to evaluate some rules on the Prometheus servers where the data originates from, to avoid overloading VictoriaMetrics.

Alert rules can be very complex, and it is best to validate them before deploying them to Prometheus. Promtool can be used to validate Prometheus alerting rules and run unit tests on them. You can implement these simple validation and unit testing steps in your continuous integration (CI) system.
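
As an illustration, a CI step could boil down to a small script like this (the Prometheus version and file names are examples; adjust them to your setup):

#!/usr/bin/env bash
set -euo pipefail

# promtool ships inside the Prometheus release tarball
PROM_VERSION=2.53.0
curl -sL "https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz" \
  | tar xzf - --strip-components=1 "prometheus-${PROM_VERSION}.linux-amd64/promtool"

# Validate the rule syntax, then run the unit tests
./promtool check rules kube-alert.rules.yml
./promtool test rules kube-alert.test.yml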

A good monitoring platform enables teams to write their own alerts against the metrics stored in the long-term storage. We can do this in a mono-repository or multi-repository fashion. With a mono-repository, we have all the infrastructure and the alerting defined in the same repository and pipelines delivering them to servers. A multi-repository approach would set up a separate repository for the alerts, where we define the alerting rules using PromQL, and add validation and unit tests.

The main benefit of the multi-repository approach is reduced cognitive load. Contributors do not see, or need to be aware of, anything other than the alert rules. This also eliminates the possibility of introducing bugs into the underlying infrastructure. The downside of this approach is tying the separated alerting configuration back to the Prometheus server.

Terraform can be used to pull the alerting repository in as a remote module, so the alerting rules are fetched when the Prometheus server is deployed. With a mono-repository, we can more easily tie the alerts to the Prometheus server, but if we are using Terraform we either need to split the alerts into their own state, or accept that contributors might affect more resources than just the alerts, which can also make contributing more intimidating.
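
As a rough sketch of the remote-module approach (the repository URL, the module output name, and the target path are hypothetical):

module "alert_rules" {
  # Hypothetical Git repository that contains only the alert rule files
  source = "git::https://example.com/acme/prometheus-alert-rules.git?ref=main"
}

# Write the rules next to the Prometheus configuration when deploying the server
resource "local_file" "kube_alerts" {
  content  = module.alert_rules.kube_alert_rules   # assumed module output exposing the rule file
  filename = "${path.module}/rules/kube-alert.rules.yml"
}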

Demo

Prerequisites: this example assumes that you have completed the steps from the earlier parts of this series, as the components set up there are needed here.

Figuring out suitable metrics for alerts can be hard. The awesome-prometheus-alerts website is an excellent source of inspiration: it has a collection of pre-made alerts written in PromQL. For example, we can set up an alert for crash-looping Kubernetes pods, named KubernetesPodCrashLooping.

Below, we write an example unit test for the KubernetesPodCrashLooping alert. First, we simplify the alert a little and add the blocks promtool needs to be able to validate the rule. This file is saved as kube-alert.rules.yml:

groups:
  - name: kube-alerts
    rules:
    - alert: KubernetesPodCrashLooping
      expr: increase(kube_pod_container_status_restarts_total[5m]) > 2
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Pod {{$labels.namespace}}/{{$labels.pod}} is crash looping

We can use the command promtool check rules kube-alert.rules.yml to validate the rule. If everything is OK, the response looks like this:

promtool check rules kube-alert.rules.yml
---
Checking kube-alert.rules.yml
  SUCCESS: 1 rules found

To write a unit test for this alert, we create a file called kube-alert.test.yml:

rule_files:
  - kube-alert.rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m

    input_series:
      - series: kube_pod_container_status_restarts_total{namespace="test-namespace",pod="test-pod"}
        values: '1+2x15'

    alert_rule_test:
      - alertname: KubernetesPodCrashLooping
        eval_time: 15m
        exp_alerts:
          - exp_labels:
              severity: warning
              namespace: test-namespace
              pod: test-pod
            exp_annotations:
              summary: Pod test-namespace/test-pod is crash looping

So we are expecting an increase of more than 2 in the kube_pod_container_status_restarts_total time series within 5 minutes, and that this condition stays active for at least 10 minutes. We also expect the namespace and pod labels to show up in the summary annotation, and a severity label with the value "warning".

To write a test for this rule, we need an input series that triggers the rule and carries all the labels needed for the summary annotation. Because our evaluation time is 15 minutes and the test interval is 1 minute, the series needs samples covering at least 15 minutes. The syntax '1+2x15' starts at 1 and adds 2 to the previous value 15 times, producing 16 samples. We also set the required namespace and pod labels and write out the expected summary annotation.
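
Concretely, the input series and the value the alert expression sees expand roughly as follows (increase() applies some extrapolation at the window boundaries, so the numbers are approximate):

# '1+2x15' expands to one sample per 1m test interval:
# t:      0m  1m  2m  3m  4m  5m  ...  15m
# value:   1   3   5   7   9  11  ...   31
#
# At eval_time 15m, increase(kube_pod_container_status_restarts_total[5m])
# is roughly 31 - 21 = 10, which is > 2. The condition has already been true
# for well over the 10m "for" duration, so the alert is firing at 15m.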

To run the unit test we use the command promtool test rules kube-alert.test.yml which will return the following response if all went well:

promtool test rules kube-alert.test.yml
---
Unit Testing:  kube-alert.test.yml
  SUCCESS

Next we need to deploy vmalert so that we can evaluate alert rules against the data in the long-term storage.
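
If the VictoriaMetrics Helm repository is not already configured from the earlier parts, it can be added with:

helm repo add vm https://victoriametrics.github.io/helm-charts/
helm repo update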

First, we have to convert our alert rule into a format that works with Helm. The problem is that promtool requires the groups: section, while the Helm chart's server.config.alerts.groups value already provides it, so we need to strip groups: from the rule when passing it to Helm; if we removed it from the file itself, promtool would no longer work. There are multiple ways to handle this, for example Terraform's trimprefix() function can strip the groups: prefix from the rule file. For this use case we are going to use a monstrous one-liner that removes the groups: line, converts the output to JSON, and flattens it to a single line so we can pass it to Helm:

cat kube-alert.rules.yml | sed '/groups:/d' | yq -o=json | jq -c

This gives us the following single-line JSON string:

[{"name":"kube-alerts","rules":[{"alert":"KubernetesPodCrashLooping","expr":"increase(kube_pod_container_status_restarts_total[5m]) > 2","for":"10m","labels":{"severity":"warning"},"annotations":{"summary":"Pod {{$labels.namespace}}/{{$labels.pod}} is crash looping"}}]}]

Now we can deploy the vmalert Helm chart:

helm install vmalert vm/victoria-metrics-alert --namespace victoriametrics \
  --set 'server.notifier.alertmanager.url=http://localhost:9093' \
  --set 'server.datasource.url=http://vmcluster-victoria-metrics-cluster-vmselect:8481/select/0/prometheus' \
  --set 'server.remote.write.url=http://vmcluster-victoria-metrics-cluster-vminsert:8480/insert/0/prometheus' \
  --set 'server.remote.read.url=http://vmcluster-victoria-metrics-cluster-vmselect:8481/select/0/prometheus' \
  --set-json 'server.config.alerts.groups=[{"name":"kube-alerts","rules":[{"alert":"KubernetesPodCrashLooping","expr":"increase(kube_pod_container_status_restarts_total[5m]) > 2","for":"10m","labels":{"severity":"warning"},"annotations":{"summary":"Pod {{$labels.namespace}}/{{$labels.pod}} is crash looping"}}]}]'
  • server.notifier.alertmanager.url: We are using a placeholder value here for now, as the chart cannot be installed without providing some value
  • server.datasource.url: Prometheus HTTP API compatible datasource (the vmselect endpoint)
  • server.remote.write.url: Remote write URL for storing rule results and alert states
  • server.remote.read.url: URL to restore the alert states from
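
Alternatively, the same settings can be kept in a values file instead of the long one-liner. Below is a sketch that mirrors the flags above; the file name values-vmalert.yaml is just an example, and it would be installed with helm install vmalert vm/victoria-metrics-alert --namespace victoriametrics -f values-vmalert.yaml:

server:
  notifier:
    alertmanager:
      url: http://localhost:9093   # placeholder until Alertmanager is set up
  datasource:
    url: http://vmcluster-victoria-metrics-cluster-vmselect:8481/select/0/prometheus
  remote:
    write:
      url: http://vmcluster-victoria-metrics-cluster-vminsert:8480/insert/0/prometheus
    read:
      url: http://vmcluster-victoria-metrics-cluster-vmselect:8481/select/0/prometheus
  config:
    alerts:
      groups:
        - name: kube-alerts
          rules:
            - alert: KubernetesPodCrashLooping
              expr: increase(kube_pod_container_status_restarts_total[5m]) > 2
              for: 10m
              labels:
                severity: warning
              annotations:
                summary: Pod {{$labels.namespace}}/{{$labels.pod}} is crash looping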

We can now port-forward the vmalert service and navigate to the web UI at http://localhost:9090.
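
A port-forward along these lines should work; the service name and port are assumptions based on the chart's naming, so verify them with kubectl get svc --namespace victoriametrics:

# forward local port 9090 to the vmalert server service (name/port may differ)
kubectl port-forward --namespace victoriametrics svc/vmalert-victoria-metrics-alert-server 9090:8880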

(Screenshot: the vmalert web UI showing the alert)

We have now created an alert rule, written a unit test for it, and set up vmalert with the alert rule defined.

Next part: Prometheus Observability Platform: Alert routing
