Why is Prometheus Pull-Based?

#observability #prometheus

Let's get the short answer out of the way: the creators consider pull "very slightly better" (in other words, they wanted to). That phrasing comes from excerpts of an episode of The New Stack podcast, linked in the references.

Maybe it was my experience at New Relic, or installing one of the many agent-based solutions, but it's easy to forget that Prometheus is not the only pull-based monitoring solution, however odd it may seem for a tool to scrape text pages over HTTP to gather low-level metrics about your systems. For example, SNMP and JMX are pull-based, and even Nagios often operates via a pull model.

To not bury the lede: before I started my research, I thought there were more important considerations when choosing a monitoring system, and the research I did for this article did not change that. Still, understanding the methodology your system uses (in the case of Prometheus, knowing that it is pull-oriented) is useful for troubleshooting and for understanding how it fails. It is in fact essential for interpreting the most basic metric of all: is the monitored service up or down? Prometheus is effective at this because it actively probes each target to get metrics. A server in a push-based system may not know if a silent system is down, or if it has been decommissioned.

What is Prometheus and what does it mean for it to be pull based?

There are plenty of fancy definitions for Prometheus out there, but I'm going to use my own words because I think this is a more practical description:

Prometheus is an open source metrics and alerting system that stores the time series data it gathers, along with metadata in the form of labels, by scraping specially formatted text pages served over HTTP by its targets. Prometheus can be hierarchical: you can have multiple Prometheus servers all gathering their own metrics data and writing that data upstream to a central aggregator. Commonly, a tool like Grafana is used to visualize the metrics.

Being pull-based means that the Prometheus server pulls metrics from targets (your infrastructure and applications), rather than your infrastructure and applications pushing metrics data to Prometheus.
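
To make the pull model concrete, here is a minimal sketch using only the Python standard library. It is not how Prometheus itself is implemented: the metric name `demo_requests_total`, the handler class, and the `scrape`, `parse_samples`, and `demo` helpers are all made up for illustration, and the parser only understands the simplest `name value` lines of the text format.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative payload in the style of the Prometheus text exposition format.
# The metric name and value are assumptions, not taken from a real exporter.
METRICS_BODY = (
    "# HELP demo_requests_total Total requests handled (illustrative).\n"
    "# TYPE demo_requests_total counter\n"
    'demo_requests_total{path="/"} 42\n'
)


class MetricsHandler(BaseHTTPRequestHandler):
    """The target side: it only answers when someone asks for /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = METRICS_BODY.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def scrape(url, timeout=5.0):
    """The server side of the pull model: fetch the target's metrics page."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text):
    """Simplified parser: skips comments, assumes 'series value' per line."""
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


def demo():
    """Start a throwaway target on a free local port and scrape it once."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return parse_samples(scrape(f"http://127.0.0.1:{port}/metrics"))
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` starts the target, pulls from it once, and returns the parsed samples. Note that the target never initiates anything: all timing decisions live in whoever calls `scrape`.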

What does Prometheus documentation say about why it is pull-based?

The official FAQ says:

Pulling over HTTP offers a number of advantages:

-   You can start extra monitoring instances as needed, e.g. on your laptop when developing changes.
-   You can more easily and reliably tell if a target is down.
-   You can manually go to a target and inspect its health with a web browser.

Overall, we believe that pulling is slightly better than pushing, but it should not be considered a major point when considering a monitoring system.

This description leaves me wanting, but fortunately there is a bit more information on the "Pull does not scale, or does it" page on the Prometheus website. One reason given there is that, because your monitoring system pulls data rather than having data pushed to it, it is much less likely to be overwhelmed by misconfigured agents pushing too much data.

Another reason is that a pull-based system requires you to know the "good state" of your environment. A push-based system does not need service discovery to learn which systems it should be monitoring, but without that list it is harder to tell whether a system is down because of an outage or has been decommissioned.
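
The "good state" idea can be sketched in a few lines of Python. This is an illustration under assumptions, not Prometheus code: `check_targets`, `http_fetch`, `fake_fetch`, and the `example` hostnames are all hypothetical. The point is only that a puller starts from a list of expected targets, so a failed fetch can be recorded as an explicit 0, similar in spirit to the `up` series Prometheus records for each scrape.

```python
import urllib.request


def http_fetch(url, timeout=2.0):
    """Fetch a metrics page. Connection errors, HTTP errors, and timeouts
    all surface as OSError subclasses."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def check_targets(expected_targets, fetch=http_fetch):
    """Return {target: 1 or 0} for every target we expect to exist."""
    up = {}
    for target in expected_targets:
        try:
            fetch(target)
            up[target] = 1
        except (OSError, ValueError):
            # Unreachable, timed out, or malformed URL: explicitly down.
            up[target] = 0
    return up


def fake_fetch(url):
    """Stand-in for the network so the example runs anywhere."""
    if url.startswith("http://down."):
        raise ConnectionRefusedError("connection refused")
    return "demo_metric 1\n"


# Hypothetical target list, as a service discovery mechanism might supply it.
EXPECTED = ["http://up.example/metrics", "http://down.example/metrics"]
print(check_targets(EXPECTED, fetch=fake_fetch))
# prints {'http://up.example/metrics': 1, 'http://down.example/metrics': 0}
```

A push-based receiver has no equivalent of `EXPECTED` unless you maintain one separately, which is why silence from a host is ambiguous there.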

What can we infer, and what do I think?

Based on my experience using Prometheus and my reading around it, I can highlight a few additional pros and cons of pull-based metrics in Kubernetes.

Pros

-   Applications being monitored don't have to cache old data to resend if the server goes down. On the other hand, Prometheus loses data for any period in which the server is down and not actively polling.
-   Configuration settings and pull intervals are administered centrally. You can quickly change intervals if your system is getting overwhelmed, without waiting for a potentially expensive application change.

Cons

-   It requires a service discovery mechanism. Since Prometheus has to pull from your applications, it has to know where they are. Common service discovery mechanisms include the Kubernetes and EC2 APIs and static configurations, and are listed on [this page](https://prometheus.io/docs/prometheus/latest/configuration/configuration/).
-   Endpoints must be reachable from the server. This can be mitigated by colocating a Prometheus server with the services (for example, on a private Kubernetes cluster) and then pushing up to the central server.
-   HTTP requests can time out, especially on large exporters. I've seen VMware or AWS exporters that export hundreds of thousands of metrics, which requires high HTTP timeouts.

On push-based metrics with Prometheus

You can use push-based metrics with Prometheus. It's not recommended, but the [Pushgateway](https://prometheus.io/docs/practices/pushing/) will allow you to do that. The documentation says: "Usually, the only valid use case for the Pushgateway is for capturing the outcome of a service-level batch job. A "service-level" batch job is one which is not semantically related to a specific machine or job instance (for example, a batch job that deletes a number of users for an entire service)." I'll let you read the rest on the website.

One point to note, however, is that the Pushgateway does not push timed events to the Prometheus backend. It sets up an endpoint for Prometheus to scrape.
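
A small simulation shows what "an endpoint to scrape" implies. The `LastValueGateway` class below is a hypothetical in-memory stand-in, not the Pushgateway's actual code or API; it assumes only the behavior described above, namely that the most recently pushed value is what a scrape sees. The 5-second push and 15-second scrape intervals are borrowed from the Stack Overflow question listed in the references.

```python
class LastValueGateway:
    """Hypothetical stand-in for a push gateway: keeps only the latest
    value per metric and renders it when scraped."""

    def __init__(self):
        self._latest = {}

    def push(self, name, value):
        # A new push overwrites the previous value; nothing is queued.
        self._latest[name] = value

    def render(self):
        return "".join(
            f"{name} {value}\n" for name, value in sorted(self._latest.items())
        )


def simulate(push_every=5, scrape_every=15, duration=60):
    """Push the current second as the value, scrape on a slower schedule."""
    gateway = LastValueGateway()
    pushed, scraped = [], []
    for t in range(1, duration + 1):
        if t % push_every == 0:
            gateway.push("job_last_value", t)
            pushed.append(t)
        if t % scrape_every == 0:
            scraped.append(gateway.render())
    return pushed, scraped


pushed, scraped = simulate()
print(len(pushed), "values pushed,", len(scraped), "seen by the scraper")
# prints: 12 values pushed, 4 seen by the scraper
```

With these intervals the scraper sees one pushed value in three (15/5), and the values in between are simply overwritten. That is fine for the outcome of a batch job, and a poor fit for anything that needs every event.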

Summary

Overall, I don't think pull vs. push matters that much to the end user or admin. That said, understanding how your monitoring system works is important for setting it up and for troubleshooting problems as they arise. As previously stated, it also has implications for how the monitoring system determines whether a resource is up or down.

References

https://www.alibabacloud.com/blog/pull-or-push-how-to-select-monitoring-systems_599007
- This document has a nice table in section three summarizing the different aspects of pull and push monitoring.
https://thenewstack.io/exploring-prometheus-use-cases-brian-brazil/
- Very high-level overview; includes a podcast with the creators.
- “It’s ‘pull can’t scale,’ ‘push can’t scale,’ or ‘They both have security problems,’ which they do, depending on the context,” Brazil noted. “From an engineering standpoint, in reality, the question of push versus pull largely doesn’t matter. In either case, there’s advantages and disadvantages, and with engineering effort, you can work around both cases. It is my belief that pull is very slightly better.”
https://stackoverflow.com/questions/58574132/understand-prometheus-metrics-pulling
- A question about how the Telegraf exporter works.
- Makes the point that the Telegraf exporter does not push data to Prometheus; it exposes a web endpoint.
- Asks what happens if Telegraf updates its data every 5 seconds but Prometheus scrapes every 15 seconds. The answer is that Prometheus gets the last data point exposed, so it sees roughly one out of every 3 values (15/5).
- One answer seems to incorrectly identify the Pushgateway as event-based, but the Pushgateway works the same way: data is pushed to the Pushgateway, and then the last data point is exposed on the HTTP endpoint as Prometheus metrics. There is some back and forth about this.
https://steve-mushero.medium.com/push-vs-pull-configs-for-monitoring-c541eaf9e927
- Steve Mushero goes into detail on the implications of push vs. pull as it pertains to how monitoring is configured.

DEV Community
