Ahsan Nabi Dar


Prometheus metrics collection from distributed sources; using TLS to secure Prometheus remote write via Vector

Prometheus is a great tool for collecting metrics, whether they are published by applications themselves or by exporters on behalf of other applications. It is fairly straightforward to get started when you scrape metrics from private or locally deployed apps and visualise them in a similarly hosted Grafana instance. As everything is in local scope, you don't have to write anything to a remote, i.e. push samples to a remote Prometheus for Grafana to use. Instead, Prometheus pulls metrics from the application endpoints at each scrape interval, and Grafana queries that Prometheus.

[Figure: local Prometheus and Grafana setup]

A simple Prometheus scrape config for HAProxy and OpenResty, enough to get you up and running with the metrics those services publish, would be:

scrape_configs:
  - job_name: 'haproxy'
    # Override the global default and scrape targets from this job every 30 seconds.
    scrape_interval: 30s
    metrics_path: /metrics
    static_configs:
      - targets: ['haproxy:18081']
        labels:
          group: 'haproxy'

  - job_name: 'openresty'
    # Override the global default and scrape targets from this job every 30 seconds.
    scrape_interval: 30s
    metrics_path: /metrics
    static_configs:
      - targets: ['openresty:9145']
        labels:
          group: 'openresty'


But when you have more than one host and need to collect metrics from multiple locations, things start to get complex. As each Prometheus is deployed internally, you move from the pull model towards a push model, sending metrics to a remote Grafana deployment.

[Figure: distributed hosts with Grafana Cloud]

To have a single hosted Grafana you can use Grafana Cloud's free offering, which is generous for hobby projects at 10,000 metrics/month. I started with that and soon ran out of quota with multiple hosts. The sweet spot for me was to scrape every 3600 seconds, i.e. metrics visibility once an hour (yes, I have more than one personal server ;)). So I moved to hosting my own Grafana instance. We will come to that later, once we have laid out how Prometheus works and why we need TLS and Vector if Prometheus can already do what Grafana requires.

When you add a Prometheus connection in Grafana Cloud, it gives you an API endpoint, a username and a password. Does that sound OK? With a long enough password it would take someone significant effort to brute-force their way into your Prometheus endpoint, but that is still not very reassuring. Making your local Prometheus instance send data to a remote endpoint is quite simple: add a remote_write section to your current config and data starts flowing.

remote_write:
- url: PROM_REMOTE_WRITE_ENDPOINT
  basic_auth:
    username: GRAFANA_PROM_USERNAME
    password: GRAFANA_API_KEY

scrape_configs:
  - job_name: 'haproxy'
    # Override the global default and scrape targets from this job every 30 seconds.
    scrape_interval: 30s
    metrics_path: /metrics
    static_configs:
      - targets: ['haproxy:18081']
        labels:
          group: 'haproxy'

  - job_name: 'openresty'
    # Override the global default and scrape targets from this job every 30 seconds.
    scrape_interval: 30s
    metrics_path: /metrics
    static_configs:
      - targets: ['openresty:9145']
        labels:
          group: 'openresty'

A point to note: according to the Prometheus documentation, remote write can increase memory usage by around 25%, but this push-based strategy is also the only way to publish metrics to remote sinks.
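If the memory overhead or a hosted quota is a concern, one option is to send less. The following is a minimal sketch, not the author's configuration: it extends the remote_write block above with write_relabel_configs and queue_config, both of which are standard Prometheus remote_write settings. The metric-name regex and the queue values are illustrative assumptions that you would tune for your own setup.

```yaml
remote_write:
  - url: PROM_REMOTE_WRITE_ENDPOINT
    basic_auth:
      username: GRAFANA_PROM_USERNAME
      password: GRAFANA_API_KEY
    # Forward only the series you actually chart; everything else stays local.
    # The regex is a hypothetical example, adjust it to your metric names.
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'haproxy_(frontend|backend)_.*'
        action: keep
    # Smaller queues trade throughput for a lower memory footprint.
    # These numbers are illustrative, not recommendations.
    queue_config:
      capacity: 2500
      max_shards: 10
      max_samples_per_send: 500
```

Filtering at the remote write stage keeps the full set of metrics available in the local Prometheus while reducing what counts against the remote side.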

So now we have established how to collect metrics and publish them to a local or remote Prometheus and Grafana instance. Let's look at how to run a globally distributed Prometheus setup the way Grafana Cloud does and avoid the limits of the free account. You do pay for the server you run it on, but there are plenty of cheap servers for personal workloads, so you need not break the bank signing up for a SaaS service. From a bird's-eye view it looks like the following, whether you use the SaaS offering or run your own setup to scale and collect metrics.

[Figure: remote Prometheus setup]

To work the same way as Grafana Cloud, you need to run a Prometheus instance on the remote server alongside Grafana and point your Grafana connection at that local Prometheus, while all other hosts publish to the Prometheus on the remote host. Prometheus exposes an endpoint to write to, which you can reverse proxy if you want; the default Remote Write Receiver endpoint is /api/v1/write. It must be configured for authenticated access, otherwise it is publicly accessible once exposed.
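The Grafana side of this can be set up in the UI or through a provisioning file. As a rough sketch, assuming Grafana and Prometheus share the remote server and Prometheus listens on its default port 9090, a data source provisioning file could look like the one below. The data source name, the URL and the placeholder credentials are assumptions for illustration; the basic auth fields only apply if you protect Prometheus with basic auth.

```yaml
# Hypothetical Grafana data source provisioning file,
# e.g. placed under Grafana's provisioning/datasources directory.
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Prometheus running next to Grafana on the same server (assumed address)
    url: http://localhost:9090
    # Only needed when Prometheus is protected with basic auth
    basicAuth: true
    basicAuthUser: <username>
    secureJsonData:
      basicAuthPassword: <password>
```

Because Grafana talks to Prometheus over localhost here, only the remote write endpoint needs to be reachable from the other hosts.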

To secure the endpoint and the Prometheus dashboard, define a web config YAML file and pass it to the service at startup with the following flags.

--web.config.file=/etc/prometheus/web.yml \
--web.enable-remote-write-receiver

The web config file (/etc/prometheus/web.yml above) contains the basic auth users:

basic_auth_users:
    <username>: <bcrypted password>


Pick any username and set it in place of <username>. For the password, bcrypt it and put the resulting hash in place of <bcrypted password>.

So now you have a setup that is as secure as Grafana Cloud? Maybe on paper, as it uses the same authentication strategy, basic auth. This is not great: a password alone does nothing to verify the client before it connects. Even with HTTPS on the remote write, the endpoint is still exposed to brute-force attempts and DDoS.

So let's take it up a notch and add TLS with client certificates to our remote write, and along the way replace Prometheus with Vector on all hosts pushing data. To put that into perspective before we go into more config, it looks like this.

[Figure: Vector and Prometheus setup]

The Prometheus scrape config translates to the following Vector config, using the prometheus_scrape source and the prometheus_remote_write sink:


[sources.prometheus_haproxy]
  type = "prometheus_scrape"
  endpoints = ["http://haproxy:18081/metrics"]
  scrape_interval_secs = 15
  instance_tag = "instance"

[sources.prometheus_openresty]
  type = "prometheus_scrape"
  endpoints = ["http://openresty:9145/metrics"]
  scrape_interval_secs = 15
  instance_tag = "instance"

[transforms.tag_prometheus_haproxy]
type = "remap"
inputs = ["prometheus_haproxy"]
source = '''
.tags.job = "haproxy"
'''

[transforms.tag_prometheus_openresty]
type = "remap"
inputs = ["prometheus_openresty"]
source = '''
.tags.job = "openresty"
'''

[sinks.prometheus_remote_write]
  type = "prometheus_remote_write"
  inputs = ["tag_prometheus_haproxy", "tag_prometheus_openresty"]
  endpoint = "${PROM_REMOTE_WRITE_ENDPOINT}"
[sinks.prometheus_remote_write.auth]
  strategy = "basic"
  user = "${PROM_REMOTE_WRITE_USERNAME}"
  password = "${PROM_REMOTE_WRITE_PASSWORD}"
[sinks.prometheus_remote_write.buffer]
  type = "disk"
  when_full = "drop_newest"
  max_size = 268435488
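[sinks.prometheus_remote_write.tls]
  # Certificate used as the trust anchor to verify the remote Vector instance
  ca_file = "/opt/vector/tls/server_cert.crt"
  # Client certificate and key this host presents to the remote Vector instance
  crt_file = "/opt/vector/tls/client_cert.crt"
  key_file = "/opt/vector/tls/client_key.key"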


With this, you have replaced your local Prometheus. You now need to set up a remote Vector instance that receives data only from sources that not only hold the credentials but can also prove their identity with a certificate, and are therefore allowed to establish a connection and transmit data.

For Vector to receive Prometheus remote write and forward it to a Prometheus remote write endpoint, set up both the source and the sink as prometheus_remote_write:

[sources.prometheus_remote_receive]
  type = "prometheus_remote_write"
  address = "0.0.0.0:<port>"
[sources.prometheus_remote_receive.auth]
  username = "${PROM_REMOTE_WRITE_USERNAME}"
  password = "${PROM_REMOTE_WRITE_PASSWORD}"
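[sources.prometheus_remote_receive.tls]
  enabled = true
  # Certificate trusted for verifying connecting clients
  ca_file = "/opt/vector/tls/client_cert.crt"
  # Certificate and key this server presents
  crt_file = "/opt/vector/tls/server_cert.crt"
  key_file = "/opt/vector/tls/server_key.key"
  # Reject clients that do not present a certificate trusted via ca_file
  verify_certificate = true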

[sinks.prometheus_remote_write]
  type = "prometheus_remote_write"
  inputs = [ "prometheus_remote_receive"]
  endpoint = "<local prometheus remote write host>/api/v1/write"
[sinks.prometheus_remote_write.auth]
  strategy = "basic"
  user = "${PROM_REMOTE_WRITE_USERNAME}"
  password = "${PROM_REMOTE_WRITE_PASSWORD}"
[sinks.prometheus_remote_write.buffer]
  type = "disk"
  when_full = "drop_newest"
  max_size = 268435488


This completes your end-to-end collection of metrics from distributed sources and secures it beyond the basic auth strategy. You can use this with any cloud setup, or with a hybrid setup, to have a unified observability platform based on Grafana. Vector is a powerful tool, and its Prometheus sink supports multiple auth strategies, including one for AWS if you are using their managed option. I hope this helps you think through how to collect metrics from multiple hosts and clouds and solve problems for distributed systems.

There are many more things to account for when deploying this in production, including but not limited to an HA deployment strategy for Prometheus, Grafana and Vector, which are beyond the scope of this blog post.

You should also consider managed solutions such as Grafana Cloud or the AWS managed services for Prometheus and Grafana. Your Vector nodes can push data to them, and you have one less thing to worry about when you are stretched for resources while adopting these tools.

Top comments (1)

Hitesh Pathak

Hey really liked this article. I am quite new to monitoring, and I am trying to set it up for a webapp. I have been looking into securing a local prometheus + remote grafana (grafana cloud) set up.

I have a few questions:

  • Is it possible to secure this set up so that the data sent to grafana cloud is encrypted in transit (that you achieve by using TLS here)

  • Grafana cloud docs use basic auth which is not good enough

  • And the Prometheus docs are quite confusing, I can't understand whether it's supported or not

I currently do not have the resources to have another server to deploy a setup like yours, do you have any pointers as to how I can solve this problem...