Memorious Prometheus

#eks #prometheus

Recently, we received an alert from Alertmanager, which is deployed with the kube-prometheus-stack Helm chart. The alert stated that 50% of the EKS endpoints for "apiserver/kubernetes" were down:

50% of the apiserver/kubernetes targets in the default namespace are down.

A brief look at Prometheus revealed four (!) targets in the serviceMonitor/monitoring/prometheus-operator-monito-apiserver/0 scrape pool: two were down and two were up. A check of our other clusters showed that each of them normally has only two targets.

It turned out that the EKS control plane had been updated during the night, and the apiserver endpoints had received new IP addresses. Prometheus picked up the new addresses but also kept the old ones in its target list, which is why there were four targets instead of two.
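One way to confirm this kind of drift is to compare the IPs currently behind the `kubernetes` Service with the IPs Prometheus is still scraping. The sketch below is only an illustration: the `stale_targets` helper is a hypothetical name, and the commented usage assumes that `jq` is installed, that Prometheus is port-forwarded to `localhost:9090`, and that the job label is `apiserver` (inferred from the alert text).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every scraped IP that is no longer a live endpoint.
#   $1: space-separated IPs currently behind the "kubernetes" Service
#   $2: whitespace-separated IPs that Prometheus lists as targets
stale_targets() {
  local live="$1" scraped="$2" ip
  for ip in $scraped; do
    case " $live " in
      *" $ip "*) ;;        # still a live endpoint, nothing to report
      *) echo "$ip" ;;     # not behind the Service any more: stale target
    esac
  done
}

# Example usage against a real cluster (assumptions: jq installed,
# Prometheus reachable on localhost:9090, job label "apiserver"):
#   live=$(kubectl get endpoints kubernetes -n default \
#            -o jsonpath='{.subsets[*].addresses[*].ip}')
#   scraped=$(curl -s http://localhost:9090/api/v1/targets \
#            | jq -r '.data.activeTargets[]
#                     | select(.labels.job == "apiserver")
#                     | .labels.instance | sub(":[0-9]+$"; "")')
#   stale_targets "$live" "$scraped"
```

If the function prints any addresses, Prometheus is still holding targets that the cluster no longer advertises.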

The solution was simple:

kubectl rollout restart statefulset prometheus-prometheus-operator-monito-prometheus -n monitoring

...and the old "down" targets disappeared, and the alert was resolved.
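If the restart is scripted, it may help to wait for the rollout to finish before checking the targets again. A minimal sketch, assuming the same StatefulSet and namespace as above; the `restart_prometheus` wrapper and the five-minute timeout are illustrative choices, not part of the original fix.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: restart a StatefulSet and block until it is ready again.
#   $1: StatefulSet name
#   $2: namespace
restart_prometheus() {
  local sts="$1" ns="$2"
  # Trigger the rolling restart, then wait for it to complete (or time out).
  kubectl rollout restart statefulset "$sts" -n "$ns" &&
    kubectl rollout status statefulset "$sts" -n "$ns" --timeout=5m
}

# Example usage (not run here):
#   restart_prometheus prometheus-prometheus-operator-monito-prometheus monitoring
```

The `&&` means the status check only runs if the restart request itself succeeded, so a failed restart is not masked by a passing status check.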