Monitoring and alerting are critical for the operations and cost management of business applications.
As part of the Operations team, we were required to create a mechanism that sends automated alerts in two situations: when the number of pod replicas for a business-critical deployment drops below the expected number (or a pod is not available at all), and when the number of pod replicas rises above the expected number, which points to unusual traffic.
Cloud Monitoring has no built-in metric that reports the number of replicas of a Kubernetes Pod. In this blog, we will look at how to create an alerting policy based on the pod replica count.
Containers are encapsulated in a Pod, so we can use the container Uptime metric to determine the number of pod replicas, using the configuration described below.
We have used the resource label container_name along with the cluster name in the filter; you may need additional filters, such as namespace, to identify a container uniquely. The aggregation function 'count' reduces the multiple time series to a single value, giving us the number of containers that are up at a given time.
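The filter described above can be sketched as a small helper that builds a Cloud Monitoring filter string. This is a minimal sketch, not the exact configuration used here: the function name `build_uptime_filter` and the example values are hypothetical, and it assumes the metric type `kubernetes.io/container/uptime` on the `k8s_container` resource with the labels `cluster_name`, `container_name`, and `namespace_name`.

```python
def build_uptime_filter(cluster_name, container_name, namespace=None):
    """Build a monitoring filter that selects the uptime series of one container.

    Assumption: the metric type and resource label names below match the
    GKE system metrics available in your project.
    """
    clauses = [
        'metric.type = "kubernetes.io/container/uptime"',
        'resource.type = "k8s_container"',
        f'resource.labels.cluster_name = "{cluster_name}"',
        f'resource.labels.container_name = "{container_name}"',
    ]
    # The namespace clause is optional; add it when the same container name
    # exists in more than one namespace.
    if namespace is not None:
        clauses.append(f'resource.labels.namespace_name = "{namespace}"')
    return " AND ".join(clauses)


# Hypothetical cluster, container, and namespace names.
print(build_uptime_filter("prod-cluster", "checkout", namespace="payments"))
```

Each pod replica that runs the container contributes one time series matching this filter, which is what makes counting the series equivalent to counting the replicas.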
Here, sum is used as the aligner, but any of the available aligners can be used: we only count the time series, so the aligned value of each individual series does not matter.
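To see why the choice of aligner does not affect the result, here is a small pure-Python simulation. It is only an illustration of the idea, not Cloud Monitoring's implementation: the pod names, uptime values, and the helper `count_series` are made up for the example.

```python
from statistics import mean


def count_series(series_by_pod, aligner):
    """Align each series to one value, then count the series that have a value.

    A series with no points in the window produces no aligned value, which
    mirrors a container that is not up and therefore reports nothing.
    """
    aligned = {
        pod: aligner(points)
        for pod, points in series_by_pod.items()
        if points
    }
    return len(aligned)


# Hypothetical uptime samples (seconds) for three replicas in one window.
window = {
    "web-7d9f-abcde": [120.0, 180.0],
    "web-7d9f-fghij": [60.0, 120.0],
    "web-7d9f-klmno": [30.0],
}

for aligner in (sum, mean, max):
    # The aligned values differ per aligner, but the count stays the same.
    print(aligner.__name__, count_series(window, aligner))
```

Every aligner yields a count of 3 here, because the count depends only on how many series exist in the window, not on their values.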
There is no record for a container when it is not up, so if every replica is down the count is missing rather than zero. We therefore need an additional condition that triggers an alert if the container is absent from the metric output.
An alert is triggered when any of the above conditions is met.
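Put together, the policy could look roughly like the structure below, written as a Python dictionary that follows the JSON shape of a Cloud Monitoring alerting policy. Treat it as a sketch under assumptions: the function name `build_replica_alert_policy`, the display names, the 60-second alignment period, and the 300-second durations are illustrative choices, not the values used in our setup.

```python
import json


def build_replica_alert_policy(metric_filter, expected_replicas):
    """Return an alerting policy body with three conditions combined by OR."""
    # Count the matching uptime series: one series per running replica.
    aggregation = {
        "alignmentPeriod": "60s",          # assumed window size
        "perSeriesAligner": "ALIGN_SUM",
        "crossSeriesReducer": "REDUCE_COUNT",
    }

    def threshold(name, comparison):
        return {
            "displayName": name,
            "conditionThreshold": {
                "filter": metric_filter,
                "aggregations": [aggregation],
                "comparison": comparison,
                "thresholdValue": expected_replicas,
                "duration": "300s",        # assumed duration
            },
        }

    return {
        "displayName": "Pod replica count out of range",
        "combiner": "OR",                  # any single condition fires the alert
        "conditions": [
            threshold("Replicas below expected", "COMPARISON_LT"),
            threshold("Replicas above expected", "COMPARISON_GT"),
            {
                # Fires when no series matches at all, i.e. no container is up.
                "displayName": "Container missing from metric output",
                "conditionAbsent": {
                    "filter": metric_filter,
                    "aggregations": [aggregation],
                    "duration": "300s",    # assumed duration
                },
            },
        ],
    }


# Hypothetical filter and replica count.
example_filter = (
    'metric.type = "kubernetes.io/container/uptime" AND '
    'resource.type = "k8s_container" AND '
    'resource.labels.container_name = "checkout"'
)
print(json.dumps(build_replica_alert_policy(example_filter, 3), indent=2))
```

The absence condition covers the case the two threshold conditions cannot see: when no replica is running, there is no data point to compare against the threshold.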