I recently had to write a proposal for our Kubernetes monitoring solution. Kubernetes is a tricky beast to monitor, especially with Prometheus, because there is so much information available. I was daunted by the amount of effort and understanding I'd need just to begin the task, and it took me a few days to actually open the ticket.
After reading through a lot of articles on Kubernetes monitoring, I knew the answer to my question wasn't going to be a simple one. As I stared at the space above my monitor, pondering the question, a thought struck me. It was simple: what if I just wrote down all the questions I needed my monitoring to answer, and then collected the metrics that answer them? In every job I've ever had, it's always been "monitor x, y, and z", but never "tell me when this isn't working properly."
I started off my question chain as follows:
- Can my pods be scheduled?
- Are there nodes available to schedule pods?
- Are there resources available on my nodes?
- Are my pods running as expected?
If I can answer each of these questions in the affirmative, then I can generally expect that a service is running correctly. Likewise, if I ask questions for each layer (hardware, control plane, nodes, services, ingress, etc.) and get answers in the negative, I can see how failures cascade through the dependencies on my compute, which lets me advise customers and mitigate an outage as much as possible.
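To make the question chain concrete, here is a minimal Python sketch of walking it in dependency order and reporting the first "no". The `Question` class, the `first_failure` helper, the layer names, and the hard-coded answers are all hypothetical and only for illustration; in practice each check would presumably run a query against the monitoring system. The ordering (nodes before scheduling before running pods) is my assumption about which layer depends on which.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Question:
    layer: str                  # hypothetical layer label, e.g. "nodes"
    text: str                   # the question as a human would ask it
    check: Callable[[], bool]   # returns True when the answer is "yes"


def first_failure(chain: List[Question]) -> Optional[Question]:
    """Walk the chain in order and return the first question answered 'no'."""
    for question in chain:
        if not question.check():
            return question
    return None


# Hard-coded answers stand in for real checks (assumption for illustration).
chain = [
    Question("nodes", "Are there nodes available to schedule pods?", lambda: True),
    Question("nodes", "Are there resources available on my nodes?", lambda: False),
    Question("scheduling", "Can my pods be scheduled?", lambda: True),
    Question("services", "Are my pods running as expected?", lambda: True),
]

failed = first_failure(chain)
if failed is None:
    summary = "all questions answered yes"
else:
    summary = f"{failed.layer}: {failed.text}"
print(summary)
```

Because the chain is ordered from the lowest layer up, the first failing question points at the most likely root cause, and everything after it in the chain is a candidate for being affected.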
Actually getting the metrics to answer these questions was super easy. I just opened up the Kubernetes Dashboard and noted the places I would check if I were fixing the issues manually. Then I created Prometheus alerts for them. It's that easy.
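As a rough illustration of turning questions into alerts, the Python sketch below pairs each question with a PromQL expression and emits a Prometheus rule group. It assumes kube-state-metrics is being scraped (the post does not say which exporter was used), and the alert names, thresholds, `for` duration, and severity label are placeholders I made up, not the rules from my actual proposal.

```python
import json

# Each entry: alert name -> (question, PromQL that returns data when the answer is "no").
# Metric names assume kube-state-metrics; thresholds are illustrative guesses.
QUESTIONS = {
    "PodsUnschedulable": (
        "Can my pods be scheduled?",
        "sum(kube_pod_status_unschedulable) > 0",
    ),
    "NoReadyNodes": (
        "Are there nodes available to schedule pods?",
        'sum(kube_node_status_condition{condition="Ready",status="true"}) < 1',
    ),
    "NodeCpuRequestsHigh": (
        "Are there resources available on my nodes?",
        'sum(kube_pod_container_resource_requests{resource="cpu"})'
        ' / sum(kube_node_status_allocatable{resource="cpu"}) > 0.9',
    ),
    "PodsNotRunning": (
        "Are my pods running as expected?",
        'sum by (namespace) (kube_pod_status_phase{phase=~"Pending|Failed|Unknown"}) > 0',
    ),
}


def build_rule_group(name: str, questions: dict) -> dict:
    """Build a dict shaped like a Prometheus rule file with one group."""
    rules = []
    for alert, (question, expr) in questions.items():
        rules.append({
            "alert": alert,
            "expr": expr,
            "for": "5m",                          # placeholder duration
            "labels": {"severity": "page"},       # placeholder label
            "annotations": {"summary": f"Answer is 'no' to: {question}"},
        })
    return {"groups": [{"name": name, "rules": rules}]}


rule_file = build_rule_group("kubernetes-questions", QUESTIONS)
print(json.dumps(rule_file, indent=2))
```

Prometheus rule files are YAML, and since JSON is a subset of YAML the printed output should be usable as one, though it is worth validating it with `promtool check rules` before relying on it. Keeping the question in the alert's summary means whoever gets paged sees which question failed, not just a metric name.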
Have a good process, and good results will follow.