When you run multiple EC2 instances or a shared environment, it is useful to monitor how much CPU and memory are being used over time.
For example, if there's a lot of processing happening and CPU usage is high:
- there could be something strange going on
- ...or it could be an expected amount of CPU usage depending on the workloads of the compute nodes
If memory usage is unexpectedly high:
- it could be a memory leak
- ...or a new process launched that is consuming more resources
If more users log in and start various programs, there could be more processes running, or more CPU or memory usage at particular times of the day.
It can be hard to tell what's actually happening unless you SSH into the instance and check the CPU/memory usage by hand. proc-watch keeps track of:
- the number of processes that are currently running
- the process that is using the most memory
- the process that is using the most CPU
- the highest memory usage of any single process
- the highest CPU usage of any single process
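
As a rough illustration of how those five values relate to each other, here is a minimal Python sketch that reduces a list of per-process samples to the numbers in the list above. The `ProcSample` type, the `summarize` function, and the sample data are all made up for this example; this is not proc-watch's actual code, which may gather and name things differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcSample:
    """One process at one point in time (hypothetical shape for this sketch)."""
    name: str
    cpu_percent: float
    mem_percent: float


def summarize(samples):
    """Reduce per-process samples to the five values from the list above."""
    if not samples:
        # Nothing running (or nothing readable): report an empty summary.
        return {
            "process_count": 0,
            "top_cpu_process": None,
            "top_mem_process": None,
            "max_cpu_percent": 0.0,
            "max_mem_percent": 0.0,
        }
    top_cpu = max(samples, key=lambda s: s.cpu_percent)
    top_mem = max(samples, key=lambda s: s.mem_percent)
    return {
        "process_count": len(samples),
        "top_cpu_process": top_cpu.name,
        "top_mem_process": top_mem.name,
        "max_cpu_percent": top_cpu.cpu_percent,
        "max_mem_percent": top_mem.mem_percent,
    }


# Made-up sample data: a mostly idle box with one busy browser process.
samples = [
    ProcSample("sshd", 0.1, 0.3),
    ProcSample("chrome", 87.5, 12.4),
    ProcSample("postgres", 3.2, 21.0),
]
summary = summarize(samples)
print(summary)
```

Note that the process using the most CPU and the process using the most memory can be different processes, which is why both names are tracked separately.
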
proc-watch exports these metrics to Prometheus so that you can monitor how they change over time.
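
To give a sense of what exporting to Prometheus means in practice: Prometheus scrapes metrics over HTTP in a plain-text format, with one `name{labels} value` line per sample. The sketch below renders a gauge in that format by hand. The metric name `procwatch_max_cpu_percent` and the `process` label are assumptions for illustration, not names taken from proc-watch, and a real exporter would more likely rely on a Prometheus client library than build the text itself.

```python
def escape_label(value):
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    `samples` is a list of (labels, value) pairs, where `labels` is a dict
    of label names to string values (possibly empty).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            body = ",".join(
                f'{key}="{escape_label(val)}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric name and label, chosen only for this example.
output = render_gauge(
    "procwatch_max_cpu_percent",
    "Highest CPU usage of any single process",
    [({"process": "chrome"}, 87.5)],
)
print(output, end="")
```

The last line printed is `procwatch_max_cpu_percent{process="chrome"} 87.5`. Putting the process name in a label rather than in the metric name means one time series per process name, which is what lets a graph show which process was responsible for a spike.
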
Here's how it would be useful:
- you will be able to see that the system has no load when the highest per-process CPU or memory usage barely hovers above 0.1%
- when a resource-intensive process is started, you will be able to see that, a-ha! It's headless Chrome starting an automated test run!
Check out the code here: https://github.com/rudolfolah/proc-watch