As a DevOps engineer, you should be familiar with several metrics to effectively monitor and maintain the performance and reliability of a system.
Here are the 10 most important metrics you must know:
1. Availability:
This is a measure of the proportion of time that a system is operational. It can be estimated from the MTBF and MTTR as MTBF / (MTBF + MTTR).
2. Mean Time Between Failures (MTBF):
This is a measure of the average time that a system operates between one failure and the next.
3. Mean Time To Repair (MTTR):
This is a measure of the average time it takes to repair a system after it has failed.
4. Error rate:
This is a measure of how often requests fail, typically expressed as the percentage of total requests that result in an error.
5. Throughput:
This is a measure of the amount of work that a system handles per unit of time, typically expressed in requests per second.
6. Latency:
This is a measure of the time it takes for a request to be processed by a system.
7. CPU utilization:
This is a measure of the amount of CPU resources that are being used by a system.
8. Memory usage:
This is a measure of the amount of memory that is being used by a system.
9. Disk I/O:
This is a measure of the amount of data being read from and written to disk by a system.
10. Network I/O:
This is a measure of the amount of data being transferred over a network by a system.
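To make the first six definitions concrete, here is a minimal Python sketch that derives them from raw observations. It is only an illustration under stated assumptions: the names `Incident`, `reliability_metrics` and `request_metrics` are hypothetical, the sample numbers are made up, and in practice your monitoring stack would normally compute these values for you.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Incident:
    """One outage (hypothetical record). Times are seconds from the window start."""
    failed_at: float
    restored_at: float


def reliability_metrics(incidents, window_seconds):
    """Return (mtbf, mttr, availability) for one observation window."""
    if not incidents:
        # No failures observed: MTBF and MTTR are undefined, availability is 100%.
        return None, None, 1.0
    downtime = sum(i.restored_at - i.failed_at for i in incidents)
    uptime = window_seconds - downtime
    mtbf = uptime / len(incidents)    # average operating time per failure
    mttr = downtime / len(incidents)  # average repair time per failure
    availability = mtbf / (mtbf + mttr)  # same value as uptime / window_seconds
    return mtbf, mttr, availability


def request_metrics(requests, window_seconds):
    """Summarise (latency_seconds, succeeded) tuples observed in the window."""
    if not requests:
        raise ValueError("no requests observed in this window")
    total = len(requests)
    errors = sum(1 for _, succeeded in requests if not succeeded)
    return {
        "error_rate_pct": 100.0 * errors / total,   # failed share of all requests
        "throughput_rps": total / window_seconds,   # requests handled per second
        "latency_median_s": median(latency for latency, _ in requests),
    }


# Illustrative data: two 15-minute outages in one day (86,400 seconds).
outages = [Incident(3600, 4500), Incident(40000, 40900)]
mtbf, mttr, availability = reliability_metrics(outages, 86400)
print(f"MTBF={mtbf:.0f}s MTTR={mttr:.0f}s availability={availability:.2%}")

# Illustrative data: four requests seen in a 2-second window, one of them failed.
sample = [(0.120, True), (0.080, True), (0.200, False), (0.100, True)]
print(request_metrics(sample, 2.0))
```

The median is used for latency here only because it is simple. Many teams also track high percentiles, since a few slow requests can hide behind a healthy-looking average.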
Which key metrics you monitor depends on your company's specific challenges and needs. I hope this thread has been helpful in identifying the essential metrics.
Thanks for reading this.
Top comments (2)
Amen brother!
When I first started working as a DBA back in 1996, the one thing I prized above all was metrics. People thought I was some sort of loon as I would bang on about metrics all the time, and the dev team I was working with laughed at me. However, after every incident meeting I was able to show why something failed and the exact point at which it failed.
Two to three years later they were pumping metrics into every app they could, as they had got fed up with trying to guess why and when something failed. The ones who hated it left. Productivity rose, incidents became less frequent and delivery times dropped. The best part was that we could measure it all, plus the devs who did it got to learn how to code frontends to track and display the metrics they collected.
The key is always the right metrics. Don't gather everything: get what you need, get it into a useful format and use it. Don't hoard the numbers and never use them.
Absolutely. Using the correct metric for a problem is the key. Thanks for your valuable thoughts.