In addition to the typical metrics you may think of as part of a service (CPU, instance count, disk, etc.), there is another class of metrics that tells you about the potential reliability of your service.
These are MTTF, MTTR, MTTD, and MTBF: Mean Time To Failure, Mean Time To Repair, Mean Time To Detection, and Mean Time Between Failures.
These are all metrics that cannot be observed directly. That is, you cannot take a single data point on a graph and say “this is our MTTF”; each of them takes at least two data points and must be computed.
Further, you need to decide over what timeline you’ll compute them. Over the last year? Six months?
You may have seen a variety of acronyms associated with these metrics; here are some that you’ll encounter:
MTTF - Mean Time To Failure. This is the average of how long the system stays up between failures. Since it’s of course up in between failures, this is often just “uptime” averaged over a period. This metric comes from reliability engineering and is intended for systems and components that can’t be repaired and instead are just replaced.
MTTR - Mean Time To Repair. This is the average of how long it takes for things to come back up once they are down. This time period represents all the work of repairing that component of the system.
MTTD - Mean Time To Detection. This is the average of how long it takes to realize something is down. So, for example, if something went down at 1200 but no one noticed or was alerted until 1210, the time to detection was 10 minutes. Over multiple incidents, you can average those data points (there’s a rough sketch of this below).
MTBF - Mean Time Between Failures. Similar to MTTF, but for repairable items.
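To make the “computed, not observed” point concrete, here’s a minimal sketch of how you might derive these numbers from a window of incident records. This is Python with made-up data and hypothetical field choices (which timestamps count as “started,” “detected,” and “resolved” is itself a decision your team has to make); it’s a sketch, not a prescribed implementation.

```python
from datetime import datetime, timedelta

# Hypothetical incident records for a chosen window (say, the last six months).
# In practice these timestamps would come from your incident tracker.
incidents = [
    # (started,                    detected,                      resolved)
    (datetime(2024, 1, 5, 12, 0),  datetime(2024, 1, 5, 12, 10),  datetime(2024, 1, 5, 13, 0)),
    (datetime(2024, 2, 20, 3, 30), datetime(2024, 2, 20, 3, 45),  datetime(2024, 2, 20, 5, 0)),
    (datetime(2024, 3, 14, 18, 0), datetime(2024, 3, 14, 18, 5),  datetime(2024, 3, 14, 18, 40)),
]

def mean(deltas):
    """Average a list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)

# MTTD: average of (detected - started) across incidents.
mttd = mean([detected - started for started, detected, _ in incidents])

# MTTR: average of (resolved - started), i.e. the full span from going down
# to coming back up, covering all the work of the repair.
mttr = mean([resolved - started for started, _, resolved in incidents])

# MTBF: average gap between the starts of consecutive failures.
# Note it takes at least two incidents before this number exists at all.
starts = sorted(started for started, _, _ in incidents)
mtbf = mean([later - earlier for earlier, later in zip(starts, starts[1:])])

print(f"MTTD: {mttd}  MTTR: {mttr}  MTBF: {mtbf}")
```

Notice how much the result depends on choices the formulas hide: which window you pick, which timestamp counts as the “start,” and whether an unusually noisy month happens to fall inside the window.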
A warning about incident metrics
I primarily include these definitions so that you can be aware of what they are. It’s helpful to know these metrics exist, because you’ll often hear their use encouraged.
It’s also important to know that relying on these metrics can make you blind to some more important things.
Most of these metrics come from reliability engineering, not software engineering. That means the physical world. Even there, it can be argued that many of these metrics aren’t appropriate. If one motor started rusting and that led to failure, would you expect others to? Well, it depends on the conditions, doesn’t it?
When we talk about people and their behavior in complex situations such as incidents and outages, these metrics become less and less relevant.
Putting too much effort or thought into these metrics whispers the lie that all incidents are the same, and that if you can control some of these factors then you can improve your incident response.
The problem is that this isn’t true. At the very least it’s backwards: fixing many other things may help these metrics improve. At worst, focusing on them will keep you from ever asking the right questions, and keep you from getting the right answers.
So how do you begin improving the things that drive these metrics?
- Ask questions
- Understand that these metrics will never tell you the truth.
You can lay groundwork in a similar manner to other disaster planning: things happen that you don’t expect, and all you can do is be well prepared for that.
- Plan for what to do when a team member doesn’t know.
- Plan for what to do when things are unknowable.
- Give your team outlets to talk to you about the process.
Focus on things you can control, like how soon you can detect an incident. Then ask questions about that number.
Questions you might want to answer about your incidents and your team:
- Is this an incident type we’ve seen before?
- Is this an incident type no one has seen before?
- Were docs available for this type of outage?
- Did those docs clearly outline correct action?
- How was the incident responder feeling?
- Overworked?
- Underslept?
- Is this the first incident they’ve dealt with today/tonight?
- The 50th?
- Did the incident responder have the resources they needed, and did they feel that they could use them?
- You may be surprised to learn that simply saying “you can do this,” such as “you can escalate” or “you can restart a service,” often isn’t enough.
- Especially if they’ve been yelled at before, or the culture makes them hesitant to pull that lever.
What do you think? Leave a Comment. Click here if you want to see more like this: https://thaiwood.io/DevTo