When you build something and tweak it until it covers every scenario you know about, how/when/how much do you test it for failure?
Let's say that I build a monitor to watch the ratio between two specific metrics, and I want it to alert me when that ratio drops below 0.8, rather than sitting at 1 (indicating no issue) or 0.9 (indicating the system might be righting itself, e.g. an autoscaling host being terminated because it's no longer needed).
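To make the rule concrete, here's a minimal Python sketch. The 0.8 and 0.9 thresholds are the real ones from above; the function name, status strings, and zero-denominator handling are hypothetical placeholders, not our actual monitoring code:

```python
ALERT_THRESHOLD = 0.8  # page someone below this
WATCH_THRESHOLD = 0.9  # below 1, but plausibly righting itself (e.g. autoscaling)

def classify_ratio(numerator: float, denominator: float) -> str:
    """Classify one observation of the two metrics (hypothetical sketch)."""
    if denominator == 0:
        return "alert"  # assumption: a vanished denominator is itself alarming
    ratio = numerator / denominator
    if ratio < ALERT_THRESHOLD:
        return "alert"  # genuine problem
    if ratio < WATCH_THRESHOLD:
        return "watch"  # may be self-correcting; don't page yet
    return "ok"         # at or near 1: no issue
```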
I've built this monitor and tweaked the thresholds based on historical examples of:
1. Times when we wanted to be alerted, based on what was going on and what the ratio looked like at the time
2. Times when we expected the ratio not to be 1 but didn't need an alert, because we had scheduled a change during that period (a rough backtest of both cases is sketched below)
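Concretely, checking against those historical examples amounted to something like this rough backtest. The `Sample` layout and the labels ("should_alert" for #1, "scheduled_change" for #2) are hypothetical stand-ins, not our real data model:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    ratio: float
    label: str  # "should_alert" (#1), "scheduled_change" (#2), or "normal"

def backtest(samples: list[Sample], threshold: float = 0.8) -> dict[str, int]:
    """Count how often the threshold disagrees with the historical labels."""
    missed = sum(1 for s in samples if s.label == "should_alert" and s.ratio >= threshold)
    false_alarms = sum(1 for s in samples if s.label != "should_alert" and s.ratio < threshold)
    return {"missed_alerts": missed, "false_alarms": false_alarms}

history = [Sample(0.72, "should_alert"), Sample(0.85, "scheduled_change")]
print(backtest(history))  # {'missed_alerts': 0, 'false_alarms': 0} on this toy data
```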
I've researched this, tested it, and even run some new example tests of #1 and #2. Based on everything I've tested thus far, the new monitor I've built satisfies everything: it would have alerted at all of the times when we wanted it to, and would have ignored all of the times when we wanted it to ignore the metric ratio. I present the results of my testing, my research, my reasoning, and the monitor itself to my manager, who asks me to guarantee that the monitor is perfect and will never need to be tweaked.
Remember, I have:
- tested different metrics, and combinations/ratios of them, to find the optimal way to monitor these scenarios
- tweaked my thresholds to match the times when I do and do NOT want the monitor to alert us (a threshold sweep like the one sketched below)
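That threshold tweaking can be pictured as an explicit sweep: keep only candidates that miss no wanted alerts, then minimize false alarms on the periods we wanted ignored. The candidate values, labels, and toy data below are made up for illustration:

```python
def best_threshold(samples, candidates=(0.70, 0.75, 0.80, 0.85, 0.90)):
    """samples: (ratio, label) pairs; labels as in the backtest sketch above."""
    viable = []
    for t in candidates:
        missed = sum(1 for ratio, label in samples
                     if label == "should_alert" and ratio >= t)
        false_alarms = sum(1 for ratio, label in samples
                           if label != "should_alert" and ratio < t)
        if missed == 0:  # never trade away a wanted alert
            viable.append((false_alarms, t))
    return min(viable)[1] if viable else None  # None: no candidate catches everything

print(best_threshold([(0.72, "should_alert"), (0.85, "scheduled_change")]))  # 0.75
```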
Testing for unknown unknowns is always difficult. In this case, I'm being asked to make a monitor that is completely perfect and will not need to be tweaked in the future, even if our infrastructure changes.
Can/should this be done? How/why/why not?