punch

Testing For Success vs. Failure

When you build something and tweak it to satisfy all of the scenarios it can cover, how/when/how much do you test it for failure?

Let's say that I build a monitor to watch the ratio between two specific metrics, and I want it to alert me when that ratio drops below 0.8, rather than at 1 (indicating that there is no issue) or 0.9 (indicating that something might be righting itself, e.g. an autoscaling host being killed off because it's no longer needed).
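
To make the rule concrete, here's a rough Python sketch (the real monitor lives in our monitoring tool; the function name and the zero-denominator handling here are just illustrative placeholders):

```python
# A rough sketch of the alert rule; the function and input handling are
# illustrative placeholders, only the thresholds come from the scenario.

ALERT_THRESHOLD = 0.8  # alert when the ratio drops below this

def check_ratio(numerator: float, denominator: float) -> bool:
    """Return True when the metric ratio warrants an alert."""
    if denominator == 0:
        # No baseline signal at all; treating that as alert-worthy is an
        # assumption in this sketch, not part of the tuned rule.
        return True
    ratio = numerator / denominator
    # ~1.0 means healthy; ~0.9 can be the system righting itself
    # (e.g. autoscaling retiring a host), so only a drop below the
    # tuned threshold should page anyone.
    return ratio < ALERT_THRESHOLD

print(check_ratio(72, 100))  # True:  0.72 is below the threshold
print(check_ratio(90, 100))  # False: 0.9 is a tolerable dip
```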

I've built this monitor and tweaked the thresholds based on historical examples of the following (replayed in the sketch after the list):

  1. Times when we wanted to be alerted, based on what was going on, and what the ratio looked like at that time
  2. Times when we expect the ratio to not be 1, but we don't need to alert, as we have scheduled a change during that time period
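
In spirit, the tuning worked like a small replay suite over those historical cases. The sample ratios and descriptions below are invented for illustration; a real run would load recorded metric data instead:

```python
# A sketch of replaying historical cases against a candidate threshold.
# The sample data is invented; real runs would use recorded ratios.

from dataclasses import dataclass

@dataclass
class HistoricalCase:
    description: str
    ratio: float
    should_alert: bool  # ground truth from the incident history

CASES = [
    HistoricalCase("incident we wanted to be alerted on", 0.72, True),
    HistoricalCase("scheduled change window, expected dip", 0.9, False),
    HistoricalCase("steady state, no issue", 1.0, False),
]

def backtest(threshold: float = 0.8) -> None:
    for case in CASES:
        alerted = case.ratio < threshold
        verdict = "OK" if alerted == case.should_alert else "MISMATCH"
        print(f"{verdict}: {case.description} (ratio={case.ratio})")

backtest()
```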

I've researched this, tested it, and even run some new example tests of #1 and #2. Based on everything I've tested thus far, the new monitor I've built satisfies everything: it would have alerted at all of the times when we wanted it to, and would have ignored all of the times we wanted it to ignore the metric ratio. I present the results of my testing, my research, my reasoning, and the monitor to my manager, who says:

"You need to come up with an example of where this monitor fails."

Is he right?

Remember, I have:

  • tested different metrics, and combinations/ratios of them, to find the optimal way to monitor these scenarios
  • tweaked my thresholds to match when I do and do NOT want the monitor to alert us

Testing for unknown unknowns is always difficult. In this case, I'm being asked to make a monitor that is completely perfect, and will not need to be tweaked in the future even if our infrastructure changes.

Can/should this be done? How/why/why not?

Top comments (2)

ItsASine (Kayla)

will not need to be tweaked in the future even if our infrastructure changes

This alone is an obvious flag that your requirements can never be met. It's not unreasonable that a case like

Times when we expect the ratio to not be 1, but we don't need to alert, as we have scheduled a change during that time period

will happen again in the future, but if you add more stuff to the infrastructure, your metrics will be down for longer and thus trip the current threshold.


Ignoring the unreasonable demands:

You tested

  1. An alert you wanted alerted (.8)
  2. An alert you didn't want alerted (.9)
  3. No alert (1)

and used existing data points to verify them. The monitor "failing" would either be an alert when you don't want one (which you covered) or no alert when you need one; aside from guaranteeing the monitor will never crash, I have no clue what he expects. Through your testing, you found the sweet spot where, as long as the monitor is running, you'll get the expected results based on prior knowledge.
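
If he literally wants a case where it fails, the only concrete candidates I can think of are edge inputs that the prior data never hit. A hypothetical sketch (the predicate and values are invented):

```python
# Probing inputs the historical data presumably never exercised.
# The predicate and edge values here are invented for illustration.

def should_alert(numerator: float, denominator: float,
                 threshold: float = 0.8) -> bool:
    return numerator / denominator < threshold

# Boundary: a ratio of exactly 0.8 does NOT alert, since the spec says
# "drops below 0.8". Worth confirming that's intended.
assert should_alert(0.8, 1.0) is False

# Degenerate input: a zero denominator crashes this naive predicate,
# i.e. the "monitor never crashes" class of failure.
try:
    should_alert(1.0, 0.0)
except ZeroDivisionError:
    print("unguarded division is one concrete 'failure' example")
```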

punch

I've had a lot of problems with this manager; they've gone on for so long that, at some point, I start to wonder whether I'm actually getting things wrong and it's not just him negating everything I do, so I appreciate the assessment.

Also, to clarify, the "will not need to be tweaked in the future" demand was not explicit, but implied. The request was "find a case where it doesn't work, and fix that", but with no indication of:

  1. whether, if I find a single case and fix it, it's done
  2. whether or not this is just an endless whack-a-mole problem

I am glad that I'm not the only one who finds this vague request to be unreasonable :)