Black Swans

As former US Secretary of Defense Donald Rumsfeld famously noted, there are known knowns, known unknowns, and unknown unknowns. It's an unfortunate part of human nature that we ignore unknown unknowns when planning a course of action. Developers, being human, are guilty of this too.

In particular I'm talking about things like:

Oh, the value for income has to be an integer greater than zero because the GUI enforces this, so we can assume that in the client cache loading logic.

Except, of course, for the two-month period last year when null was allowed, and now none of those clients have loaded into the cache.

or

Oh, it's fine, I've looked at everything that calls the trade table and nothing actually uses the validator3type column, so it's fine to delete it.

Except, of course, for the trigger that exists only in production and one UAT database and hasn't been used in two years. And now the reporting team's report is broken.

Nassim Nicholas Taleb called these sorts of situations Black Swans, a reference to the fact that Europeans assumed all swans were white until they encountered black ones in Australia. In particular, he talks about how these unexpected events have an outsize impact. This is certainly true in coding: the most disruptive defects are often caused by something you weren't aware of rather than something you overlooked.

When these defects occur, I think the instinct of developers is to blame ourselves for not predicting the future. And sometimes, of course, it is our fault, and if there are lessons to be learned about better planning and analysis then we should definitely learn them. By the way, when I say planning I also mean testing: I'm assuming that everything you've planned for has some form of automated test, or at least a manual test that gets run on a regular basis.

However, as Taleb notes, we often retrospectively "predict" what happened and ascribe the defect solely to poor planning, as opposed to recognising the need to spend time preparing ourselves for unexpected outcomes. For instance, NASA is regularly held up as a paragon of defect-free coding and is famous for the rigor of its testing, but NASA also builds in lots of redundancy and trains its astronauts to deal with unexpected outcomes - during the descent to the lunar surface, Neil Armstrong noticed that the automated guidance system was sending the landing craft towards a crater filled with boulders and had to override it with seconds to spare.

So, what can developers do to mitigate the problem of Black Swans? Here are a few techniques that I've found helpful. Some fall into the pre-deployment phase and some into the post-deployment phase.

Avoid default behaviour

Our code should make explicit which states it expects. In a simple Java example, the following code has a clear problem:

    public static boolean isPositive(Integer i) {
        if (i != null && i > 0) {
            return true;
        }
        // null falls through to here and is silently treated
        // the same as zero or a negative number
        return false;
    }



A better way would be:

    public static boolean isPositive(Integer i) {
        if (i != null && i > 0) {
            return true;
        }
        if (i != null && i <= 0) {
            return false;
        }
        // every expected state has been handled explicitly above,
        // so anything that reaches this point is a surprise
        throw new IllegalArgumentException(i + " is not a valid value");
    }


Now, I know that in the above example you could put the null check first, and in practice that's what I'd do. However, the purpose of the second example is to demonstrate how we should explicitly check that the value is in one of a set number of states - in this case positive, negative or zero - and raise an exception if it isn't. The reason we can put the null check first is that Java ensures the input can only be null, positive, negative or zero, but in general our language won't be able to ensure this for us. This is particularly true when determining the state requires multiple fields.

Obviously, throwing an exception isn't going to be a general-purpose solution, and what you'd want to do in this circumstance will depend on the system you're working on. But at the very least you need to know about it, through a log statement or an alert of some kind. The thing you're trying to avoid is having something slip through as a default case that you hadn't planned for.
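
As a sketch of what the logging variant might look like - I'm assuming an SLF4J-style logger and a hypothetical ValueChecks class here; use whatever your system actually provides:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public final class ValueChecks {

        private static final Logger log = LoggerFactory.getLogger(ValueChecks.class);

        public static boolean isPositive(Integer i) {
            if (i != null && i > 0) {
                return true;
            }
            if (i != null && i <= 0) {
                return false;
            }
            // unexpected state: make it loud enough to trigger an alert,
            // then fall back to whatever is safest for your system
            log.error("Unexpected value {} passed to isPositive", i);
            return false;
        }
    }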

Code review

I know that code reviews have a reputation for getting bogged down in flame wars about topics like style and naming conventions. This is unfortunate, as they can provide great value if done correctly. One of their key contributions is that they help a developer overcome some of their blind spots.

A defect only gets through a properly run code review if all the developers involved share the same blind spot. The more developers you have, the more likely it is that somebody is aware of the potential Black Swan. For instance, maybe the other developer worked on or with one of the teams that consume your service and has more knowledge of how they use it.

Run queries in prod beforehand

If you've made assumptions in your testing, validate them in a production environment by running some queries (e.g. SQL, Splunk).
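
For example, to validate the income assumption from the introduction you could run something like this against a read-only production connection - the connection URL, table and column names here are all hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class AssumptionCheck {
        public static void main(String[] args) throws Exception {
            // read-only check: does "income is always a positive integer"
            // actually hold in the production data?
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://prod-db/clients",
                     "readonly_user", System.getenv("DB_PASSWORD"));
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT COUNT(*) FROM client WHERE income IS NULL OR income <= 0")) {
                rs.next();
                System.out.println("Rows violating the assumption: " + rs.getLong(1));
            }
        }
    }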

Logging and state persistence

There's nothing worse than knowing something has gone wrong but not being able to tell what. Make sure the logging and state persistence you've added allow you to piece together the full picture.
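
One sketch of this, again assuming SLF4J: attach an identifier to every log line via the MDC so a single client's journey can be reconstructed afterwards. The CacheLoader class below is hypothetical:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    public class CacheLoader {

        private static final Logger log = LoggerFactory.getLogger(CacheLoader.class);

        void load(String clientId, Integer income) {
            // every log line in this block is tagged with the client id
            MDC.put("clientId", clientId);
            try {
                log.info("Loading client, income={}", income);
                // ... loading logic ...
                log.info("Client loaded into cache");
            } catch (RuntimeException e) {
                log.error("Client failed to load", e);
                throw e;
            } finally {
                MDC.remove("clientId");
            }
        }
    }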

Shadow run or replay

The closer you can get to running your code in a live environment without it actually being live, the better. Shadowing - where the production data is sent through both the original code and the code under test in a live system - is one approach that those with a fairly advanced deployment setup can use. For a simpler, but still pretty effective, solution you can recreate the inputs to production from resources like logs and databases and replay them in a test environment.
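
A minimal sketch of the replay approach: extract one recorded production input per line from your logs or database, then feed each one through the entry point of the code under test. The handleRequest method and the file name are hypothetical:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    public class ReplayRunner {
        public static void main(String[] args) throws IOException {
            // each line of the extract is one recorded production input
            try (Stream<String> inputs = Files.lines(Path.of("prod-inputs.txt"))) {
                inputs.forEach(ReplayRunner::handleRequest);
            }
        }

        static void handleRequest(String input) {
            // ... the code path being exercised, against test-environment state ...
        }
    }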

Switches

Depending on your deployment processes, you may only be able to deploy at certain times, or the process may be labour-intensive. A useful technique here is to put your changes behind a switch that can be changed easily in a running system - for instance, a control panel for the application, or even a config table in the database. You can now turn your feature on or off at a time that carries less risk. Crucially, you can turn it off quickly if something doesn't look right.
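
Here's a minimal sketch of the config-table variant. The feature_config table and its columns are hypothetical, and a real implementation would cache the lookup rather than hit the database on every call:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    public class FeatureSwitch {

        private final DataSource dataSource;

        public FeatureSwitch(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        public boolean isEnabled(String featureName) {
            String sql = "SELECT enabled FROM feature_config WHERE feature_name = ?";
            try (Connection conn = dataSource.getConnection();
                 PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setString(1, featureName);
                try (ResultSet rs = stmt.executeQuery()) {
                    return rs.next() && rs.getBoolean("enabled");
                }
            } catch (SQLException e) {
                return false; // if in doubt, the new behaviour stays off
            }
        }
    }

Flipping the enabled flag in the table now turns the feature on or off without a redeploy.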

Only go live for a subset

An extension of the switch idea is to add granularity so that you only turn on the feature in certain circumstances - low-risk users or clients, for example.
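
Extending the hypothetical FeatureSwitch above, a per-client allowlist might look like this (the feature_config_client table is, again, made up):

    public boolean isEnabledFor(String featureName, String clientId) {
        // the allowlist table holds one row per enabled client
        String sql = "SELECT 1 FROM feature_config_client "
                   + "WHERE feature_name = ? AND client_id = ?";
        try (Connection conn = dataSource.getConnection();
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, featureName);
            stmt.setString(2, clientId);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next();
            }
        } catch (SQLException e) {
            return false; // again, default to off
        }
    }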

Have queries ready to go for go-live

Even if you can't find a defect before go-live, it's always best if you find it before anyone else does. It gives you a head start on finding a fix and managing the situation.

Datafixes in prod

This is definitely the place you don't want to be. But just because you don't want to be here doesn't mean you shouldn't have some sort of plan for what happens if you arrive. For instance, although you can use your switches to turn off the new functionality, you may still have database rows created by the defective process that need to be changed or deleted. The key thing to remember is that you should know what needs to be done after your fix. Do messages need to be resent or caches reloaded?

Fix Forward

This is the riskier cousin of just turning off the switch. For me, this comes down to how much faith you have in your CI process. I suppose it fits into the old 'Move fast and break things' mentality. The ability to do this definitely offers you much more flexibility, so it's (yet) another reason to improve your CI and testing capabilities.

You've probably noticed that I haven't included things like static analysis tools (e.g. FindBugs), as I think these mostly find defects that you should have been aware of, and I was trying to cover defects that hit you out of left field. However, they're good tools to use too.

Looking forward to hearing your thoughts and suggestions. What do you do to deal with unknown unknowns?
