First ever resilience test

#resilience #testing #chaosengineering

Last week's talk about team resilience brought back memories of my first ever automated resilience test.

Brace yourself for an old dev’s story from when lemons were the sweetest fruit available.

The scene

Between 2002 and 2006 I helped develop a pretty amazing piece of technology, internally codenamed TheBox:

TheBox in all its glory.

TheBox was designed to be the heart and brain of the house:

A wifi modem-router.
A firewall with parental control.
A multimedia player.
A TV internet browser.
A home surveillance system.
A home automation system, controlling the lights, blinds, thermostat and locks.

First production deployment

The first production deployment happened around 2005 in 100 houses as part of a new residential development somewhere in the south of Spain.

Deployment here means driving 800 km (500 miles), flashing 100 drives, unscrewing 100 x 8 screws, slotting in 100 flash drives, screwing 100 x 8 screws back in, carrying each box to its house and plugging it in, configuring the network, modem and home automation system, and driving those 800 km back.

The new homeowners were thrilled with their "futuristic" houses, our finance department (one person) was ecstatic about the first proper sales, and we, the dev team, were amazed that things actually worked.

All was happiness until ...

An electrical storm near the residential development caused a power outage in the area.

When the power was restored, the heart and brain of most houses (aka TheBox) failed to boot.

To "quickly" restore service, we did a redeployment.

See the previous side note.

We brought back a box to debug the issue and unfortunately the fix required yet another redeploy.

As the bug had demonstrated my lack of knowledge, a fix and some manual testing did not give me enough confidence to make yet another 1600 km (1000 miles) trip.

Back then I was already a test automation zealot, but how could I test that TheBox could survive a power outage?

The first automated resilience test

The solution had been sitting on my desk for two years:

An X-10 switch that can turn the power on and off in response to an X-10 command.

X-10 is an ancient home automation system that uses the existing household electrical wiring to communicate between devices.

And we were building software for home automation! So I had all the pieces to write my first ever resilience test:

And the pseudocode:

forever {
    send_command(X10.power_on)
    sleep(random(2 to 4 mins))
    send_command(X10.power_off)
    sleep(10 secs)
}
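For the curious, the loop above could look something like this in Python. This is only a sketch, not the code we ran back then: `send_command`, the `POWER_ON` and `POWER_OFF` names, and the `cycles`, `sleep` and `rng` parameters are all made up for illustration, and whatever actually talks to the X-10 switch is left for the caller to supply.

```python
import random
import time
from typing import Callable, Optional

# Hypothetical command names for illustration; these are not real X-10 codes.
POWER_ON = "power_on"
POWER_OFF = "power_off"


def power_cycle(
    send_command: Callable[[str], None],
    cycles: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> int:
    """Power the device on and off in a loop; return the number of cycles run.

    With cycles=None the loop runs forever, like the pseudocode.
    """
    rng = rng or random.Random()
    done = 0
    while cycles is None or done < cycles:
        send_command(POWER_ON)
        # Stay powered for a random 2 to 4 minutes so the cut lands
        # at an unpredictable point in the boot and update sequence.
        sleep(rng.uniform(120, 240))
        send_command(POWER_OFF)
        # Keep the power off for 10 seconds before the next cycle.
        sleep(10)
        done += 1
    return done


if __name__ == "__main__":
    # Dry run: print the commands instead of driving a switch, and skip the waits.
    power_cycle(print, cycles=2, sleep=lambda seconds: None)
```

Passing `sleep` and `send_command` in as parameters is just a convenience: it lets the loop itself be checked in milliseconds before leaving it running for days against real hardware.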

To give me a little more confidence, the power outage would happen while TheBox was trying to update its software. That made it more likely that the power would go out while TheBox was writing to disk, simulating a riskier scenario.

The test spent four days doing: