Outside of programming, my other IT interests are mostly around backups, reliability, and recovery from disasters. This arises partly from my background as a sysadmin, and partly from my work on backup software. In theory those are all things our industry should be good at, but every single place I have ever worked, and every open source project I’ve contributed to, has failings in this area. They’re mostly around single points of failure (who owns the domain for our project, for example, and what do we do if the person hosting our mailing list dies?) or lack of planning or documentation (do we know how to rebuild, from scratch, our fully redundant production environment that has evolved over time?).
All of this reliability stuff is about process, not programming, so I don’t get paid for it; I just do it, and my managers love me for it. I have found that it’s often more useful to look at what other industries do than to refer only to our own industry's literature on reliability and recovery. That's partly because it’s more interesting, which matters when you're thinking hard about it and coming up with suggestions outside your paid hours (you don’t come across things like “a catherine wheel of shit squirting out of the pipe” in CS textbooks), and partly because concrete examples where real things broke are more easily comprehensible to a lay audience - to managers who aren’t specialists.
In the past I’ve blogged elsewhere about Disastercast, a sadly defunct podcast about safety engineering whose backlog is still well worth listening to, and which has lots of lessons that can be applied to software systems' quality, reliability, and testing.
This evening, locked in my Plague Bunker with no theatres or cinemas to visit, I watched an online lecture about disasters in museums given by Natasha McEnroe, Keeper of Medicine at the Science Museum here in London - it’s one of her war stories (from an un-named museum) that had the catherine wheel of shit. She covered all the preparatory work that I would expect - risk registers, disaster plans, checklists - and even, very briefly, mentioned backups for digitised records. She also mentioned rehearsals (which I didn't expect, as they are so rarely carried out even though they should be commonplace), and how in her industry you can’t rehearse for much of what actually happens. The leaky pipe that started spewing shit into the gallery wasn’t something you could reasonably simulate in a rehearsal.
Instead, when disaster strikes, she said, you have to rely on staff innovating on the spot. That means staff need a broad base of knowledge about your operations, they need to know that they can innovate and will be supported by their management afterwards, and people need to know how to work together to quickly arrive at a working plan. Given that the shit might hit the fan at any time, it follows that even quite junior staff need that support and training. I think we can learn lessons from that. While some of our disasters can be easily planned for and rehearsed - you can rehearse bringing up a replacement call-centre, for example, or a new cluster of servers - we too have disasters we can’t rehearse, such as unexpectedly hitting a 32-bit limit in an id field or a counter … in fact, most bugs.
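To make the 32-bit limit concrete, here's a minimal sketch (not from any real system - the function name and the Python model of the field are mine) of how an id stored in a signed 32-bit column runs out of room at 2,147,483,647:

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2147483647: the largest value a signed 32-bit id field can hold

def next_id(current_id: int) -> int:
    """Simulate incrementing an id stored in a signed 32-bit field."""
    # ctypes.c_int32 reproduces 32-bit wraparound: INT32_MAX + 1 becomes -2147483648
    wrapped = ctypes.c_int32(current_id + 1).value
    if wrapped < current_id:
        raise OverflowError(f"id counter wrapped: {current_id} -> {wrapped}")
    return wrapped

print(next_id(100))             # business as usual
print(next_id(INT32_MAX - 1))   # the last increment that still fits
try:
    next_id(INT32_MAX)          # the unrehearsable disaster
except OverflowError as e:
    print("disaster:", e)
```

Everything works for years, and then one ordinary increment fails - which is exactly why this kind of failure gets discovered in production rather than in a rehearsal.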
So as well as all our documentation and backups and quick deployments and so on, I’m now going to add something else to my reliability checklist: does the organisation trust its staff, does it have their back, and do the staff know that?