It's pronounced Diane. I do data architecture, operations, and backend development. In my spare time I maintain Massive.js, a data mapper for Node.js and PostgreSQL.
Trevor Kletz devotes chapter 20 of What Went Wrong? Case Histories of Process Plant Disasters to "problems with computer control". It has some good general points:
Computer hardware is similar to other hardware. Once initial faults have been removed and before wear becomes significant, failure can be considered random and treated probabilistically. In contrast, failure of software is systemic. Once a fault is present, it will always produce the same result when the right conditions arise, wherever and whenever that piece of software is used.
(discussing a computer-enabled methanol spill) A thorough hazop [hazard and operability study] would have revealed that this error could have occurred. The control system could have been modified, or better still, separate lines could have been installed for the various different movements, thus greatly reducing the opportunities for error. The incident shows how easily errors in complex systems can be overlooked if the system is not thoroughly analyzed. In addition, it illustrates the paradox that we are very willing to spend money on complexity but are less willing to spend it on simplicity. Yet the simpler solution, independent lines (actually installed after the spillage), makes errors much less likely and may not be more expensive if lifetime costs are considered. Control systems need regular testing and maintenance, which roughly doubles their lifetime cost (even after discounting), while extra pipelines involve little extra operating cost.
Computers do not introduce new errors, but they can provide new opportunities for making old errors; they allow us to make more errors faster than ever before. Incidents will occur on any plant if we do not check readings from time to time or if instructions do not allow for foreseeable failures of equipment.
I'm a small business programmer. I love solving tough problems with Python and PHP. If you like what you're seeing, you should probably follow me here on dev.to and then checkout my blog.
The belief that all software errors are systemic appears to be outdated.
I read Embedded Software Development for Safety-Critical Systems by Chris Hobbs as part of my research for this post. He writes extensively about heisenbugs: bugs caused by subtle timing errors, memory corruption, and the like. In fact, he shares a simple 15-line C program in the book that crashes once or twice every few million times it's run.
In the multicore, out-of-order executing, distributed computing era, systems aren't nearly as deterministic as they used to be.
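For illustration (this is my own minimal sketch, not the program from Hobbs' book), here's the same class of bug in Python: an unsynchronized read-modify-write on a shared counter. Whether it actually loses updates on a given run depends on the interpreter version and the thread scheduler, which is exactly what makes heisenbugs so slippery.

```python
import threading

def count(n_threads=4, n=50_000, lock=None):
    """Increment a shared counter from several threads.

    Without the lock, `counter[0] += 1` is a read-modify-write that
    another thread can interleave with, so increments may be lost --
    sometimes, on some interpreters, under some scheduling. With the
    lock, the result is always exactly n_threads * n.
    """
    counter = [0]

    def work():
        for _ in range(n):
            if lock is not None:
                with lock:
                    counter[0] += 1
            else:
                counter[0] += 1  # unsynchronized read-modify-write

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

The locked variant is deterministic; the unlocked one can come up short, but not reproducibly, and a test suite that happens never to hit the bad interleaving will pass every time.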
In the multicore, out-of-order executing, distributed computing era, systems aren't nearly as deterministic as they used to be.
As Dian has noted, it's the complexity of such systems that produces apparently stochastic behaviour (with a little help from jitter: chronox.de/jent.html). As you mention in the article itself, that's why engineers often prefer to choose their own hardware, typically picking the simplest system that meets the processing needs and writing their own system software for it, or perhaps starting with a verified kernel (sigops.org/s/conferences/sosp/2009...) and building carefully on that.
I wonder how the safety experts feel about more nature-inspired, evolutionary-pressure approaches that use dynamic testing (fuzzing, simian army) to harden software against bad inputs, power failures, etc.? This fits with the modern security view that failure is inevitable: what matters is how the whole system behaves under continuous failure conditions, using newer properties of modern software deployment to 'roll with the punches' and carry on working: slideshare.net/sounilyu/distribute...
Disclaimer: I've not worked on safety-critical systems. The nearest I have been is satellite firmware (dev.to/phlash909/space-the-final-d...), which was important from a reputation and usefulness viewpoint and very much not fixable post-deployment :)
It's perfectly acceptable to go over and above the standards and do as much fuzz/dynamic/exploratory testing as you like. I don't think you would have much luck convincing regulators that it's a good substitute for MC/DC unit test coverage. But you could capture all the inputs that cause faults, fix the errors, and then add them to your official regression test suite.
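As a sketch of that capture-and-regress loop (all names here are hypothetical, and exhaustive enumeration over a tiny alphabet stands in for a seeded random generator):

```python
import itertools

def parse_record(s):
    """Toy parser with a seeded bug: raises IndexError when no comma is present."""
    return s.split(",")[1]

def fuzz(fn, inputs):
    """Run fn over every input, capturing each input that raises."""
    failures = []
    for s in inputs:
        try:
            fn(s)
        except Exception:
            failures.append(s)
    return failures

# Enumerate every string up to length 2 over a tiny alphabet; a seeded
# random generator would play the same role at realistic scale.
alphabet = "a,"
corpus = [""] + ["".join(p) for n in (1, 2)
                 for p in itertools.product(alphabet, repeat=n)]

# The crashing inputs become the official regression suite...
regression_suite = fuzz(parse_record, corpus)

# ...which the fixed parser must then pass cleanly.
def parse_record_fixed(s):
    parts = s.split(",")
    return parts[1] if len(parts) > 1 else ""
```

The point isn't the fuzzer itself but the last step: every fault-triggering input gets pinned down as a deterministic regression case that the fix has to clear, which is the part regulators can actually audit.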
Your SlideShare link appears to be broken. I'm curious to read what was there.
I've bookmarked your satellite project post and I'll read it when I get a minute. Writing code that either flies or runs in space is on my bucket list. I'm envious.
Ah, OK. Here's an InfoQ page on the topic that refers back to my favourite infosec speaker, Kelly Shortridge: infoq.com/news/2019/11/infosec-dev... The topic is Distributed, Immutable, Ephemeral (yep, DIE): using chaos engineering to defend information systems.
I get the envy reaction quite a bit :) - it was, however, plain luck that I was asked by a work colleague, an AMSAT member, to help out, and I ended up alongside another friend writing firmware for a tiny CPU going to space.
Thanks for the updated link. Interesting article. I don't think the details of the technique are exactly applicable to safety-critical systems. But I have read about how complex safety-critical systems with redundancy and failover are tested for their response to component failures, disagreement in voting architectures, power brownouts, missed deadlines, etc. I suppose it would all fall under the banner of chaos engineering.
I doubt very much it was plain luck that you were asked to participate. I'm sure your engineering skills had something to do with your invitation.
They're exactly as deterministic as they used to be! What Went Wrong's first edition dates to 1998 -- at that point hardware and software engineers had been dealing with race conditions, scheduling issues, and the like for decades, although Kletz doesn't get into the gory details as he's writing for process engineers rather than software developers. Computer systems have not become non-deterministic (barring maybe the quantum stuff, which I know nothing about); rather, they've become so complex that working out the conditions or classes of conditions under which an error occurs tests the limits of human analytical capacity. From our perspective, this can look a lot like nondeterministic behavior, but that's on us, not the systems.
Doesn't what you're saying effectively amount to non-determinism? If your safety-critical product crashes dangerously once every million hours of operation, on average, for reasons you can't explain or reproduce no matter how hard you try, isn't it hard to call it a systemic error for all practical purposes?
This really isn't my area of expertise by the way. Chris Hobbs explains what he means in this YouTube talk.