Despite being all around us, safety-critical software isn't on the average developer's radar. But recent failures of safety-critical software systems have brought one of these companies and their software development practices to the attention of the public. I am, of course, referring to Boeing's two 737 Max crashes, the subsequent grounding of all 737 Max aircraft, and its failed Starliner test flight.
How could such a distinguished company get it so wrong? Weren't the safety standards and certification process for safety-critical systems supposed to prevent this kind of thing from happening? Where was the FAA when the Max was being certified? These questions raised my curiosity to the point that I decided to discover what this specialized field of software development is all about.
In this post I'm going to share what I learned about safety-critical software development and how a little knowledge of it might be useful to "normal" programmers like you and me.
Safety-critical systems are those systems whose failure could result in loss of life, significant property damage, or damage to the environment.
Aircraft, cars, weapons systems, medical devices, and nuclear power plants are the traditional examples of safety-critical software systems.
But there are other kinds of safety-critical systems:
- code libraries, compilers, simulators, test rigs, requirements traceability tools, and other tools used to manage and build the kinds of systems we traditionally think of as safety-critical systems are safety-critical themselves
- software that provides data used to make decisions in safety-critical systems is likely safety-critical. For example, an app that calculates how much fuel to load onto a commercial jet is safety critical because an incorrect result from that app may lead to a life threatening situation
- other software running on the same system as safety-critical software may also be classified as safety-critical if it might adversely affect the safety functions of the safety-critical system (by blocking I/O, interrupts, or CPU, using up RAM, overwriting memory, etc.)
Safety-critical software development is a very specialized, expensive, methodical, slow, process-driven field of software development.
Typically, the software monitors its environments with various kinds of sensors and manipulates that environment by controlling actuators such as valves, hydraulics, solenoids, motors, etc. with or without the assistance of human operators.
These systems are usually embedded because they often have timing deadlines that can't be missed. Plus, the creators of these systems are responsible every aspect of the hardware, software, and electronics in the system, whether they created it or not. So the more stuff they put in the system, the more difficult and expensive it is for the company get it certified.
Safety-critical software systems are often distributed systems because:
- they are often large
- subsystems may be developed by different teams or companies
- many of these systems require redundancy and/or the separation of safety-critical code from non-safety critical code
All these factors push the system design towards multiple cooperating processors/subsystems.
There's a lot of this software being created. But it doesn't get nearly as much attention as consumer-focused software. For example, just about every story in Military & Aerospace Electronics discusses systems that contain safety-critical software.
It takes lots of resources and money to assemble the cross-functional expertise to develop, certify, sell, and support such complicated and expensive systems. So, while a few small companies might produce small safety-critical software systems, it's more common for these systems to be produced by larger companies.
3. Safety-critical software is almost certainly the most dependable software in the world at any given size and complexity
Software developed and certified as safety-critical is almost certainly the most dependable software in the world. The allowable failure rate for the most critical systems is absurdly low.
And these numbers are for the whole system (hardware and software combined). So the allowable failure rate for software at any given SIL level is even lower!
At SIL 4 (roughly equivalent to NASA's "maximum" category) we are talking about less than one failure per billion hours of continuous operation. That's 114,000 years!!!
To put that in perspective, if your smart phone were built to that standard its unlikely that you or any of your friends and family would have ever experienced a dangerous failure (dead phone, broken camera, random reboot, malware, data leakage, data loss, loss of connectivity, etc.), nor would you expect to encounter one for the entire span of your life! That's an incredibly high bar. And it just shows you how different this kind of product and software development is from the kind you and I do on a daily basis.
Here's my impression of where safety-critical software is supposed to fit in the software quality landscape:
Safety-critical software is designed, built, and tested to ensure it has ultra-low defect rates and ultra-high dependability.
The shuttle software is probably NASA's best example of software done right. At the time the article (linked above) was written, the shuttle software was 420 KLOC and had just one error in each of the last three versions. That's orders of magnitude fewer defects than your average software project. Read this article if you want to know what it takes to write the lowest defect software in the world.
(For the remainder of this post, I'm going to use the NASA Software Safety Guidebook as an example safety standard.)
In most companies, software developers are primarily concerned with getting the software to "work", then going faster, shipping more features, and delivering more value. But software developers involved in the creation of safety-critical systems must be concerned primarily with creating safe systems. And that is a very different task.
They must find all the realistic ways the system could create harm. And then develop ways to mitigate that harm. They also need to make sure the system, as built, functions as specified. They must convince themselves and their auditors that they've done an adequate job of building a safe system before their product can be certified.
We're way beyond unit testing, code reviews, and filling out some forms. We're talking about painstaking checking and re-checking, mandated processes, multiple rounds of analyses, and piles of documentation.
The image that came to my mind as I read through the NASA Software Safety Guidebook is that you have to jump through all these hoops as if you are preparing for a future court case where your product has killed a number of people. And your company and you personally are being sued for tens of millions of dollars. You need to be able to place your hand on a bible and swear that you and everyone working on your project:
- was properly trained
- followed best practices set out in the safety standard you followed
- minimized the risks of harm associated with your product as far as practicable
- earnestly worked to get your product certified as meeting your chosen safety standard by a competent certification body
And then you need to be able to point to a stack of boxes full of your documentation as proof that what you're saying is true, knowing that the plaintiffs' experts and lawyers are going to try to tear you apart.
If you feel like you might not win that hypothetical court case, then your product should not be released.
5. Safety and hazard mitigation requirements will dramatically increase the complexity of your product (and there will be a ton of them)
In addition to software engineering tasks associated with creating software for a normal product, you are going to have a lot of extra requirements related to safety and hazard mitigation. And, as we all know, as requirements interact code volume and complexity skyrocket. For example, we know that doubling the number of requirements will more than double effort required to complete most projects.
Plus, safety and hazard mitigation requirements are often difficult to implement both because they aren't "normal" things software developers do but also because they are inherently difficult. Let's look at an example.
Your system must fail safely. That's one of the fundamental principles of safety critical systems design. This means that under any reasonable scenario where the system is being used in accordance with the operating instructions, it must not cause a dangerous situation if something goes wrong.
For many systems that means stopping all actuators and reporting an error. For example, the blade in my food processor will stop immediately if I remove the lid while it is spinning. That's a simple case but for other systems, just figuring out how to fail safely is really difficult.
What does fail safely mean in the context of a heart bypass machine? The system can't just turn itself off if the power goes out because that will kill the patient.
So maybe it needs a battery backup to keep it going until the hospital's generator can kick in. And a charging sub-system to keep the battery charged. And a battery health monitoring subsystem to alert the maintainers of the system that the battery won't hold a charge any more. And there should probably be a condition in the code that will not let you start the bypass machine if the battery is defective. Or missing. Or discharged. And some kind of change to the display so the operator can be made aware of status of the machine's power.
By the way, adding a lithium ion battery to a safety-critical system isn't as straightforward as you would think. Even after extensive testing, Boeing's 787 still had battery problems in the field, which required an extensive investigation and changes to the design of the system.
Actually, after looking more closely at the requirements, a battery backup alone isn't enough because this system is required to continue to operate with two failures. So maybe we need two independent battery systems? Hmm... do we need to source each system from a different supplier so that a fault in one battery system doesn't bring them both down at the same time?
Or maybe we should install a hand-crank for the nurse to turn a dynamo, which can power the heart bypass machine if the power fails and the battery fails or is exhausted? And we'll need a subsystem to ensure the dynamo works and generates enough power to run the bypass machine. But even a new dynamo won't produce enough power to run all the machine's functions. So maybe we need a low-power mode where the machine runs essential functions only? Or maybe we need to redesign the whole system to use less power? And so on.
So that one safety requirement ("patient must not die if power goes out") ballooned into several explicit requirements and a bunch of questions.
And that's just one failure mode. We need to cover all of the realistic failure modes. What about:
- fluctuating power?
- dead power supply?
- damaged RAM?
- memory corruption errors?
- an exception coming out of nowhere?
- sensor failure?
- sensor disagreement?
- hard drive failure?
- a random hard shutdown during a surgery?
- pump failure?
- display failure?
- the case where the operator attempts to input settings appropriate for an elephant instead of a human?
- the case where the operator accidentally turns the machine off?
- a blown capacitor on one of the circuit boards?
- and... and... and...
It's pretty easy to see how the number of safety requirements could easily dwarf the functional requirements in a safety-critical system. And that the implementation of those requirements would drastically increase the complexity of your system for both hardware and software.
I was surprised to learn that NASA won't let you build anything you want as long as you take appropriate precautions and get it certified (pages 27 - 28).
NASA prohibits the construction of the following kinds of systems no matter how many software precautions you are willing to take:
- systems with catastrophic hazards whose occurrence is likely or probable
- systems with critical hazards whose occurrence is likely
- catastrophic is defined as: loss of human life or permanent disability; loss of entire system; loss of ground facility; severe environmental damage
- critical is defined as: severe injury or temporary disability; major system or environmental damage
- likely is defined as: the event will occur frequently, such as greater than 1 out of 10 times
- probable is defined as: the event will occur several times in the life of an item
A spacecraft propulsion system that will blow up the spacecraft unless a computer continuously makes adjustments to keep the system stable would almost certainly be rejected by NASA as too dangerous.
Even if the system designers believe the software reduces the risk of killing the crew down to 1 incident per 5,000 flight years, the system would likely be classified as "prohibited" because NASA knows that software is never perfect.
While there's some talk of using agile methods for safety-critical software development to improve speed and drive down costs, I don't think it's very realistic to think that agile will get very far in safety-critical circles. The Agile Manifesto defines 4 development values and 12 development principles and most of them are in direct opposition to the requirements of safety-critical software development.
"Welcome changing requirements, even late in development" seems to be a particularly problematic agile principle in the context of safety-critical software development. A single changed requirement could trigger a complete re-analysis of your entire project. If the change introduces a new hazard, it will have to investigated, analyzed, and possibly mitigated. The mitigation may result in multiple design changes. Design changes (hardware and/or software) may result in changes to the documents, safety manual, user manual, operator training requirements, test cases, test environments, simulators, test equipment, and code. Code changes require targeted retesting around the change. Plus, all your safety-related tests are supposed to be re-run for every code change. And, finally, you'll have to get your product recertified.
And that's the easy case. Imagine what would happen if your requirements change pushes your software to a higher safety level (say "minimum" to "moderate" for a NASA project or SIL 2 to SIL 3 for a IEC 61508 project)? Now you have to re-execute your project using the prescribed processes appropriate to your new safety level!
Most safety-critical software appears to be developed using the waterfall or spiral development models. NASA specifically recommends against using agile methods for the safety-critical elements of your software (page 87).
I've seen lots of examples where the V-model is mentioned as a specialized version of waterfall appropriate for safety-critical software development. In the V-model, the steps on the same horizontal level are loosely related. So, for example, you should be thinking about unit and integration tests when you are doing detailed design. Some sources even recommend writing the unit and integration tests during detailed design (before coding begins).
There's considerable controversy about what should be considered safety-critical. If a doctor uses a calculator app on her smartphone to calculate an IV drug dose for a patient on a very dangerous drug is the app safety-critical? What about the math library used by the calculator? What about the phone's OS and the all the other software on the phone? The answer is unclear.
Should doctors be forced to use certified calculators to determine IV drug dosages? Are calculators certified for safety-critical applications even available? I know this sounds like nitpicking; it's obvious that calculators work correctly, right? Actually, no. Harold Thimbleby has made the study of medical mistakes involving technology his life's work. And I found this video of him demonstrating numerous errors in the functioning of regular calculators deeply unsettling. It's not just one or two calculators that have errors. He calls them all "crap."
But it's not just calculators that we have to worry about. How many spreadsheets out there contain safety-critical data and calculations?
Looking in from the outside, authorities don't seem too concerned about the use of normal software for safety-critical functions. However, as a "regular" developer you might want to be aware that your software could potentially be used this way. Unless you are encouraging this use of your software, you are probably in the clear from a legal liability point of view but the ethics of it might be another story.
Many people criticize the safety standards for doing a poor job of addressing security (see the criticisms section below). So it shouldn't surprise you to learn that security researchers have found glaring security vulnerabilities in many classes of safety-critical products.
Because your entire safety case relies on your software behaving exactly as specified, you have a major problem if someone can make your system misbehave.
I was hoping that my investigation into safety-critical software development would introduce me to some secrets of fast, high quality software development beyond what I've already discovered. I was looking for some tools or techniques to give me an edge in my day job. But I found nothing useful.
Safety-critical software development succeeds, for the most part, by throwing large quantities of money and people at the problem of quality. I think there's definitely an element of "do it right the first time" that's applicable to all software development efforts where high quality is desirable but there's nothing magical happening here. For the most part it's process, process, and more process.
The closest thing I found to "magic" is automatic code generation from simulation models and correctness by construction techniques. But neither of these approaches are going to benefit the average developer working on a non-safety-critical project.
There are a bunch of safety standards out there. If you are developing your own system, the standard you use may be dictated by your industry. Or, under certain circumstances, you can choose the standard you intend to meet. Then you assemble a competent team, build your product, and attempt to get your product certified by a certification body or you might chose to self-declare that your product meets the standard, if that's an option in your industry.
If you're building a product for someone else they may dictate the standard you must meet as part of your contract. If your customer is NASA then you are expected to follow the NASA Software Safety Guidebook for software development (and possibly some others, depending on the nature of your system).
The rules are very complicated but the basic idea is that you have to evaluate how dangerous your product is, the probability that the danger might occur, and the extent to which your software is expected to mitigate that danger (as opposed to hardware controls or human intervention).
Software that ensures a consumer pressure cooker doesn't over-pressure and explode would be certified to a lower level than the software that autonomously controls the safety functions of a nuclear power plant. So you'd have to follow a more rigorous process and do more work to certify the software for the nuclear power plant.
NASA has 4 levels of certification for safety-critical software: "full", "moderate", "minimum", and "none" (page 28). The NASA Software Safety Guidebook will help you figure out which level your proposed system is at.
It's also possible to have a mixed level system. For example one sub-system might be at the "full" level but another sub-system might be at the "minimum" level.
Once you know what level(s) your product is at, you can look up which processes, analyses, and documentation, you need to follow, perform, and create to achieve that level. Each development phase has a table that tells you what you need to do (other standards have similar tables).
(I think these tables are interesting enough that I included the tables for each phase of the development from the NASA Software Safety Guidebook below. But you don't have to read them to understand the remainder of this post.)
Conception (the stage before requirements where you are defining what kind of system you need) (page 102):
Requirements (page 107):
Design (page 136):
Implementation (coding) (page 167):
Testing overview (page 181):
Operations and maintenance (page 199):
As you can see, even at the "minimal" safety level NASA wants you to do many, many things that most software projects never do.
Let's take a closer look at the dynamic testing table to illustrate my point. Even if you already routinely unit test your code for non-safety critical projects, I bet you aren't that thorough. Maybe you skip things that are particularly difficult to test and either not test them at all or "sort of" test them manually? Well, that's not good enough for safety-critical software. And getting to "good enough" might take more time than all the easy testing combined.
- typical sensor values
- extreme values of inputs
- all modes of each sensor
- every statement, branch, and path executed
- every predicate term tested
- every loop executed 0, 1, many, max -1, max, and max + 1 times
It doesn't take much imagination on your part to realize a few things are going to happen if you attempt to follow these testing requirements to the letter:
- you need to write all production code for testability, which will negatively impact other properties of your code
- the volume of your test code is going to be larger than the volume of your production code (probably by a wide margin)
- it's going to be difficult for you to write your test cases in such a way that you can track each of them back to the dynamic testing category it covers as well as the specific requirement it satisfies
- it's going to be even more difficult for you to know that you've created enough test cases to satisfy the dynamic testing requirements for every subroutine in your code base
- because you are testing so deeply in your system (not just the public API like most unit testers recommend for non-safety-critical systems), you are going to find it nearly impossible to refactor your code without causing chaos in your test suite
- if QA discovers a problem with your product, it's going to be a nightmare to modify your code and test cases without making any mistakes
There are two ways to go here.
In some industries you can simply declare that your product meets all the requirements of some standard, possibly with the assistance of an independent company to confirm the declaration.
While this might seem like an easier route to certification, your customers might not accept it. Plus, I'm not actually clear on when or under what circumstances you can self-declare that your product meets a certain standard. NASA doesn't allow it. And, based on what I've read, it seems unlikely that the FAA and the FDA allow it either. But other industries might be different? Please leave comment on this post if you know how self-declaration works.
The other way to go is to hire a certification body to independently evaluate your product and issue a certification if your product meets the standard you seek. This is the more rigorous of the two routes.
If you are going to work with an external company to help certify your product, they should be brought into the development process as early in your project life cycle as possible. It is not uncommon for an entire product to be developed and then fail to receive certification because of a mistake made very early in the development process. Adding missing requirements, processes, or documentation to a product after it has been constructed can be virtually impossible.
Certification is not guaranteed. Only 25% of companies seeking certification received it in this paper. I'm sure the rigor of certification varies from industry to industry and SIL to SIL but, as far as I can tell, it's a serious process everywhere.
If you are working on a NASA project that is particularly high risk or high value, NASA may assign an Independent Verification and Validation (IV&V) team to your project (page 102). This team is tasked with ensuring that your project is on track to deliver the quality and functionality required for a safe and successful system/mission. The work the IV&V team does is over and above all the work you are expected to do for the project, not a replacement for it. The IV&V team can also act as a resource to answer your questions and help you tailor your effort appropriately to the risks of your project.
You are responsible to tracking usage and defects in your product as it operates in the field. The scope of this activity varies depending on the kind of product you are creating but the goal is to gather data to show how dependable your system is in actual use.
For example, Rolls-Royce jet engines use satellite communications to "phone home" with engine telemetry. Rolls-Royce collects data on the number of hours of use, along all kinds of performance, fault, and failure data for their engines. You probably can't get better tracking than that.
On other other end of the spectrum, something like an airbag in a car might rely on estimates of deploy events and consumer reports of malfunctions. This is much lower quality tracking data than Rolls-Royce collects but it's still better than nothing.
I've described how safety-critical software systems are supposed to be built. But the reality of the situation appears to be quite different. It didn't take me long to realize that some companies approach safety-critical software development with little regard for the standards or the seriousness of what could happen if their systems fail.
The Barr Group surveyed embedded developers around the world and found a shocking lack of best practices on safety-critical projects.
28% of all respondents work on safety-critical projects.
Only 67% of the safety-critical subset meet a safety standard!
Coding standards, code reviews, and static analysis are all mandatory at my e-commerce job because they are cost-effective, proven ways to improve quality. But a significant portion of the safety-critical subset don't do them.
And it just gets worse. Only 59% of the safety-critical subset do regression testing! Less than 100% do system level testing!
These numbers are hurting my brain!
And there's much more to be concerned about in this survey. I encourage you to watch the CEO of the Barr Group go over the survey findings in this YouTube talk.
Anyway, if you look at these results and compare them to all those tables I reproduced from the NASA Software Safety Guidebook, its pretty easy to imagine how many of the other tasks are not being done on safety-critical projects in the real world.
If you haven't heard Toyota's unintended acceleration problems I suggest you read a brief summary on wikipedia to get yourself up to speed. This is definitely a safety-critical software system and it is a mess.
Let's set the hardware and safety culture problems aside for a moment and just look at some of the software findings for the Electronic Throttle Control System. It's the system responsible for reading the pedal position and controlling the engine throttle settings:
- over 250 KLOC of C (not including comments)
- 2,272 global variables
- 64 instances of multiple declaration of the same global
- 333 casts alter value
- 22 uninitialized variables
- 67 functions with complexity over 50
- one function was found to have a complexity of 146. It contained 1,300 lines of code and had no unit tests
- bugs that can cause unintended acceleration
- gaps and defects in the throttle failsafes
- blackbox found to record incorrect information
- stack overflow and defects that can lead to memory corruption
- and more...
Anyone who has made it this far in this post knows this is not the kind of code you want to see in a safety-critical system.
For more details please watch Philip Koopman point out multiple problems with Toyota's hardware, software, and safety culture (slides). It's an eye-opening talk.
I ran across a video several years ago where a non-programmer was describing how he glued together a bunch of tech to allow his continuous glucose monitoring device to send commands to his insulin pump to automatically adjust the amount of insulin it delivered to his body to control his diabetes.
As a professional software developer, who is very aware how difficult it is to write correct software, I was very alarmed by what this guy was doing. He knew basically nothing about programming. He was building this system from tutorials and trading information with other non-programmers who were also working on the project in their spare time. I believe it was literally the first thing he ever programmed and it is definitely safety-critical--too little or too much insulin can definitely kill you.
Now, if you want to take your life into your own hands and build something like this for your personal use, I say go for it. But these people were sharing the design and code on a website and making it available for anyone to use, which is not cool in my opinion. The FDA issued a public warning not to build your own "do-it-yourself" pancreas after watching from the sidelines for several years.
Anyway, this is another example of how the theory is radically different than the reality of safety-critical software development.
Two Boeing 737 Max crashes and a failed Starliner test flight are what inspired me to write this post. But as I dug deeper and deeper into safety-critical software development and safety-critical software development at Boeing in particular I've realized that this topic deserves its own post. I'll link to that post here when I publish it.
The standards are designed and approved by committees. They are based on what the drafters could get approved, instead of what is most appropriate or what is backed by evidence. While there appears to be broad agreement that coding standards, design reviews, code reviews, unit testing, and the like are good things, the standards disagree on specific practices and approaches.
One criticism I read suggested that industry wants to minimize the work prescribed by the standards so it lobbies against adding more prescriptions to them. In some cases it is because they want to minimize their work to get their products certified. In other cases, it is so they can pursue the most effective techniques as they become available instead of being forced to do something they know is ineffective compared to the state of the art, just because the standard says they have to do it.
There is actually little evidence that following the prescribed activities will result in the expected quality/failure rates experienced in the field. I think page 12 of the NASA Software Safety Guidebook describes the situation very well:
Opinions differ widely concerning the validity of some of the various techniques, and this guidebook attempts to present these opinions without prejudging their validity. In most cases, there are few or no metrics as of yet, to quantitatively evaluate or compare the techniques.
Other people criticize the standards because they are mostly silent on security. I think that's a valid concern. If your product isn't secure, it's hard to argue that it's safe.
Here's another slide from the Barr Group's 2017 Embedded Safety & Security Survey. It shows that only 78% of the respondents who are working on safety-critical projects that are connected to the internet even have security as a requirement. It's not even on the radar of 22% of projects! Now, how many of the 78% of respondents who do have security requirements do you suppose are doing security well?
Here's the compliance with coding standards, code reviews, and static analysis for the safety-critical and connected to the internet subset. The ratios are about the same as the safety-critical only subset I showed you previously in this post. If you're not following a coding standard, doing code reviews, or using static analysis, I have a hard time believing you're somehow writing ultra-secure code for your internet connect, safety-critical product.
Nancy Leveson shows many examples where following the standards isn't enough. Just as we acknowledge that all software contains defects and plan accordingly, we should also acknowledge the possibility that our software may also contain/experience:
- component interaction failures
- requirements errors
- unanticipated human behavior
- system design errors
- non-linear or indirect interactions between components or systems
Leveson argues that we need to build safety-critical systems that can cope with these factors. She proposes STAMP as an additional approach to improve system safety for these types of issues. She has a great talk on YouTube that covers the basics.
By the way, she's not an odd-ball spouting BS from the sidelines. She's the Professor of Aeronautics and Astronautics at MIT and she works with the DOD, NASA, and others on safety-critical systems.
Here are some of my ideas for where safety-critical software development is going and some of the challenges we'll face in getting there.
- autonomous cars and driver assist features
- internet of things (including the industrial internet of things)
- robotics and industrial automation
- military systems
- medical devices and healthcare related applications
- space-based applications
- demand for larger, more complex systems surpasses our ability to build them to the quality standards this kind of software requires
- we need a way to drastically reduce the cost and time required to produce this kind of software
- how can we incorporate AI and machine learning into safety-critical systems?
- how do we allow safety-critical systems to communicate with each other and with non-safety-critical systems without degrading safety?
- security is going to continue to be a thorn in our sides
- more and more software is going to be used for safety-critical purposes without having been designed and built to safety-critical standards (like the doctor calculating IV drug doses on her smartphone)
- how do we attract sufficient numbers of talented software developers to work on safety-critical projects?
- NASA Software Safety Guidebook (pdf)
- Great Introduction to safety-critical systems (pdf)
- Martyn Thomas talk on safety-critical systems and criticisms of the standards (video)
- Martyn Thomas talk on correctness by construction techniques (video)
What do I want you to take away from this?
Safety-critical software is all around us and we can only expect more of it as we connect more of the world to the Internet, make "dumb" devices "smart", and invent entirely new products to make our lives better.
Developers of safety-critical software systems feel the pain of the lack of maturity of the software engineering field more than most of software developers. Much to my disappointment, they don't appear to be hoarding any magic tools or tricks to get around the problems we all face in trying to develop complex products. Aside from doing more up-front planning than the average project, they just throw money and process at their problems just like the rest of us.
There's a wide gap between how safety-critical software systems are supposed to be developed and certified and what actually happens. I was shocked to learn that many safety-critical projects don't use coding standards, static analysis, code reviews, automated testing, or follow any safety standard at all!
If you find yourself working on such a project, maybe take some time to consider the potential consequences of what you're doing. There are plenty of jobs for talented developers where you aren't asked to endanger people's lives.
Despite the many challenges of safety-critical software development, most safety-critical software systems appear to be, more or less, safe enough for their intended purpose. That doesn't mean that it doesn't contain errors, or even that it doesn't kill people (because it almost certainly does). What I mean is that most safety-critical software is trustworthy enough to be used for its intended purpose as evidenced by the public's willingness to fly on planes, live near nuclear power plants, and have medical devices implanted in their bodies.
Until Boeing's recent problems with the 737 Max, the go-to example of a safety-critical software error resulting in death was malfunction of the Therac-25 radiation therapy machines. An investigation revealed that the machines were pretty much a disaster from top to bottom. And between 1985 and 1987 a handful of people received harmful doses of radiation from Therac-25 machines and a couple of people died.
That the Therac-25 was the go-to example of a failure of safety-critical software causing death says something about just how well most of these systems perform in the real world. After all, safety-critical software is controlling nuclear power plants, medical devices, rail switching, elevators, aircraft, traffic lights, spacecraft, weapons systems, safety features in cars, and so on. But when is the last time you heard about a software error causing a safety-critical system to fail and kill people (besides the Boeing 737 Max and Tesla autopilot fatalities)?
Have a comment, question or a story to share? Let me have it in the comments.
Enjoy this post? Please "like" it below.