On January 28, 1986, the space shuttle Challenger launched into a clear blue sky. It accelerated quickly and then unexpectedly exploded. The loss of the Challenger put NASA under serious scrutiny and led to a presidential commission to investigate the accident.
One of the people called to serve on the commission was Richard Feynman, a well-respected physicist who had helped develop the theory of quantum electrodynamics. In his book 'What Do You Care What Other People Think?' he spends much of the text recounting his experience of investigating the Challenger accident. He describes how, in a public hearing, he demonstrated that cold weather would have reduced the flexibility of the O-rings, leading to the failure of a solid rocket booster and the total loss of the spacecraft.
He performed this demonstration in typical Feynman style, with a glass of ice water and a sample of O-ring material from a solid rocket booster. In front of a packed public meeting he showed how the O-ring lost its resilience in the cold. He also found out that NASA had accepted damage to the O-rings on past missions, and had used a statistical method to predict the failure of an O-ring. In other words, NASA was gradually relaxing its safety criteria.
Feynman went on to investigate the shuttle's main engines to see whether the slowly eroding safety factors that led to the solid rocket booster failure were present there as well. He found a similar pattern of technical problems and slowly shrinking safety margins.
Rocket engines such as those used on the shuttle are very complex systems. The usual way to develop engines, as practised for jet aircraft, is to design the components first and test them in detail, then build subsystems from those components and test the subsystems in detail, and only then design and build the final engine. This way, a design defect in a component is found early in the development process.
Feynman describes how the entire shuttle engine was designed up front, rather than being developed in this standard way. This approach was chosen to save time. The problem with it was that defects only became apparent on a full engine test rig. This meant that when a component had a design defect, whole subsystems might require redesign.
If the problems could have been found at the design stage, it would have saved billions of dollars in testing and redesign of the engines. However, the engine design faults were subtle, usually the result of interactions that would have been impossible to model at the design stage. Many of the problems were not easily diagnosed even with a full test engine, so to expect the design stage to find these faults would be grossly unrealistic.
By taking an up-front design approach, rather than following a process of gradual development from simple components through subsystems to full engines as the commercial jet aircraft industry did, NASA found itself dealing with a huge number of design faults that were difficult to analyse and solve.
What can we learn from this as software developers? One lesson is that although we would like to catch all faults at the design stage, complex systems have subtle interactions which can only be fully explored through actual implementation. It shows us that faith in up-front design isn't warranted, and that we should develop complex systems from the bottom up rather than from the top down, building and testing sub-units as we go.
In many respects it is confirmation of the unit testing approach, where every sub-unit of code has independent tests to ensure that it operates properly on its own. These units can then be combined into modules, and the modules eventually into the complete system, with tests at each level. It also points to failure modes where small defects in subsystems can lead to catastrophic failures at a higher level.
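As a rough illustration of that bottom-up idea, here is a minimal Python sketch. The functions parse_reading, mean and average_reading are hypothetical names invented for this example, not taken from any real project. Two small units are tested on their own first, and only then is the subsystem built from them tested, including one case where the interaction between the units only shows up at the higher level.

```python
# Hypothetical units and subsystem, invented purely for illustration.

def parse_reading(text):
    """Unit 1: convert one text field into a float."""
    return float(text.strip())


def mean(values):
    """Unit 2: arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


def average_reading(line):
    """Subsystem: combine both units to average a comma-separated line."""
    return mean([parse_reading(field) for field in line.split(",")])


def test_units():
    """Each unit is checked in isolation before anything is composed."""
    assert parse_reading(" 2.5 ") == 2.5
    assert mean([1.0, 2.0, 3.0]) == 2.0
    try:
        mean([])
    except ValueError:
        pass
    else:
        raise AssertionError("mean([]) should raise ValueError")


def test_subsystem():
    """The composed subsystem gets its own tests on top of the unit tests."""
    assert average_reading("1.0, 2.0, 3.0") == 2.0
    # An interaction that only appears once the units are combined:
    # splitting an empty line yields one empty field, which the parser rejects.
    try:
        average_reading("")
    except ValueError:
        pass
    else:
        raise AssertionError("an empty line should raise ValueError")


if __name__ == "__main__":
    test_units()
    test_subsystem()
    print("all tests passed")
```

If a unit test fails, the defect is located in a few lines of code. If only the subsystem test fails, the fault lies in how the units interact, which is the kind of problem that was so expensive to diagnose on a full engine.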
Experience from outside the software development field can inform and educate us about our art. Understanding how the parts of a complex system can interact and fail is critical. Perhaps we should pay more attention to the engineering principles employed outside software development?