Rajasegar Chandran

Posted on Dec 7, 2022

Everything will ultimately fail

#architecture #productivity #fallibility #redundancy

In this post, we are going to take a look at why we need to design failure modes for our software and applications, so that we can contain the damage and protect the rest of the system from complete failures.

Before diving into the topic, we first need to understand the fallible nature of the things around us, from hardware and software to networks. Even we human beings are not infallible.

Always bear human fallibility in mind
-- David Deutsch, The Beginning of Infinity

Hardware is fallible

Hardware is fallible, so we add redundancy. In engineering, redundancy is the intentional duplication of critical components or functions of a system with the goal of increasing reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance. This allows us to survive individual hardware failures, but increases the likelihood of having at least one failure present at any given time.
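
The trade-off in the last sentence can be made concrete with a little arithmetic. The sketch below assumes that each component fails independently and is down with the same probability p at any given moment; both assumptions, and the 1% figure, are illustrative and not from a real system. With p = 1% and two replicas, the chance that both are down at once falls to 0.01%, while the chance that at least one is down rises to about 1.99%.

```python
def p_at_least_one_failed(p: float, n: int) -> float:
    """Probability that one or more of n independent components is down."""
    return 1 - (1 - p) ** n


def p_all_failed(p: float, n: int) -> float:
    """Probability that all n independent replicas are down at the same time."""
    return p ** n


if __name__ == "__main__":
    p = 0.01  # assumed chance that any single component is down right now
    for n in (1, 2, 3, 5):
        print(
            f"replicas={n}  "
            f"at least one down={p_at_least_one_failed(p, n):.4%}  "
            f"all down={p_all_failed(p, n):.8%}"
        )
```

Adding replicas makes a total outage rarer, but it also means there is more often something broken that someone has to notice and repair.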

Software is fallible

Our applications are made of software, so they're vulnerable to failures. We add monitoring to tell us when the applications fail, but that monitoring is made of more software, so it too is fallible.

We are fallible too

Humans make mistakes; we are fallible too. This fallibility is the hardest thing for us to grasp. Our knowledge is limited, and those limits routinely prevent us from realizing just how much we do not know.

According to Jonathan Crowe, an Associate Professor in the T C Beirne School of Law at the University of Queensland, there are three forms of human fallibility.

Epistemological fallibility
Psychological fallibility
Ethical fallibility

Laws which were once presented as the decrees of a god-given king are now frankly the confused commands of fallible men
-- Will Durant, The Lessons of History

So, we automate actions, diagnostics, and processes. This leads to a phenomenon called "automation bias": the propensity for humans to favor suggestions from automated decision-making systems and to ignore contradictory information that comes from a non-automated source, even when that information is correct. Automation removes the chance for an error of commission, but increases the chance of an error of omission.

Errors of Commission

Commission errors occur when users follow an automated directive without taking into account other sources of information. They result from a combination of a failure to take information into account and an excessive trust in the reliability of automated aids.

Commission errors appear for three reasons:

overt redirection of attention away from the automated aid
diminished attention to the aid
active discounting of information that counters the aid's recommendations

For example, a spell-checking program incorrectly marking a word as misspelled and suggesting an alternative would be an error of commission.

Errors of Omission

Omission errors occur when automated devices fail to detect or indicate problems and the user does not notice because they are not properly monitoring the system. They have been shown to result from cognitive vigilance decrements.

For example, a spell-checking program failing to notice a misspelled word would be an error of omission.

No automated system can respond to the same range of situations that a human can. Therefore, we add monitoring to the automation. More software, more opportunities for failures.
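
One common way to monitor automation is a heartbeat: the automated job reports in regularly, and a separate check raises an alarm when it goes quiet, so that a silent failure does not become an error of omission. The sketch below is a minimal illustration and not a production monitor; the `Heartbeat` class, its method names, and the five-second threshold are all hypothetical. Note that this check is itself more software, so it can fail too.

```python
import time
from typing import Callable


class Heartbeat:
    """Tracks when an automated job last reported that it is alive."""

    def __init__(self, max_silence_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_silence_s = max_silence_s
        self._clock = clock          # injectable so the class can be tested
        self._last_beat = clock()

    def beat(self) -> None:
        """Called by the automated job each time it completes a cycle."""
        self._last_beat = self._clock()

    def is_stale(self) -> bool:
        """True if the job has been silent for longer than allowed."""
        return self._clock() - self._last_beat > self._max_silence_s


if __name__ == "__main__":
    hb = Heartbeat(max_silence_s=5.0)  # assumed threshold, for illustration
    hb.beat()
    print("stale?", hb.is_stale())
```

A human or another system still has to look at `is_stale()`, which is exactly the regress this section describes.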

Networks are fallible

Networks are built out of hardware, software, and very long wires. Therefore, networks are fallible. Even when they work, they are unpredictable because the state space (set of all possible configurations) of a large network is, for all practical purposes, infinite. Individual components may act deterministically, but still produce essentially chaotic behavior.

Every safety mechanism we employ to mitigate one kind of failure adds new failure modes. We add clustering software to move applications from a failed server to a healthy one, but now we risk "split-brain syndrome" if the cluster's network acts up.

Split-brain syndrome refers to data or availability inconsistencies that arise when two separate data sets with overlapping scope are maintained, either because of how the servers are arranged in a network design or because of a failure condition in which the servers stop communicating and synchronizing their data with each other.
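
A widely used mitigation, not covered in this post, is to require a majority quorum: a node may act as primary only if it, together with the peers it can still reach, forms a strict majority of the cluster. The sketch below is a simplified illustration with hypothetical function names. It assumes the cluster size is fixed and known to every node, and it ignores the details of real consensus protocols.

```python
def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """True if this node plus the peers it can reach form a strict majority."""
    votes = reachable_peers + 1  # count this node itself
    return votes > cluster_size // 2


def may_act_as_primary(reachable_peers: int, cluster_size: int) -> bool:
    """A node without quorum must step down instead of accepting writes."""
    return has_quorum(reachable_peers, cluster_size)


if __name__ == "__main__":
    # A 5-node cluster is partitioned into a group of 3 and a group of 2.
    print("node in group of 3:", may_act_as_primary(reachable_peers=2, cluster_size=5))
    print("node in group of 2:", may_act_as_primary(reachable_peers=1, cluster_size=5))
```

Because two disjoint groups cannot both hold a strict majority, at most one side of a partition keeps accepting writes. The cost is a new failure mode of its own: an even split, such as 2 and 2 in a 4-node cluster, leaves neither side able to act.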

Three Mile Island

It's worth remembering that the Three Mile Island accident was largely caused by a pressure relief valve, a safety mechanism meant to prevent certain types of over-pressure failures. The accident is the most significant in U.S. commercial nuclear power plant history. On the seven-point International Nuclear Event Scale, it is rated Level 5, "Accident with Wider Consequences".

So, faced with the certainty of failure in our systems, what can we do about it? Accept that, no matter what, your system will have a variety of failure modes. Deny that inevitability, and you lose your power to control and contain them. Once you accept that failures will happen, you have the ability to design your system's reaction to specific failures.

Crumple Zones

Automobile manufacturers build crumple zones: areas designed to protect passengers by failing first. Also called crush zones or crash zones, they are a structural safety feature, used mainly in automobiles, that deforms in a controlled way during a collision to increase the time over which the change in velocity (and consequently momentum) occurs.

Typically, they are located in the front part of the vehicle, to absorb the impact of a head-on collision, but they may be found on other parts of the vehicle as well. Similarly, in software, you can create safe failure modes that contain the damage and protect the rest of the system.
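
As a rough illustration of a software crumple zone, the sketch below wraps a non-critical dependency so that its failure is absorbed and logged, and the rest of the request still succeeds with a degraded result. Everything here is hypothetical: the `crumple_zone` helper, the recommendation service, and the product page are made up for the example, and a real system would likely add timeouts and limits on repeated failures.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("crumple_zone")


def crumple_zone(operation: Callable[[], T], fallback: T) -> T:
    """Run a non-critical operation; on failure, log it and return the fallback."""
    try:
        return operation()
    except Exception:
        # The failure is contained here instead of propagating to the caller.
        log.exception("non-critical operation failed; using fallback")
        return fallback


def fetch_recommendations() -> list[str]:
    # Hypothetical flaky dependency that is currently failing.
    raise TimeoutError("recommendation service did not respond")


def render_product_page(product: str) -> dict:
    # Recommendations are optional, so they are allowed to fail first.
    recommendations = crumple_zone(fetch_recommendations, fallback=[])
    return {"product": product, "recommendations": recommendations}


if __name__ == "__main__":
    print(render_product_page("kettle"))
```

The design choice is deciding in advance which parts are allowed to fail: the page without recommendations is a planned failure mode, while an unhandled exception that takes down the whole page is an unplanned one.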

If you do not design your failure modes, then you will get whatever unpredictable—and usually dangerous—ones happen to emerge.
