Any complex software must be cynical if it has to be stable. Such systems expect bad things to happen, and do not trust external external systems. They are designed and developed in a way that it doesn’t even trust itself, so internal barriers are put to protect itself from failures. They also avoid getting too intimate with other systems, because it could get hurt.
When writing complex enterprise software, it is practically impossible to ensure bug free code when you release to production. In an early stage startup environment, the focus is on implementing features at minimal cost, which would lead to choosing the easiest design over the more costly but correct design. This happens mostly because you have a very short runway before you go bankrupt, and the business team might be following multiple leads or features out of which only one or two would ultimately make the product a success.
Highly stable design usually costs the same to implement as the unstable one. The product as a whole being “stable” doesn’t mean that your individual applications or services have to stay up and running, but rather that the user can still get the work done. No matter how smart the engineers you hire are, individual services are going to crash. A robust system keeps processing user requests even when there are component failures.
Before the catastrophic failure, some component of the system has to initially fail. The key is to contain the trigger, preventing it from spreading and taking the entire system down. Denying the inevitability of failures robs you of your power to control and contain them. Once you accept that failures will happen, you will have the ability to design your system’s reaction to specific failures.
At each step in a chain of failure, the crack may accelerate, slow or stop. A highly coupled system with many degrees if coupling offers more pathways for the cracks to propagate along.
The initial design decisions you take has a huge impact on stability in the long term, but those ironically are the exact decisions you have to make in the early stage with the least amount of knowledge. These are also sadly the decisions which are the most difficult to reverse later on. Stability can’t and should not be traded for velocity.