Building software for the enterprise is hard. Many things can go wrong when important architectural decisions are not made up front.
One of the enduring problems with this kind of software is how to manage data, since such applications are meant to grow fast and can produce large datasets within a few months.
Because the application needs to evolve constantly, it should be reliable, scalable, and reasonably easy to maintain.
In this post, I want to focus on the first pillar, reliability.
You may have an idea of what reliability means when it comes to software.
The concept is usually associated with fault-tolerant systems: systems that can tolerate faults and keep things running from the user's perspective.
A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.
Because of that, we should design systems that can tolerate faults in three different areas: hardware faults, software errors, and human errors.
In this particular section, I want to focus on one of the main reasons for software outages: Human Errors.
A study found that only 10-25% of outages were caused by hardware or network failures; the remaining 75-90% were caused by configuration errors made by operators - in other words, by humans.
So if human errors represent the majority of system outages, what can we do about it?
Well, most software is still designed by humans - at least in 2020; who knows about the next 5-10 years - and humans are known to be not very reliable.
It's not our fault; we simply can't sustain the same level of work quality every single minute.
To solve that problem we need to design mistake-proof systems.
So instead of trying to cure failures, you should start preventing them:
Have a solid code review process. A good review catches bugs and prevents the careless mistakes people make after coding for hours or days on end.
Make sure integration tests cover the important features of the system. These tests ensure that real-world business rules remain intact as the system evolves.
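As an illustration, an integration-style test can exercise a business rule end to end. The `Order` model and discount rule below are hypothetical stand-ins for your own domain code:

```python
# Hypothetical domain code; in a real project this would live in the application itself.
class Order:
    def __init__(self, items):
        self.items = items  # list of (name, price) pairs

    def total(self):
        subtotal = sum(price for _, price in self.items)
        # Business rule under test: orders over 100 get a 10% discount.
        return subtotal * 0.9 if subtotal > 100 else subtotal

# Tests exercising the rule end to end (a runner such as pytest would collect these).
def test_small_order_has_no_discount():
    assert Order([("book", 40)]).total() == 40

def test_large_order_gets_discount():
    assert abs(Order([("laptop", 200)]).total() - 180.0) < 1e-9

test_small_order_has_no_discount()
test_large_order_gets_discount()
```

If someone later changes the discount logic by accident, this test fails before the code ever reaches production.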
Look for design patterns that let you build high-level abstractions, so people don't have to touch low-level functions directly.
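A minimal sketch of the idea, using a repository facade over data access (the table and names are illustrative): callers go through one well-reviewed method instead of writing their own low-level queries.

```python
import sqlite3

class UserRepository:
    """High-level abstraction: callers never touch SQL or connections directly."""

    def __init__(self, connection):
        self._conn = connection

    def find_by_email(self, email):
        # The low-level query lives in one well-reviewed place.
        cursor = self._conn.execute(
            "SELECT email, name FROM users WHERE email = ?", (email,)
        )
        return cursor.fetchone()

# Demo with an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, name TEXT)")
conn.execute("INSERT INTO users VALUES ('ada@example.com', 'Ada')")
repo = UserRepository(conn)
print(repo.find_by_email("ada@example.com"))  # ('ada@example.com', 'Ada')
```

If the schema or query ever needs to change, it changes in exactly one place, reviewed once, instead of in every caller.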
Cut off production database access, or restrict permissions to critical services. Instead, build a CLI (command-line interface) or an admin interface so operators can do their work without breaking things.
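A hypothetical sketch of such an admin CLI: only a whitelist of safe operations is exposed, instead of handing out raw database access. The action names and `run` helper are made up for illustration:

```python
import argparse

# Only safe, whitelisted operations are exposed; everything else is refused.
ALLOWED_ACTIONS = {"deactivate-user", "resend-invoice"}

def run(action, target):
    if action not in ALLOWED_ACTIONS:
        raise SystemExit(f"refusing unknown action: {action}")
    # A real tool would call an internal service here and log who did what.
    return f"{action} applied to {target}"

def main(argv=None):
    parser = argparse.ArgumentParser(description="Admin tasks without raw DB access")
    parser.add_argument("action", choices=sorted(ALLOWED_ACTIONS))
    parser.add_argument("target", help="id of the record to act on")
    args = parser.parse_args(argv)
    print(run(args.action, args.target))

main(["deactivate-user", "42"])  # prints: deactivate-user applied to 42
```

The point is that an operator can only do what the tool allows, and every action can be validated and audited.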
"Adding manpower to a late software project makes it later." - Brooks's Law. So instead of hiring more people, try establishing a solid process such as continuous integration and continuous delivery, to ship more value in less time and with more safety.
Provide sandbox environments where people can test configurations and new features before going to production. Giving developers a Docker setup with the same configuration as the production servers helps prevent configuration errors.
Use rollout strategies to prevent mass outages. It should be easy to revert a configuration or deployment before it causes a system-wide failure. Blue-green deployments are essential. When it comes to databases, use migrations so you can quickly revert schema changes.
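The migration idea can be sketched as a pair of `up`/`down` steps, so every schema change ships with a way to undo it. This uses sqlite3 for brevity; the table and column names are illustrative, and a real project would rely on its framework's migration tool:

```python
import sqlite3

def up(conn):
    # Deploy: add the new column.
    conn.execute("ALTER TABLE users ADD COLUMN phone TEXT")

def down(conn):
    # Revert: recreate the table without the column (portable across SQLite versions).
    conn.executescript("""
        CREATE TABLE users_old AS SELECT email FROM users;
        DROP TABLE users;
        ALTER TABLE users_old RENAME TO users;
    """)

def columns(conn):
    return [row[1] for row in conn.execute("PRAGMA table_info(users)")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
up(conn)    # deploy the change
down(conn)  # something went wrong -> revert quickly
```

Because `down` exists and is tested before deploy, reverting a bad release is a routine operation rather than a panicked improvisation.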
"A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system." - Gall's Law
Use logs to keep track of system health, together with monitoring and alerting services. Tools such as New Relic, Prometheus, Grafana, CloudWatch, and PagerDuty, combined, can give you a clear view of your application and infrastructure health, and show you early warning signs of failure.
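A minimal sketch of the health-signal idea: structured logs plus a naive in-process error-rate alert. The threshold and service name are made up for illustration; a real setup would export these metrics to Prometheus or CloudWatch and page through PagerDuty.

```python
import logging

logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s %(message)s")
log = logging.getLogger("checkout-service")  # hypothetical service name

class ErrorRateMonitor:
    def __init__(self, threshold):
        self.threshold = threshold  # alert when the error rate exceeds this fraction
        self.total = 0
        self.errors = 0

    def record(self, ok):
        self.total += 1
        if not ok:
            self.errors += 1
        rate = self.errors / self.total
        if rate > self.threshold:
            # Early warning signal: in production this would trigger an alert.
            log.warning("error rate %.0f%% exceeds %.0f%%",
                        rate * 100, self.threshold * 100)
        return rate

monitor = ErrorRateMonitor(threshold=0.25)
for ok in (True, True, False, True):
    monitor.record(ok)  # the failed request pushes the rate past the threshold
```

The warning fires while only one request has failed, which is exactly the point: you want the signal before users notice a full outage.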
So, to sum this up:
The reliability of your system depends on how bug-free it is, how good you are at monitoring it, and how well you have protected against the myriad issues and problems it has. - Jay Kreps