The book *Release It!* covers the things that separate feature-complete applications from production-ready applications. It's a collection of practical, concrete patterns that will help you improve the architecture of your systems. I really enjoyed reading it and have already found ways to apply its lessons.
Let's go over a small selection of the takeaways.
The book talks extensively about integration points: the borders between components that talk to each other. Because a network is almost always involved in communication between components, there is no guarantee that everything is sent and received as intended. Add to that the fact that any component can become unavailable at any time. The way you set up your integration points determines a big part of the stability of your platform.
What happens when one component is unavailable? In many cases, other components will become unavailable too. Imagine a webshop that verifies your credit card information during checkout using a third-party service. Depending on business requirements, the credit card verification being down may mean customers cannot finish their orders. Depending on the implementation, it may even take down the whole website because of a growing queue of requests waiting on an unresponsive service.
Given a system consisting of several components, each with several integration points between them, there is a big risk of such cascading failures. How can we prevent failures from spreading through our system?
To reduce the impact of failures in other components, you can safeguard your integration points in multiple ways. The most important one is using timeouts on calls to remote components. Timeouts ensure that your component doesn't waste too many resources on a single call. A common symptom when a remote component fails is a growing number of threads in the calling component, all blocked waiting for a response. In most cases there is a limit to how many threads can run at the same time. With timeouts, you bound how long these threads stay blocked and reduce their impact on your component.
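As a minimal sketch of this idea, here is one way to bound a blocking call in Python. The `call_remote_service` function and the half-second timeout are hypothetical stand-ins, not something from the book:

```python
import concurrent.futures
import time

def call_remote_service():
    # Hypothetical stand-in for a slow remote call that never answers in time
    time.sleep(2)
    return "response"

def call_with_timeout(fn, timeout):
    # Run the call on a worker thread and give up after `timeout` seconds,
    # so the calling thread is freed instead of blocking indefinitely.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn)
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return None  # caller can fail fast or fall back to a default
    finally:
        pool.shutdown(wait=False)

print(call_with_timeout(call_remote_service, timeout=0.5))  # None: we gave up
print(call_with_timeout(lambda: "fast", timeout=0.5))       # fast
```

In real code you would usually rely on the timeout support of your HTTP or database client instead of wrapping calls yourself, but the effect is the same: a slow dependency costs you half a second, not a thread held hostage indefinitely.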
When a component hasn't been responding as expected for the last couple of minutes, it doesn't make sense to keep trying; your calls may even be making the situation worse. Instead, by applying the Circuit breaker pattern, a call to a failing component is cancelled before it is even started. Every once in a while, a call is let through to check whether the failing component has recovered. This reduces the stress on the already struggling component and makes the calling component use fewer resources.
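The pattern can be sketched in a few lines. This is a simplified illustration, not the book's implementation; the thresholds and names are made up, and production libraries add per-state metrics, concurrency safety, and fallbacks:

```python
import time

class CircuitBreaker:
    # Illustrative sketch of the Circuit breaker pattern; thresholds are arbitrary.
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds before a trial call
        self.failures = 0
        self.opened_at = None                       # None means circuit closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the struggling service
                raise RuntimeError("circuit open")
            # Half-open: fall through and let one trial call happen
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0       # success: close the circuit again
        self.opened_at = None
        return result
```

The interesting part is the fast `RuntimeError`: while the circuit is open, the caller gets an immediate answer and the failing service gets room to recover.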
To verify that your components are resilient to failures in other components, you can (and should!) test exactly that. Popularized by Netflix, chaos engineering is the practice of regularly removing or disabling parts of a system to verify that it behaves well without them. This could mean shutting down a (virtual) machine, changing firewall rules to block specific traffic, or even pulling the plug on a whole data center. By making this a regular exercise, you are forced to make sure your components handle it well, and you won't be caught off guard when it happens for real.
It is common to make your system's availability and performance goals very specific, most often by agreeing on a Service Level Objective (SLO). For example, you can agree that 99% of customer search requests should be handled in less than 1 second. If not meeting this goal has consequences (e.g. a refund or a fine), those consequences are captured in a Service Level Agreement (SLA) together with the related SLOs.
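A latency SLO like this is easy to check against measured data. A small sketch, with made-up sample latencies:

```python
def meets_slo(latencies_s, threshold_s=1.0, target=0.99):
    # True when at least `target` of the requests finished under `threshold_s` seconds
    within = sum(1 for t in latencies_s if t < threshold_s)
    return within / len(latencies_s) >= target

# 100 requests, one of them slow: exactly the 99% target is still met
print(meets_slo([0.2] * 99 + [3.0]))          # True
print(meets_slo([0.2] * 98 + [3.0, 1.5]))     # False: only 98% under 1 second
```

Note that the target is a fraction of requests, not an average: a handful of very slow requests can break the SLO even when the mean latency looks fine.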
An SLO is a useful tool for discussing tradeoffs between availability, performance, and cost. Spending money to improve availability or performance is always an option; a relevant SLO will help determine whether that is the right thing to do.
The hard thing about these kinds of goals is how they are influenced by other components. If your component requires another component with a certain SLO to be available, then your component can never promise a better SLO than that dependency. This effect multiplies when there are multiple dependencies, which is often the case. The book calls this SLO inversion.
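The multiplication is easy to see with a quick calculation. Assuming failures are independent and every dependency must be up for your component to work, the combined availability is at best the product of the individual availabilities:

```python
def compound_availability(own, dependency_availabilities):
    # Best-case combined availability, assuming independent failures and
    # that every dependency must be up for the component to work.
    result = own
    for a in dependency_availabilities:
        result *= a
    return result

# A component that is itself 99.9% available, with three 99.9% dependencies:
print(round(compound_availability(0.999, [0.999, 0.999, 0.999]), 4))  # 0.996
```

So even with every individual part at "three nines", the whole only reaches about 99.6%, which is why you cannot simply promise the SLO of your best-behaved component.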
Now, you could determine a realistic SLO by looking at all the integration points of your component, as well as its own expected availability based on hardware and connectivity. However, it makes more sense to take the opposite approach: first determine an SLO, then make changes (if necessary) to your infrastructure to meet it. This is also the approach described in Google's book on Site Reliability Engineering, which stresses that you should not "pick a target based on current performance".
If your application is not available, chances are you are losing money or happy customers. The advice in this book reduces the chance of downtime and shortens its duration when it does happen. The book contains an incredible amount of good advice, presented in a way that makes it easy to start improving things right away.
It's one of the most useful books I've read; I would definitely recommend you check it out!