Companies like Google and Amazon share a lot of great content about their approach to certain technical problems. At re:Invent this year, Amazon announced the Amazon Builders' Library. This is a collection of articles that discuss the approach Amazon takes in their architecture and software delivery processes. Similarly, Google shared a great collection of lessons in Site Reliability Engineering in their free SRE book.
In this post, we'll go over 7 site reliability lessons we can learn from these two great resources.
Good monitoring is a fine art. This chapter from the Google SRE book is a single stop for everything you need to effectively monitor your systems. It covers the why, the what, and the how of monitoring. A snippet from this chapter is actually part of the pull request template of our monitoring repository, and it has helped me multiple times to reconsider how I wanted to monitor something:
When creating rules for monitoring and alerting, asking the following questions can help you avoid false positives and pager burnout:
- Does this rule detect an otherwise undetected condition that is urgent, actionable, and actively or imminently user-visible?
- Will I ever be able to ignore this alert, knowing it’s benign? When and why will I be able to ignore this alert, and how can I avoid this scenario?
- Does this alert definitely indicate that users are being negatively affected? Are there detectable cases in which users aren’t being negatively impacted, such as drained traffic or test deployments, that should be filtered out?
- Can I take action in response to this alert? Is that action urgent, or could it wait until morning? Could the action be safely automated? Will that action be a long-term fix, or just a short-term workaround?
- Are other people getting paged for this issue, therefore rendering at least one of the pages unnecessary?
Dealing with a high volume of alerts is really taxing. Besides the context switching, every alert is a new and potentially stressful situation that you have to evaluate. It's very common to start assuming the impact or cause of an alert, or to start ignoring alerts that trigger often. The Google SRE book chapter gives a very concrete notion of what counts as too much in this case:
We’ve found that on average, dealing with the tasks involved in an on-call incident—root-cause analysis, remediation, and follow-up activities like writing a postmortem and fixing bugs—takes 6 hours. It follows that the maximum number of incidents per day is 2 per 12-hour on-call shift.
Based on that, they state that most days should have zero incidents.
I've found that it's a really good exercise to go through and analyze all alerts of the last month. When you do that, you'll notice very quickly which alerts trigger often. You might also see patterns that are less obvious in the moment, for example certain alerts always triggering on Wednesday morning. Doing this, and then checking the alerts that triggered against the checklist above, will help improve the quality of your alerts.
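To make this exercise concrete, here's a small sketch of what such a monthly alert review could look like, assuming your paging tool can export alerts as (name, timestamp) pairs. The alert names and the log format here are made up:

```python
from collections import Counter
from datetime import datetime

# Hypothetical export from a paging tool: (alert name, ISO timestamp).
alerts = [
    ("HighLatency", "2019-12-02T09:14:00"),
    ("HighLatency", "2019-12-04T09:05:00"),
    ("DiskFull",    "2019-12-04T03:30:00"),
    ("HighLatency", "2019-12-11T09:12:00"),
]

# How often did each alert fire over the month?
by_name = Counter(name for name, _ in alerts)

# Count firings per (alert, weekday) to surface recurring patterns.
by_weekday = Counter(
    (name, datetime.fromisoformat(ts).strftime("%A"))
    for name, ts in alerts
)

for name, count in by_name.most_common():
    print(name, count)
```

Grouping by weekday (or by hour) is what surfaces the "always fires on Wednesday morning" kind of pattern that's easy to miss while you're on call.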
I really like the idea of simulating outages as a way to practice debugging. It's a safe way to learn more about how your systems work, and it can serve as a tool to better understand previous outages and what can be done to make it harder for things to break the same way again. Google does Disaster Role Playing as an onboarding tool, as a way to share knowledge between the different experience levels within their SRE group, and as a fun exercise.
Beyond role playing, you can also inject failures into your actual system in some way. This practice, combined with a hypothesis of what impact the failure will have, is called chaos engineering. There's plenty of tooling available nowadays that lets you make your application fail in any way you can think of. If you want to get started or learn how to do chaos engineering right, I highly recommend reading these blog posts by Adrian Hornsby.
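If you want a feel for the idea before reaching for dedicated tooling, failure injection can be sketched in a few lines. This is a toy application-level version with made-up names; real chaos experiments are usually run at the network or infrastructure level with proper tooling:

```python
import random

def inject_failure(rate, exc=ConnectionError):
    """Wrap a function so a fraction of its calls fail artificially.

    `rate` is the probability (0.0-1.0) that a call raises `exc`
    instead of running. A minimal fault-injection sketch.
    """
    def decorator(fn):
        def wrapper(*args, **kwargs):
            if random.random() < rate:
                raise exc("injected failure")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_failure(rate=0.1)
def fetch_user(user_id):
    # Stand-in for a real downstream call; now ~10% of calls fail.
    return {"id": user_id}
```

Running this with a hypothesis, for example "the fallback path will serve cached data when 10% of calls fail", is what turns plain failure injection into chaos engineering.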
It is tempting to try and make your systems "self-healing". However, in most cases, when a failure is ongoing, automated changes make it harder to understand what is going on, and they may actually make things worse. A common example is redistributing or resharding data when a node in a cluster drops out. The data transfer and load this causes might have a bigger negative impact than the missing node itself. When thinking about redundancy of infrastructure, make sure to also think about what needs to happen in case the redundant infrastructure fails. Although it's not always possible, ideally nothing has to happen.
I went to a couple of talks at AWS re:Invent in which Amazon engineers described how they architect systems to improve reliability. A related idea was often referenced there under the name static stability. AWS has published a nice article in the Amazon Builders' Library in which they explain how they apply static stability to EC2 and other AWS services.
When an application fails, this shouldn't bring other applications down too. There are several ways you can prevent cascading failures:
- Use circuit breakers. That means: stop calling a service when it appears to be failing. This way, you won't overload the services you depend on with requests.
- Use smart retry strategies. Retry with exponential backoff and jitter, and cap the number of retries, so a struggling dependency isn't hit by a synchronized retry storm.
- Add timeouts to requests. Failures are often caused by an application being overloaded with requests, which makes its responses slow. Without timeouts, every service that depends on this application will slow down as well.
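A minimal sketch of the first two ideas, with illustrative thresholds (none of this is taken verbatim from the Amazon or Google material):

```python
import random
import time

class CircuitBreaker:
    """After `max_failures` consecutive failures, stop calling the
    dependency for `reset_after` seconds, then allow one trial call."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open, not calling dependency")
            # Half-open: let one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

def backoff_with_jitter(attempt, base=0.1, cap=5.0):
    """Sleep duration for retry number `attempt`: exponential
    backoff, capped, with full jitter to desynchronize retries."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

For timeouts, most HTTP clients accept one directly, for example `requests.get(url, timeout=2)` in Python; the important part is to always set one.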
Shuffle sharding is another really interesting way to reduce the impact of a failure. In most examples, it's used to protect against malicious clients of an application. As a start, you can shard your clients into a couple of groups, with separate infrastructure running your application for each shard. Now, when a malicious client affects your application, only the clients assigned to that same shard will see the effects. This can greatly reduce the impact a single malicious client has.
Taking this one step further, you can put each client in two shards instead of just one. Now, a malicious client can bring down two shards. But the chance of other clients being assigned to exactly the same two shards is pretty small. It is likely that clients will see one of their shards fail, but a big group of clients can fall back on a shard that's still working.
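As a sketch of the idea: you can deterministically map each client to a combination of two workers by hashing its id. With 8 workers there are 28 possible pairs, so the chance that another client shares exactly your pair is only 1 in 28. The function names here are my own, and this is a simplification of the scheme Amazon describes:

```python
import hashlib
from itertools import combinations

def shard_for(client_id, workers, shard_size=2):
    """Deterministically assign a client to a combination of workers.

    Hashing the client id keeps the assignment stable across requests,
    which matters: a client must keep hitting the same shard for the
    isolation to work.
    """
    combos = sorted(combinations(workers, shard_size))
    h = int(hashlib.sha256(client_id.encode()).hexdigest(), 16)
    return combos[h % len(combos)]

workers = [f"worker-{i}" for i in range(8)]
print(shard_for("client-a", workers))  # a stable pair of two workers
```

With 8 workers and shards of size 2, that's C(8, 2) = 28 distinct shards, so a malicious client shares both of its workers with only ~3.6% of the other clients.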
If none of this made sense, be sure to read this blog post from the Amazon Builders' library. It does an excellent job at explaining and visualizing how it works.
In most cases, it's better to show stale data than no data. You don't want to mask failures from yourself, but you may want to mask them from your customers. An obvious place to apply this is wherever you're currently caching results. You can either invert the dynamic, having the service you call push results to you, or make the caching a bit smarter, keeping (separate) entries around for longer so you can fall back on them in case of failure.
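Here's a sketch of the "smarter caching" variant: keep entries past their freshness window, and serve the stale value only when the backing call fails. The TTLs and the API are illustrative:

```python
import time

class StaleCache:
    """Cache that serves stale entries when the backing call fails.

    Entries are "fresh" for `fresh_for` seconds and kept around for
    `keep_for` seconds as a fallback.
    """

    def __init__(self, fresh_for=60.0, keep_for=3600.0):
        self.fresh_for = fresh_for
        self.keep_for = keep_for
        self._store = {}  # key -> (value, stored_at)

    def get(self, key, fetch, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry and now - entry[1] < self.fresh_for:
            return entry[0]  # still fresh, no call needed
        try:
            value = fetch()
        except Exception:
            # Fetch failed: fall back to a stale entry if one remains.
            if entry and now - entry[1] < self.keep_for:
                return entry[0]
            raise
        self._store[key] = (value, now)
        return value
```

The `now` parameter is only there to make the behavior easy to test; in real use you'd let it default to the clock.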
Hopefully these tips have helped you think about how to make your systems more reliable. There are tons of really good resources when it comes to reliability; I shared a couple in this post already, so be sure to check those out!
I would love to hear your thoughts on all of this!