Written by Manuel Vogel.
This blog post is part of a series about the [AWS Well-Architected Framework], what it is, why it makes sense, and how we at kreuzwerker do it. In this entry, we will focus on the Reliability Pillar.
What it is - A quick recap
Using their architects' and clients' collective knowledge and experience, AWS is continuously working on the Well-Architected Framework, which consists of key concepts, design principles, and best practices for architecting and running workloads in the AWS Cloud. AWS developed the framework to understand what makes some customers succeed in the cloud while others fail. They also wanted to identify common problems, decisional and architectural patterns, and anti-patterns. In other words, the aim was to determine what is Well-Architected and what is not, and to make this knowledge available to all, regardless of whether someone is just considering migrating to the cloud or is already running thousands of workloads there.
The Well-Architected Framework is built on six pillars:
- operational excellence 👨🏽💻
- security 🔒
- reliability 💪🏾
- performance efficiency 🚀
- cost optimization 💵
- sustainability 🌳
The AWS Well-Architected Review process provides a consistent approach for customers and partners to evaluate architectures and implement scalable designs. It is based on the previously mentioned six pillars.
It's important to note that the Well-Architected Review is not an audit. It's nothing to be afraid of; there are no penalty points for not getting things right the first time. A Well-Architected Review is a way of working together to improve your architecture. The process leads through several foundational questions and checks. It has been derived from years of experience working with the AWS cloud regarding security, cost efficiency, and performance. Hence, it provides sound advice on improvements. It helps you to build secure, high-performing, resilient, and efficient infrastructure for your applications and workloads.
The hard facts about AWS Well-Architected reviews in 2022 are:
- it consists of 58 questions in total across all pillars
- it takes around 4-6 hours for one workload (without tool support)
- the goal is to remediate 45% of the high-risk findings with a minimum of 20 questions answered.
We describe the process from our perspective in more detail here.
How we do it at kreuzwerker
Why should you do it with us?
How do we perform such a review?
For us, it's an interactive process: we inspect and adapt every time we do it by requesting feedback from our clients and doing a short internal retrospective. As of now, we perform it as follows:
- We do it in two blocks, from 09:00 to 12:00 and from 13:00 to 15:00, with a lunch break in between. However, we can always adapt: if we are faster, for example, we shift the break. We are also flexible about doing it remotely or at your office.
- We do it in an interactive, story-telling mode. This means: you talk, we listen, and then dig deeper into specific areas while being able to cover multiple questions.
- Our process is supported by tools (more in the other part of the blog post series 🥳)
We do not just handle the questioning but also give guidance on answering the questions. We can tell you how and why improvements could be made.
The Reliability Pillar is about the ability of a workload to perform its intended function correctly and consistently. This includes operating and testing the workload through its total lifecycle.
In a nutshell
We all want our workloads to be reliable, to be available 99.9...9% of the time, and to avoid failure. And if failures do occur, we want to handle them gracefully. As Netflix puts it:
The best way to avoid failure is to fail constantly
Achieving this is all about the foundations, how you architect your workload, how you apply and monitor changes, and how your workload detects and handles failures. It depends on three concepts:
- Resiliency is the ability to recover from infrastructure or service disruptions, dynamically acquire resources to meet demand, and mitigate disruptions, such as network issues. In the cloud, everything can fail all the time, and the more loosely coupled your architecture is, the more it needs to handle network issues, like timeouts, high latency, etc.
- Availability is the percentage of time that a workload is available for use. For example, an availability of 99.999% allows a maximum unavailability of only about 5 minutes per year. Dependencies add up as well: with three services in the request chain, each with an availability of 99.99%, the overall availability drops to 99.97%.
- Disaster Recovery Objectives define your recovery strategy in the event of a disaster. Two metrics are important here: the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). And, of course, so are the costs, which rise the lower you want the RPO and RTO values to be, as the following graphic illustrates:
After summarizing this pillar from our point of view, let's talk briefly about the design principles that navigate us through each pillar.
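To make the availability figures from the summary above tangible, here is a minimal Python sketch. It assumes a 365-day year and a request chain in which every dependency must be up for a request to succeed (a simple serial model). The function names are illustrative and not part of any AWS tooling.

```python
# Minutes in a non-leap year: 365 days * 24 hours * 60 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum downtime per year that a given availability still allows."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def serial_availability(*availability_percents: float) -> float:
    """Availability of a request chain in which every dependency must be up."""
    total = 1.0
    for percent in availability_percents:
        # Independent dependencies in series multiply their availabilities.
        total *= percent / 100
    return total * 100


if __name__ == "__main__":
    # "Five nines" leave roughly five minutes of downtime per year.
    print(f"{downtime_minutes_per_year(99.999):.2f} minutes")  # 5.26 minutes
    # Three chained services at 99.99% each.
    print(f"{serial_availability(99.99, 99.99, 99.99):.2f} %")  # 99.97 %
```

The serial model is deliberately pessimistic: redundancy, caching, or graceful degradation can raise the effective availability above the plain product.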
All pillars have their design principles, and they guide us through them. For the reliability pillar, they are as follows:
- Automatically recover from failure: By monitoring a workload for key performance indicators (KPIs), you can trigger automation when a threshold is breached. These KPIs should be a measure of business value and technical aspects. This is the basis for automatically notifying about and tracking failures, and for setting up automated recovery processes that work around or repair the failure. If you think one step further, then with more sophisticated automation, it’s possible to anticipate and remediate failures before they occur.
- Test recovery procedures: You can and should test how your workload fails and validate your recovery procedures in the cloud. Typically most companies don't do this. You can use automation, e.g., in pre-production environments, to simulate different failures or recreate scenarios that led to failures in the past. This proactive approach exposes failure pathways that you can test and fix before an actual failure scenario occurs.
- Scale horizontally to increase aggregate workload availability: Replace one large resource with multiple small resources to reduce the impact of a single failure on the overall workload. Distribute requests across multiple, smaller resources to ensure they don’t share a common point of failure.
- Stop guessing capacity: A common cause of failure in on-premises workloads is resource saturation, when demand exceeds that workload's capacity. This can happen, for example, in the case of a denial-of-service attack, and scaling on premises is not easy. In contrast, you can monitor demand and workload utilization in the cloud and automate the addition or removal of resources. So it is possible to maintain the optimal level to satisfy demand without over- or under-provisioning, for example, based on specific utilization metrics. There are still limits: some quotas can be controlled, others can be managed, and still others are unchangeable (see Manage Service Quotas and Constraints).
- Manage change in automation: Changes to your infrastructure should be made using automation and an IaC tool such as the AWS CDK or Terraform. The changes that need to be managed include changes to the automation itself, which can then be tracked and reviewed, for example, in a VCS such as Git and a service such as GitHub, with a branching model such as git-flow.
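The principles "Scale horizontally" and "Stop guessing capacity" can be illustrated with a small Python sketch. It assumes a simplified proportional scaling rule bounded by a minimum and a maximum fleet size (the maximum standing in for a quota). This is not the exact algorithm of any AWS Auto Scaling policy, and all names and numbers are hypothetical.

```python
import math


def capacity_left_after_one_failure(instances: int) -> float:
    """Fraction of capacity that remains when one of N equal instances fails."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return (instances - 1) / instances


def desired_capacity(
    current_instances: int,
    observed_utilization: float,
    target_utilization: float,
    min_instances: int,
    max_instances: int,
) -> int:
    """Size the fleet so that average utilization lands near the target."""
    if current_instances < 1 or target_utilization <= 0:
        raise ValueError("need at least one instance and a positive target")
    # Proportional rule: 50% more load than targeted needs 50% more instances.
    raw = math.ceil(current_instances * observed_utilization / target_utilization)
    # Clamp to the allowed range; the upper bound models a quota or budget cap.
    return max(min_instances, min(max_instances, raw))


if __name__ == "__main__":
    print(capacity_left_after_one_failure(2))         # 0.5
    print(capacity_left_after_one_failure(10))        # 0.9
    print(desired_capacity(4, 90.0, 60.0, 2, 10))     # 6
    print(desired_capacity(8, 120.0, 60.0, 2, 10))    # 10
```

With two large instances, a single failure removes half of the capacity, while with ten small ones it removes a tenth. The last call shows why quotas matter: the rule would ask for 16 instances, but the cap holds the fleet at 10.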
The architectural improvement process includes understanding what you already have and what you can do to improve the current state of your workload architecture. You select targets for improvement, test and adapt them, and quantify your success. Afterward, you share what you have learned so that it can be replicated elsewhere, and then you repeat the cycle ♻️
- Setting the Foundations
  - Being aware of the service quotas is the first step. We notice that many clients only become aware of them when they first hit them. Our approach is to inform our clients, set up alarms that fire when usage comes close to a limit, and incorporate the quotas into their architecture decisions. For example, a multi-tenant architecture in which each client gets a separate S3 bucket runs into the limit of 1,000 buckets per account. So it might make sense to think about a proper prefix schema in a single bucket instead.
  - For the network topology, we recommend using Amazon Route 53, the AWS DNS service, and Amazon CloudFront. These services are protected by default by AWS Shield, the DDoS protection service. Furthermore, we also recommend using AWS Transit Gateway if we hear that the network is planned to be expanded or that multiple VPN and Direct Connect connections are planned.
- Rethink the Workload Architecture
  - We suggest the Amazon Builders' Library, which describes how Amazon builds and operates its software.
  - The next big topic is workload segmentation: what do the contracts look like? How tight or loose is the coupling? Do we see a possible future improvement in using services like SQS or EventBridge for a more event-driven architecture? How are the requests structured, e.g., are they idempotent? Are retries, backoff strategies, throttling, and timeouts in place, which are crucial in distributed systems?
  - How does your workload scale? Do you have a mechanism in place to perform load testing?
- Properly implement Change Management
  - A lot of clients have monitoring in place, but not all of them have adapted it to monitor what happens when changes occur. We ask which metrics they generate, how they aggregate them, whether they get alerted, and how. We also think tracing is crucial for quickly finding the root cause and location when failures occur. Most clients have never heard of the possibility of automated responses and remediation, such as Systems Manager (SSM) Automation. We create awareness and also add example implementations.
  - Some clients have runbooks in place. We find them crucial as they are well-defined responses and procedures for known events, such as deployments. Most clients have tests as part of their deployment pipeline. However, very few test how to roll back in case of failure. We point out such cases and provide solutions.
- Do Failure Management with grace
  - What runbooks are for change, playbooks are for failure: well-defined procedures for such cases. Additionally, we explain blameless post-mortems and create awareness for chaos engineering and game days.
  - Backups?! Are they in place, and if so, are they encrypted, and do you regularly test restoring them? Most clients cover the first two points. However, they have never tested the third. We tell them about the GitLab.com database incident in 2017, where 5 out of 5 backup mechanisms did not work, and they had to use a six-hour-old backup from a staging database.
  - We talk about Disaster Recovery (DR) and its different types, and define which one is the most suitable, balancing costs against RPO and RTO. With DR, it is the same as with backups: it only works if you regularly test it!
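The questions about retries, backoff strategies, and timeouts raised above can be made concrete with a minimal Python sketch of capped exponential backoff with full jitter. `TransientError`, `backoff_delay`, and `call_with_retries` are hypothetical names chosen for illustration, and the sketch assumes the wrapped operation is idempotent and enforces its own timeout.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def backoff_delay(attempt: int, base_seconds: float = 0.1,
                  cap_seconds: float = 5.0) -> float:
    """Upper bound of the wait after failed attempt `attempt` (0-based), capped."""
    return min(cap_seconds, base_seconds * 2 ** attempt)


def call_with_retries(operation: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry an idempotent operation with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Full jitter: wait a random time between 0 and the capped bound,
            # so that many clients do not retry in lockstep.
            sleep(random.uniform(0, backoff_delay(attempt)))
    raise ValueError("max_attempts must be at least 1")
```

Only retry operations that are safe to repeat. A non-idempotent request, such as a payment without an idempotency key, can be applied twice if the first attempt succeeded but its response was lost. The `sleep` parameter is injectable so that the waiting behavior can be tested without real delays.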
Based on the pillar principles and improvement process, our conclusion is:
- KPIs that measure business value need to be in place. However, we see that lots of clients are not aware of them.
- Most clients only become aware of service quotas when they hit them. Furthermore, back-off strategies and proper timeouts for service calls are often not in place.
- CI/CD pipelines are in place; however, rollbacks are not.
- Load testing: for example, you can use a prebuilt solution from AWS to generate load on your application and use Amazon Aurora's cloning feature to have a copy of your production data in a pre-prod environment.
- Backups and DR are sometimes implemented. However, they are not tested regularly or at all.
- Generally, we encourage clients to test for failure with principles such as chaos engineering.
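As a small illustration of the service quota point, the following Python sketch flags quotas whose usage is close to the limit and builds object keys for a single-bucket, per-tenant prefix schema. The 80% threshold, the `tenants/` prefix, and all names and usage numbers are assumptions made for this example. In a real setup, the usage figures would come from your monitoring rather than from hard-coded values.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class QuotaUsage:
    """Current usage of one service quota (hypothetical data holder)."""
    name: str
    used: int
    limit: int

    @property
    def utilization(self) -> float:
        return self.used / self.limit


def quotas_near_limit(usages: Iterable[QuotaUsage],
                      threshold: float = 0.8) -> List[str]:
    """Names of quotas whose utilization has reached the alarm threshold."""
    return [usage.name for usage in usages if usage.utilization >= threshold]


def tenant_object_key(tenant_id: str, relative_path: str) -> str:
    """Key for a tenant's object in one shared bucket instead of one bucket each."""
    if not tenant_id or "/" in tenant_id:
        raise ValueError("tenant_id must be non-empty and must not contain '/'")
    return f"tenants/{tenant_id}/{relative_path.lstrip('/')}"


if __name__ == "__main__":
    usages = [
        QuotaUsage("s3-buckets", used=850, limit=1000),  # limit as quoted in the text
        QuotaUsage("vpcs", used=2, limit=5),
    ]
    print(quotas_near_limit(usages))  # ['s3-buckets']
    print(tenant_object_key("acme", "/invoices/2022.pdf"))
    # tenants/acme/invoices/2022.pdf
```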
Take care, and the final words are: we're happy to perform an AWS Well-Architected Review with you and tackle those issues together.