Reliability Best Practices - AWS Well-Architected Framework Study Guide

Foundational requirements are those whose scope extends beyond a single workload or project
AWS is responsible for providing sufficient networking and compute capacity
Service quotas (aka service limits) exist to prevent accidentally provisioning more resources than needed and to limit request rates on API operations to protect services from abuse
Monitor and manage these quotas for all workload environments
Ask:
- How do you manage service quotas and constraints?
- How do you plan your network topology?
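
As a minimal sketch of monitoring quotas, the function below flags any quota whose utilization has reached a warning threshold. The quota names, the numbers, and the 80% threshold are illustrative assumptions, not AWS defaults. In practice the limits and usage figures would come from a source such as the Service Quotas API rather than hard-coded dictionaries.

```python
def quotas_near_limit(usage, quotas, threshold=0.8):
    """Return {quota_name: utilization} for quotas at or above the threshold."""
    flagged = {}
    for name, limit in quotas.items():
        if limit <= 0:
            continue  # skip quotas without a usable limit
        utilization = usage.get(name, 0) / limit
        if utilization >= threshold:
            flagged[name] = round(utilization, 2)
    return flagged


# Hypothetical figures for one account and Region.
quotas = {"vpcs": 5, "elastic_ips": 5, "running_instances": 64}
usage = {"vpcs": 4, "elastic_ips": 2, "running_instances": 60}

alerts = quotas_near_limit(usage, quotas)
print(alerts)
```

Running the same check in every workload environment (development, test, production) matches the note above about managing quotas for all environments.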

SDKs take the complexity out of coding by providing language-specific APIs for AWS services
Distributed systems rely on communications networks to interconnect components, such as servers or services
Workloads must operate reliably despite data loss or latency in these networks
Components must operate in a way that does not negatively impact other components
Ask:
- How do you design your workload service architecture?
- How do you design interactions in a distributed system to prevent failures?
- How do you design interactions in a distributed system to mitigate or withstand failures?
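
One common way to withstand transient network failures is to retry with capped exponential backoff and jitter. The sketch below is an illustration under stated assumptions, not a prescribed implementation. The helper name `call_with_retries`, its default delays, and the choice of exceptions treated as transient are all hypothetical.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Call operation(); on a transient network error, back off and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients do not retry at the same instant.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))


# Simulated dependency that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_operation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


# The sleep function is replaced here so the example runs instantly.
result = call_with_retries(flaky_operation, sleep=lambda seconds: None)
print(result, attempts["count"])
```

Retries are only safe when the operation is idempotent, and capping both the attempt count and the delay keeps a struggling dependency from being flooded by its callers.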

Anticipate and accommodate changes to achieve reliable operation
Changes include those imposed on your workload (e.g., spikes in demand) and those from within (e.g., feature deployments and security patches)
Monitor the behavior of a workload and automate the response to KPIs
Ask:
- How do you monitor workload resources?
- How do you design your workload to adapt to changes in demand?
- How do you implement change?
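
Automating the response to a KPI can be sketched as a proportional scaling rule: capacity grows or shrinks by the ratio of the observed metric to its target, within fixed bounds. The function name, the CPU-style numbers, and the bounds are assumptions for illustration. A managed scaling service applies additional safeguards, such as cooldown periods, that are omitted here.

```python
import math


def desired_capacity(current, metric_value, target_value,
                     min_size=1, max_size=10):
    """Scale capacity in proportion to how far the metric is from its target."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    proposed = math.ceil(current * metric_value / target_value)
    # Clamp to the allowed range so a bad metric cannot scale without bound.
    return max(min_size, min(max_size, proposed))


# Hypothetical KPI: average CPU utilization with a 60% target.
scale_out = desired_capacity(current=4, metric_value=90.0, target_value=60.0)
scale_in = desired_capacity(current=4, metric_value=20.0, target_value=60.0)
print(scale_out, scale_in)
```

Rounding up with `math.ceil` favors slight over-provisioning, a conservative choice when availability matters more than cost.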

Be aware of failures as they occur and take action to avoid impact on availability
Take advantage of automation to react to monitoring data
Regularly back up your data and test your backup files
Test failure response on a regular schedule and ensure that such testing is also triggered after significant workload changes
Actively track KPIs, as well as the recovery time objective (RTO) and recovery point objective (RPO)
Ask:
- How do you back up data?
- How do you use fault isolation to protect your workload?
- How do you design your workload to withstand component failures?
- How do you test reliability?
- How do you plan for disaster recovery (DR)?
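
Tracking the RPO can be made concrete by measuring the data loss window: the time between the last backup completed before a failure and the failure itself. The sketch below assumes a hypothetical six-hour backup schedule and a simulated failure time. Both function names are illustrative.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, failure_time):
    """Time between the last backup before the failure and the failure."""
    usable = [t for t in backup_times if t <= failure_time]
    if not usable:
        return None  # no backup predates the failure
    return failure_time - max(usable)


def meets_rpo(backup_times, failure_time, rpo):
    """True if the data lost at failure_time stays within the RPO."""
    window = data_loss_window(backup_times, failure_time)
    return window is not None and window <= rpo


# Hypothetical schedule: backups at 00:00, 06:00, 12:00, and 18:00.
backups = [datetime(2024, 1, 1, hour) for hour in (0, 6, 12, 18)]
failure = datetime(2024, 1, 1, 17, 30)

window = data_loss_window(backups, failure)
ok_6h = meets_rpo(backups, failure, timedelta(hours=6))
ok_1h = meets_rpo(backups, failure, timedelta(hours=1))
print(window, ok_6h, ok_1h)
```

A six-hour schedule can satisfy a six-hour RPO but never a one-hour RPO, so the backup frequency has to be derived from the objective. An RTO check would instead measure how long a tested restore takes.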