DEV Community

Alec Dutcher

Appendix: Reliability (Failure Management) - AWS Well-Architected Framework Study Guide

Return to Well-Architected Framework Guide

Appendix: Reliability

How do you back up data?

  • Identify and back up all data that needs to be backed up, or reproduce the data from sources
  • Secure and encrypt backups
  • Perform data backup automatically
  • Perform periodic recovery of the data to verify backup integrity and processes
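The last bullet, restoring data to verify the backup, can be illustrated with a small local sketch. It is not an AWS API example: the file names, the `back_up` and `verify_restore` helpers, and the use of SHA-256 checksums are all assumptions made for illustration. The idea is only that a backup counts as verified once a restore of it matches what was originally backed up.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dir: Path) -> tuple[Path, str]:
    """Copy source into backup_dir and return the copy plus the source checksum."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    backup_path = backup_dir / source.name
    shutil.copy2(source, backup_path)
    return backup_path, sha256_of(source)


def verify_restore(backup_path: Path, restore_dir: Path, expected: str) -> bool:
    """Restore the backup to a scratch directory and compare checksums."""
    restore_dir.mkdir(parents=True, exist_ok=True)
    restored = restore_dir / backup_path.name
    shutil.copy2(backup_path, restored)
    return sha256_of(restored) == expected


def demo_backup_roundtrip() -> tuple[bool, bool]:
    """Run one good restore and one restore of a deliberately corrupted backup."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "orders.csv"  # hypothetical data file
        source.write_text("id,total\n1,9.99\n")

        backup_path, checksum = back_up(source, root / "backups")
        good = verify_restore(backup_path, root / "restore", checksum)

        # Simulate silent corruption of the stored backup.
        backup_path.write_bytes(b"corrupted")
        bad = verify_restore(backup_path, root / "restore", checksum)
        return good, bad


if __name__ == "__main__":
    print(demo_backup_roundtrip())
```

Running a check like this on a schedule, rather than once, is what turns "we have backups" into "we know our backups restore."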

How do you use fault isolation to protect your workload?

  • Deploy the workload to multiple locations
  • Automate recovery for components constrained to a single location
  • Use bulkhead architectures to limit scope of impact
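One way to picture the bulkhead bullet is to partition customers into independent cells so that a failure in one cell only affects the customers assigned to it. The sketch below is a simplified illustration, not a description of any AWS service: the cell names, the `assign_cell` function, and hashing the customer ID are assumptions.

```python
import hashlib

# Hypothetical cells; each would be an independent copy of the workload.
CELLS = ["cell-a", "cell-b", "cell-c", "cell-d"]


def assign_cell(customer_id: str, cells: list[str] = CELLS) -> str:
    """Map a customer to one cell using a stable hash of the customer ID."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def impacted_customers(
    customers: list[str], failed_cell: str, cells: list[str] = CELLS
) -> list[str]:
    """Return only the customers whose cell has failed."""
    return [c for c in customers if assign_cell(c, cells) == failed_cell]


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(1000)]
    hit = impacted_customers(customers, "cell-b")
    print(f"{len(hit)} of {len(customers)} customers affected by a cell-b failure")
```

Because the assignment is deterministic, a given customer always lands in the same cell, and the scope of impact of a single cell failure is bounded by that cell's share of customers.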

How do you design your workload to withstand component failures?

  • Monitor all components of the workload to detect failures
  • Fail over to healthy resources
  • Automate healing on all layers
  • Use static stability to prevent bimodal behavior
  • Send notifications when events impact availability

How do you test reliability?

  • Use playbooks to investigate failures
  • Perform post-incident analysis
  • Test functional requirements
  • Test scaling and performance requirements
  • Test resiliency using chaos engineering
  • Conduct game days regularly

How do you plan for disaster recovery (DR)?

  • Define recovery objectives for downtime and data loss
  • Use defined recovery strategies to meet the recovery objectives
  • Test the disaster recovery implementation to validate that it works
  • Manage configuration drift at the DR site or Region
  • Automate recovery
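The recovery objectives in the first bullet are usually expressed as a recovery time objective (RTO, the maximum acceptable downtime) and a recovery point objective (RPO, the maximum acceptable window of data loss). A minimal sketch of checking a DR test against them follows. The `DrillResult` fields and the example timestamps are assumptions for illustration, not measurements from a real drill.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss window


@dataclass
class DrillResult:
    outage_start: datetime
    service_restored: datetime
    last_recoverable_data: datetime  # e.g. time of the last usable backup


def evaluate(drill: DrillResult, objectives: RecoveryObjectives) -> dict:
    """Compare measured downtime and data loss against the objectives."""
    downtime = drill.service_restored - drill.outage_start
    data_loss_window = drill.outage_start - drill.last_recoverable_data
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "meets_rto": downtime <= objectives.rto,
        "meets_rpo": data_loss_window <= objectives.rpo,
    }


if __name__ == "__main__":
    objectives = RecoveryObjectives(rto=timedelta(hours=1), rpo=timedelta(minutes=5))
    drill = DrillResult(
        outage_start=datetime(2024, 1, 1, 12, 0),
        service_restored=datetime(2024, 1, 1, 12, 45),
        last_recoverable_data=datetime(2024, 1, 1, 11, 50),
    )
    print(evaluate(drill, objectives))
```

In this made-up drill, service came back within the one-hour RTO, but the last usable backup was ten minutes old, so the five-minute RPO was missed. That kind of gap is what a DR test is meant to surface.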
