Cover image by MichaelKirsh
For those of you not keeping track, a single AWS region, us-east-1, suffered a major outage last week. Many of the affected teams were on holiday for Thanksgiving in the US, which left plenty of red-rimmed eyes by the time the issues were resolved.
Some of the critiques have absolutely hit the mark, such as this one from Forrest:
Forrest Brazeal (14:18 PM - 28 Nov 2020): "In the wake of The Kinesis Incident, I’d love to see AWS commit to a full audit of their internal service dependency tree and related assumptions. This write up, great as it is, does not give confidence in the blast radius of future cascading failures being less severe." twitter.com/rchrdbyd/statu…

Quoted tweet from Richard H. Boyd needs a new Mac for election data (@rchrdbyd): "A detailed review of Wednesday's outage has been published. https://t.co/B5sxIpH9nl"
It's fair to say that a number of other critics are out of their depth and don't really understand the complexities involved. While I certainly don't understand the ins and outs of this failure, I think the following things are pretty clear:
- a single failure, no matter how severe, does not mean you shouldn't be using a particular cloud
- there is no simple way to completely decouple services, so there will always be cross-dependency issues
- you probably shouldn't host your status dashboard on your own services (does this mean the AWS status page should be hosted on Azure? yes.)