This article is the first entry in the SRE Advent Calendar 2022 on Qiita.
I'm a member of JAWS-UG (Japan AWS User Group), where I host the SRE branch with some great teammates.
As you know, SRE stands for Site Reliability Engineering. It originated at Google and is spreading widely as a practice in which software engineers solve the problems of system operations.
In the AWS Well-Architected Framework (W-A) you can find philosophies similar to SRE's, especially in the "Reliability Pillar." Let's dive deep into some of its best practices and look at them from an SRE point of view.
Over many years of running a huge cloud infrastructure, AWS has learned many practices for building cloud-native applications, and it publishes this knowledge as a whitepaper.
This is the Well-Architected Framework (known as W-A).
W-A has six pillars:
- Operational excellence
- Security
- Reliability
- Performance efficiency
- Cost optimization
- Sustainability (newly added in 2021)
Site Reliability Engineering is a practice that approaches the problems of operating IT systems from a software engineering perspective.
Google made it widely known by publishing the first "SRE book," which collects many lessons learned from running its huge infrastructure as a distributed system known as "Borg."
When you build your application with a microservices architecture, you should define an isolated contract for each API. Because each API has its own goals, limits, and other considerations, this lets you scale both the application and the teams building it.
For example, you can use Amazon API Gateway to define API-based contracts with an OpenAPI specification.
When we apply SRE practices to our applications, we may set SLOs for them. In that case we can build on this practice by setting a separate SLO for each API, which reduces the toil of maintaining them. You don't have to hold an API to a very high SLO when it has no strict requirements.
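The idea of per-API SLOs can be sketched in a few lines of Python. This is only a minimal illustration: the `ApiSlo` class, the `error_budget_remaining` helper, the API names, and the targets are all hypothetical and not taken from W-A or any AWS service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiSlo:
    """Hypothetical per-API availability objective."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def error_budget_remaining(slo: ApiSlo, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means nothing is spent, 0.0 means fully spent,
    and a negative value means the budget is overspent.
    """
    allowed_failures = (1.0 - slo.target) * total
    if allowed_failures == 0:
        # No traffic yet, or a 100% target: any failure overspends the budget.
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


# Two APIs with the same 0.5% failure rate but different objectives.
checkout = ApiSlo(name="checkout", target=0.999)
reports = ApiSlo(name="reports", target=0.99)

checkout_budget = error_budget_remaining(checkout, total=1000, failed=5)
reports_budget = error_budget_remaining(reports, total=1000, failed=5)
```

With the same failure rate, the stricter `checkout` API has overspent its budget while the looser `reports` API still has about half of it left, so only one of them needs attention.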
To avoid congestion when the system is in trouble, you should design your application to do a constant amount of work even in abnormal situations.
For example, if your application sends extra logs to clients only when it has detected unhealthy nodes, network traffic rises quickly and can easily cause a traffic storm. Instead, make it send the complete logs every time, even when all the nodes are healthy.
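The difference between the two approaches might look like the following sketch. The function names and the node data are hypothetical, chosen only to show how payload size behaves in each case.

```python
from typing import Dict, List


def build_full_report(nodes: Dict[str, bool]) -> List[dict]:
    """Constant work: one entry per node every time, healthy or not."""
    return [{"node": name, "healthy": ok} for name, ok in sorted(nodes.items())]


def build_delta_report(nodes: Dict[str, bool]) -> List[dict]:
    """Anti-pattern shown for contrast: the payload grows only when nodes fail."""
    return [
        {"node": name, "healthy": False}
        for name, ok in sorted(nodes.items())
        if not ok
    ]


# Hypothetical fleet: first all healthy, then all unhealthy.
calm = {"node-%d" % i: True for i in range(5)}
storm = {name: False for name in calm}

full_sizes = (len(build_full_report(calm)), len(build_full_report(storm)))
delta_sizes = (len(build_delta_report(calm)), len(build_delta_report(storm)))
```

The full report has the same size in both situations, so the load it puts on the network does not change during an incident. The delta report is empty in calm times and jumps to its maximum size exactly when the system is already struggling.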
In addition to the practices above, you should use "exponential backoff and jitter" when retrying after errors, so that retries do not cause congestion.
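A retry loop with exponential backoff and jitter can be sketched as follows. This is one common variant (often called "full jitter," where each wait is a random value between zero and the current backoff ceiling), not the only one. The function name and the default values for `base`, `cap`, and `attempts` are assumptions for illustration.

```python
import random
import time


def call_with_retries(operation, attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call operation(), retrying on failure with exponential backoff and full jitter.

    The ceiling doubles on each attempt (base, 2*base, 4*base, ...) up to cap,
    and the actual wait is drawn uniformly from [0, ceiling].
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Real code should catch only errors that are safe to retry.
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(cap, base * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
```

The randomness matters as much as the backoff: without jitter, many clients that failed at the same moment would all retry at the same moment as well, and the spikes would simply repeat at longer intervals.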
The anti-patterns shown in this article are interesting.
- Implementing Auto Scaling groups for automated healing, but not implementing elasticity.
- Using automatic scaling to respond to large increases in traffic.
- Deploying highly stateful applications, eliminating the option of elasticity.
To scale your workload elastically and automatically, you must make it stateless. The server nodes that make up an application used to be cared for like pets, but in modern application architecture we should treat them like cattle.
When you update your application, add nodes running the new version to the cluster instead of changing the existing nodes in place.
You can use deployment strategies such as "blue/green deployment" and "canary deployment." Among AWS services, Route 53 offers weighted DNS routing, which is useful for implementing these strategies.
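As a rough sketch, a canary shift with weighted routing could be prepared like this. The helper name `weighted_change_batch` and the host names are hypothetical. The dictionary follows the shape of the `ChangeBatch` argument of the Route 53 `ChangeResourceRecordSets` API, and the code only builds the request without sending it.

```python
def weighted_change_batch(record_name, blue_target, green_target, green_percent, ttl=60):
    """Build a Route 53 ChangeBatch that splits traffic between two record sets.

    Route 53 routes traffic in proportion to each record's weight (0-255),
    so weights of (100 - green_percent) and green_percent give a percentage split.
    """
    if not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be between 0 and 100")

    def record(identifier, target, weight):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": identifier,  # distinguishes records sharing a name
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {
        "Comment": "canary: %d%% of traffic to green" % green_percent,
        "Changes": [
            record("blue", blue_target, 100 - green_percent),
            record("green", green_target, green_percent),
        ],
    }


# Hypothetical host names: send 10% of traffic to the new (green) version.
batch = weighted_change_batch(
    "app.example.com", "blue.example.com", "green.example.com", green_percent=10
)
```

With boto3, a batch like this would be passed as `ChangeBatch` to `change_resource_record_sets` on a Route 53 client, together with the `HostedZoneId` of your zone. Raising `green_percent` step by step gives a canary release, and setting it back to 0 is the rollback.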
If you use immutable infrastructure for your application, you can scale it easily and roll it back quickly when something goes wrong.
We can find many practices similar to SRE's in the Reliability Pillar of AWS W-A. I also recommend reading the Operational Excellence Pillar, since it shares much of the essence of SRE, too :)