CodingBlocks
Software Reliability Engineering – Hope is not a strategy
It’s finally time to learn what Site Reliability Engineering is all about, while Jer can’t speak nor type, Merkle got one (!!!), and Mr. Wunderwood is wrong.
The full show notes for this episode are available at https://www.codingblocks.net/episode181.
Survey Says
So, DevOps is a culture, but SRE is a job title?
- Wait, what?
- Yeah, I get it.
- Meh.
Reviews
Thanks for the review “Amazon Customer”! (You, er, we know who you are.)
Site Reliability Engineering
- Site Reliability Engineering: How Google Runs Production Systems is a collection of essays, from Google’s perspective, released in 2016 … and it’s free. (sre.google)
- There’s a free workbook to go along with it too. (sre.google)
- But how is SRE as a career? (GlobalDots.com)
- Career Advancement Score (out of 10): 9
- Median Base Salary: $200,000
- Job Openings (YoY growth): 1,400+ (72%)
- These essays are what one company did, that company being Google.
- The book is told from the perspective of people within the company.
It is about scaling a business process, rather than just the machinery.
Site Reliability Engineering: How Google Runs Production Systems
- Their tale should be used for emulating, not copying.
- 40-90% of your effort is after you have deployed a system.
- The notion that once your software is “stable”, the easy part starts is just plain wrong.
- Yeah, but what is a Site Reliability Engineering role?
- It’s engineers who apply the principles of computer science and engineering to the design and development of computing systems, usually large distributed ones.
- It includes writing software for those systems.
- Including building all the additional pieces those systems need, i.e. backups, load balancers, etc.
- Reliability … the most fundamental feature of any product?
- Software doesn’t matter much if it can’t be used.
- Software only needs to be reliable “enough”.
- Once you’ve accomplished this, you spend time building more features or new products.
- SREs also focus on operating services on top of the distributed computing systems. Examples include:
- Storage,
- Email, and
- Search.
- Reliability is regarded as the primary focus of the SRE.
- The book was largely written to help the community as a whole by exposing what Google did to solve its post-deployment problems, as well as to help define what they believe the role and function of an SRE is.
- They also call out in the book that they hope the information in the book will work for small to large businesses. Even though they know small businesses don’t have the budget and manpower of larger businesses, the concepts here should help any software development shop.
However, we acknowledge that smaller organizations may be wondering how they can best use the experience represented here: much like security, the earlier you care about reliability, the better.
Site Reliability Engineering: How Google Runs Production Systems
- It’s less costly to implement the beginnings of lightweight reliability support early in the software process rather than introduce something later that’s not present at all or has no foundation.
- Who was the first SRE? Maybe Margaret Hamilton? (Wikipedia)
- The SRE way:
- Thoroughness,
- Dedication,
- Belief in the value of preparation and documentation, and
- Awareness of what could go wrong, and the strong desire to prevent it.
Hope is not a strategy.
Site Reliability Engineering: How Google Runs Production Systems
Chapter 1 – Introduction
- Consider the sysadmin approach to system management:
- The sysadmins run services and respond to events and updates as they happen.
- Teams typically grow as the capacity is needed.
- The skills of a product developer and a sysadmin usually differ, so they end up on different teams: a development team and an operations team (the sysadmins).
- This approach is easy to implement.
- Disadvantages of the sysadmin approach:
- Direct costs that are not subtle and are easy to see.
- As the size and complexity of the services managed by the operations team grows, so does the operations team.
- Doesn’t scale well because manual intervention for change management and process updates requires more manpower.
- Indirect costs that are subtle and often more costly than the direct costs.
- Both teams speak about things with different vocabularies (i.e. no ubiquitous language from back in the DDD days).
- Each team has different assumptions about risk and possibilities for technical solutions.
- Each team has different assumptions about target level of product stability.
- Due to these differences, these teams usually end up in conflict.
- How quickly should software be released to production?
- Developers want their features out as soon as possible for their customers.
- Operations teams want to make sure the software won’t break and be a pain to manage in production.
- A developer always wants their software released as fast as possible.
- An ops person would want to minimize the amount of changes to ensure the system is as stable as possible.
- This results in trench warfare between the two groups!
- Operations introduces launch and change gates, such as tests for every problem that’s ever happened.
- Development teams respond with fewer “launches” and more feature flags, and adopt tactics such as sharding the features so that fewer of them are beholden to the launch review.
What exactly is Site Reliability Engineering, as it has come to be defined at Google? My explanation is simple: SRE is what happens when you ask a software engineer to design an operations team.
Site Reliability Engineering: How Google Runs Production Systems
Google’s Approach to this Problem?
- Focus on hiring software engineers to run their products (not sysadmins).
- Create systems to accomplish the work that would have historically been done by sysadmins.
- SREs can be broken down into two main categories:
- 50-60% are Google software engineers, that is, people who were hired via the standard hiring procedure.
- 40-50% are candidates who were very close to the Google software engineer qualifications but didn’t quite make the original cut.
- Additionally, they had skills that are very valuable for SREs but less common in typical software engineers, like Unix system internals and networking knowledge.
- SREs believe in building software to solve complex technical problems.
- Google has tracked the career progress of the two groups and has found very little difference in their performance over time.
- Software engineers are, by nature, bored by repetitive work and are mentally geared towards automating problems away with software solutions.
- SRE teams must be focused on engineering.
- Traditional ops groups scale linearly by service size, hiring more people to do the same tasks over and over.
- For this reason, Google puts a 50% utilization cap on SREs doing traditional ops work.
- This ensures the SRE team has time to stabilize the software through automation.
- Over time, as the SRE team automates most of the tasks, their operations workload should shrink to a minimum as the software runs and heals itself.
- The goal is that the other 50% of an SRE’s time is spent on development.
- The only way to maintain those rates is to measure them.
- Google has found that SRE teams are cheaper than traditional ops teams, with fewer employees, because they know the systems well and prevent problems.
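Measuring the 50% cap can be as simple as tracking where the hours go. The sketch below is only an illustration: the category names (`ops`, `project`) and the function names are assumptions, not something the book or the show prescribes.

```python
from typing import Dict


def ops_share(hours_by_kind: Dict[str, float]) -> float:
    """Fraction of logged hours spent on operational work (hypothetical "ops" key)."""
    total = sum(hours_by_kind.values())
    if total == 0:
        return 0.0  # nothing logged yet, so nothing to flag
    return hours_by_kind.get("ops", 0.0) / total


def over_cap(hours_by_kind: Dict[str, float], cap: float = 0.5) -> bool:
    """True when ops work exceeds the cap (50% by default, per the notes above)."""
    return ops_share(hours_by_kind) > cap


# Example: 30 hours of ops work against 10 hours of project work is over the cap.
week = {"ops": 30.0, "project": 10.0}
print(ops_share(week), over_cap(week))
```

A team over the cap has a signal to push work back to the development team or to invest in automation.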
… we want systems that are automatic, not just automated.
Site Reliability Engineering: How Google Runs Production Systems
Challenges
- Hiring is hard and the SRE role competes with product teams.
- Pager duty!
- Requires developer skills as well as system engineering.
- This is a new discipline.
- Requires strong management to protect the budgets, such as stopping releases, respecting the 50% rule, etc.
One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.
Site Reliability Engineering: How Google Runs Production Systems
Tenets of SRE
- Availability
- Latency
- Performance
- Efficiency
- Change Management
- Monitoring
- Emergency Response
- Capacity Planning
Durable Focus on Engineering
- In order to keep time for project work, SREs should receive a maximum of 2 events per 8-12 hour on-call shift.
- This low volume allows the engineer to spend adequate time for accuracy, cleanup, and postmortem.
- More events than that means you have a problem to solve or more SREs to hire; fewer and you have too many SREs.
- Postmortems should be written for all significant incidents, whether paged or not.
- Non-paged incidents might be even more important since they can point to a hole in the monitoring.
- Cultivate a blame-free postmortem culture.
Max Change Velocity
- An error budget is an interesting way to balance innovation and reliability.
- Too many problems and you need to slow down and focus more on reliability; not enough problems and you’re probably gold plating.
- Ever have a manager push back on tech-debt? Maybe they aren’t aware of this balance? What can you do to quantify it?
- 100% uptime is generally considered to not be worth it: it gets more expensive the closer you get to the mark, and your customers generally don’t have 100% uptime themselves, so it’s wasteful.
- What is the right reliability number though? That’s a business decision.
- What downtime percentage will the users allow, based on their usage of the product?
- How critical is your service? Is there a workaround?
- How well does the experience degrade?
- What could a team do if there’s no room left in the budget?
- What if there’s too much?
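To make the error budget concrete, here is a minimal sketch that assumes the budget is measured as minutes of downtime over a rolling window. The function names and the 25% "slow down" threshold are hypothetical choices for illustration; the right numbers are a business decision, as noted above.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed in the window for an availability target such as 0.999."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        raise ValueError("a 100% target leaves no error budget")
    return 1.0 - downtime_minutes / budget


def release_policy(remaining: float) -> str:
    """Hypothetical policy: thresholds are illustrative, not from the book."""
    if remaining <= 0:
        return "freeze"     # budget spent: focus on reliability
    if remaining < 0.25:
        return "slow down"  # budget nearly spent: ship carefully
    return "ship"           # plenty of budget: keep releasing features


# A 99.9% target over 30 days allows about 43.2 minutes of downtime.
print(error_budget_minutes(0.999))
print(release_policy(budget_remaining(0.999, downtime_minutes=21.6)))
```

A number like this gives managers and developers a shared, quantified way to discuss the trade-off between features and tech-debt.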
Monitoring
- Monitoring is how to track the system’s health and availability.
- Classic approach was to have an alert get sent when some event or threshold is crossed.
- This is flawed though because anything that requires human intervention is, by its very definition, not automated and introduces latency.
- Software should be interpreting and people should only be involved when the software can’t do what it needs to do.
- Three types of valid monitoring:
- Alerts – a person needs to take immediate action.
- Tickets – a person needs to take action but not immediately. The event cannot automatically be handled but can wait a few days to be resolved.
- Logging – nobody needs to do anything. The logs should only be viewed if something prompts someone to do so.
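The three outputs above boil down to two questions: does a person need to act, and how soon? A small sketch of that decision, with hypothetical names (`Output`, `classify`) chosen for illustration:

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a person must act immediately
    TICKET = "ticket"  # a person must act, but it can wait
    LOG = "log"        # nobody needs to act; kept for later diagnosis


def classify(needs_human: bool, urgent: bool) -> Output:
    """Route a monitoring event to one of the three valid outputs."""
    if not needs_human:
        return Output.LOG
    return Output.ALERT if urgent else Output.TICKET


print(classify(needs_human=True, urgent=True))    # Output.ALERT
print(classify(needs_human=True, urgent=False))   # Output.TICKET
print(classify(needs_human=False, urgent=False))  # Output.LOG
```

Anything the software can handle on its own never reaches a person, which is the point of the approach.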
Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR).
Site Reliability Engineering: How Google Runs Production Systems
Emergency Response
- The best metric for determining effectiveness of an emergency response is the MTTR, i.e. how quickly things got back into a healthy state.
- People add latency. Even if there are more failures, a system that can avoid emergencies that require people to do something will still have higher availability.
- Thinking through problems before they happen and creating a playbook resulted in a 3x improvement in MTTR as opposed to “winging it”.
- On-call SREs always have on-call playbooks, and they also run exercises they dub the Wheel of Misfortune to prepare for on-call events.
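The claim that a system can fail more often and still be more available follows from the usual steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below uses made-up numbers purely to illustrate the effect of removing people from the repair path.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is healthy."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system A: fails rarely, but a person needs 2 hours to repair it.
human_repaired = availability(mttf_hours=1000, mttr_hours=2)    # roughly 99.80%

# Hypothetical system B: fails twice as often, but heals itself in 6 minutes.
self_healing = availability(mttf_hours=500, mttr_hours=0.1)     # roughly 99.98%

print(f"{human_repaired:.4%} vs {self_healing:.4%}")
```

Cutting MTTR with playbooks or automation can buy more availability than trying to make failures rarer.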
Change Management
- 70% of outages are due to changes in a live system.
- Best practices:
- Progressive rollouts,
- Quickly and accurately detecting problems, and
- Ability to rollback safely when something goes wrong.
- By removing people from the loop, the practices above help improve release velocity and safety.
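The three best practices fit together as a loop: roll out to a slice, check health, and roll back automatically if something looks wrong. This is a minimal sketch under stated assumptions: `deploy`, `error_rate`, and `rollback` are hypothetical callables standing in for whatever your deployment and monitoring tooling provides, and the 1% threshold is arbitrary.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    error_rate: Callable[[], float],
    rollback: Callable[[], None],
    max_error_rate: float = 0.01,
) -> bool:
    """Deploy to a growing fraction of traffic, rolling back on the first bad signal."""
    for fraction in stages:
        deploy(fraction)                   # progressive rollout
        if error_rate() > max_error_rate:  # quick problem detection
            rollback()                     # safe, automatic rollback
            return False
    return True


# Example with stand-in callables: the second stage reports a 50% error rate.
observed = iter([0.0, 0.5])
succeeded = progressive_rollout(
    stages=[0.01, 0.10, 1.0],
    deploy=lambda f: print(f"deploying to {f:.0%} of traffic"),
    error_rate=lambda: next(observed),
    rollback=lambda: print("rolling back"),
)
print(succeeded)
```

No person has to notice the problem or decide to revert, which is where the gain in velocity and safety comes from.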
Demand Forecasting and Capacity Planning
- Forecasting helps you ensure service availability and keep costs in check and understood.
- Be sure to account for both organic growth, i.e. normal usage, and inorganic growth, such as launches, marketing, etc.
- Three mandatory steps:
- Accurate organic forecast, extending beyond the lead time for adding capacity,
- Accurate incorporation of inorganic demand sources, and
- Regular load testing.
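As a rough illustration of the first two steps, the sketch below compounds organic growth month over month and adds one-off inorganic bumps, then asks when demand would exceed capacity. The compounding model, the function names, and all the numbers are assumptions for illustration; real forecasts need real data and regular load tests to validate the capacity figure.

```python
from typing import Dict, List, Optional


def forecast_demand(
    current: float,
    monthly_growth: float,
    months: int,
    inorganic: Optional[Dict[int, float]] = None,
) -> List[float]:
    """Demand per month: compounding organic growth plus lasting inorganic bumps."""
    inorganic = inorganic or {}
    forecast = []
    organic = current
    bump = 0.0
    for month in range(1, months + 1):
        organic *= 1.0 + monthly_growth     # organic growth, i.e. normal usage
        bump += inorganic.get(month, 0.0)   # launches, marketing, etc.
        forecast.append(organic + bump)
    return forecast


def months_until_exhausted(forecast: List[float], capacity: float) -> Optional[int]:
    """First month in which demand exceeds capacity, or None if it never does."""
    for month, demand in enumerate(forecast, start=1):
        if demand > capacity:
            return month
    return None


# Hypothetical numbers: 10% monthly growth and a launch in month 2 adding 50 units.
demand = forecast_demand(current=100, monthly_growth=0.10, months=6, inorganic={2: 50})
deadline = months_until_exhausted(demand, capacity=150)
lead_time_months = 3  # assumed time needed to add capacity
if deadline is not None and deadline <= lead_time_months:
    print(f"Capacity runs out in month {deadline}: order now.")
```

The forecast has to reach past the lead time; otherwise the shortfall is discovered too late to fix.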
Provisioning
- The faster provisioning is, the later you can do it.
- The later you can do it, the less expensive it is.
- Not all scaling is created equal. Adding a new instance may be cheap, but repartitioning can be very risky and time consuming.
Efficiency and Performance
- Since SREs are in charge of provisioning and usage, they are close to the costs.
- It’s important to maximize resources, which fundamentally affect the success of the project.
- Systems get slower as load is added, and slowness can also be viewed as a loss of capacity.
- There is a balance between cost and speed. SREs are responsible for defining and maintaining SLOs.
Resources we Like
- Links to Google’s free books on Site Reliability Engineering (sre.google)
- Why is SRE Becoming 2021’s Hottest Hire? (GlobalDots.com)
- How much money do SREs make? (Gremlin.com)
- Margaret Hamilton (software engineer) (Wikipedia)
Tip of the Week
- Don’t reinvent the wheel, if you’re in Java. Guava is a collection of utilities that solve common problems, courtesy of Google. (GitHub)
- From the mindset of RTFM: There are some interesting flags you can pass for git cherry-pick … and other tools you might use. (git-scm.com)
- You can use CTRL+NUM on Windows or CMD+NUM on macOS to navigate between tabs in Chrome. (support.google.com)