Originally published on Failure is Inevitable.
When implementing SRE, almost every role within your IT organization will change. One of the biggest transformations will be in your Quality Assurance teams. A common misconception is that SRE “replaces” QA. People believe SLOs and other SRE best practices render the traditional role of QA engineering obsolete, as testing and quality shift left in the SDLC. This leads to QA teams resisting SRE adoption.
But QA teams can and should embrace the transformation that SRE can bring, as SRE elevates their role to a strategic partner in designing performant software and scalable practices. SRE removes silos from QA expertise, better aligning QA and engineering teams. Also, better prioritization and automation reduce the amount of toil QA teams face. In this blog post, we’ll break down how SRE transforms the role of QA, and highlight the improvements it brings for the team.
In his book Implementing Service Level Objectives, Alex Hidalgo explains how SLO implementation can affect QA. He describes six stages that we’ve summarized here:
- The engineers hear they have free rein, as long as they stay within the error budget. They start to skip the QA team’s processes.
- Now that their deployment has been sped up, engineers deploy too much code too quickly.
- The error budget is exhausted before they know it. The error budget policy is triggered, perhaps including a code freeze.
- The engineers refocus their efforts on improving the deployment process. This can involve better monitoring, automating rollbacks, slowing down rollouts, and setting up canaries.
- The code freeze is lifted and deploying resumes. Sometimes the backlog of deployments wipes out the error budget again immediately. Other times deployment improvements will slow the burn rate. Engineers look for a way to break the cycle.
- The engineers return to the old QA functions. Things like presubmit tests, dry runs, and traffic replay are added to the deployment cycle again. Working with the QA team, they rebuild a library of useful QA steps.
By reframing QA steps in the context of an error budget, you prove that each step is impactful. Engineers won’t see these tests as onerous because they allow engineers to keep writing new code.
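To make the cycle above concrete, here is a minimal sketch of how an error budget and its policy might be computed. The `ErrorBudget` class, the `deployment_policy` function, and the 25% "slow rollout" threshold are illustrative assumptions, not something prescribed by the book; a real policy would be agreed on by your own teams.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one SLO over one window."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests should be good
    total_requests: int    # requests observed so far in the SLO window
    failed_requests: int   # requests that violated the SLI in that window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means an untouched budget; 0 or below means it is exhausted.
        if self.allowed_failures == 0:
            return 0.0
        return 1 - self.failed_requests / self.allowed_failures


def deployment_policy(budget: ErrorBudget) -> str:
    """Assumed policy: freeze when exhausted, slow down when nearly spent."""
    if budget.remaining_fraction <= 0:
        return "freeze"
    if budget.remaining_fraction < 0.25:  # assumed threshold
        return "slow-rollout"
    return "deploy"
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed, so 1,200 failures would trigger the freeze described in the third stage.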
SRE teaches us that failure is inevitable. There will always be bugs and edge cases that QA cannot account for. You need to prioritize testing efforts and design tests to cover the most impactful areas. SLIs, or service level indicators, can help you identify them.
Error budgets and SLOs are based on SLIs. SLIs are based on the areas of your service that have the highest customer impact. When considering the value of a QA test, SLIs can provide very valuable context. Here is a process to evaluate QA tests with an SLI:
- Look at the monitorable data that each SLI is based on. This data can include latency of various services, the amount of traffic received, and more.
- Find tests that look at these metrics. These should be tests that look at the effect code has on the metrics. Consider the maximum possible impact the test could show the code having on each metric.
- Consider the impact that this maximum change would have on the error budget (in other words, on customer happiness) once the metrics are consolidated.
This allows you to see the worst-case scenario that the test could prevent. If there’s little potential impact to the error budget, consider removing the test from your arsenal. If you cannot connect a test to an SLI, it’s possible that you could be running more focused, impactful tests.
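The evaluation process above could be sketched roughly as follows. Everything here is an assumption for illustration: the `QATest` record, the idea of estimating a worst case as a count of bad events per SLO window, and the 5% threshold for keeping a test. In practice the estimates would come from your own monitoring data and retrospectives.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class QATest:
    """Hypothetical record describing one QA test."""
    name: str
    sli: Optional[str]           # SLI the test guards, or None if unconnected
    worst_case_bad_events: int   # estimated bad events per window if the
                                 # defect this test catches reached production


def budget_share_at_risk(test: QATest, allowed_bad_events: Dict[str, int]) -> float:
    """Fraction of the SLI's error budget the worst case would consume."""
    if test.sli is None or test.sli not in allowed_bad_events:
        return 0.0
    return test.worst_case_bad_events / allowed_bad_events[test.sli]


def triage(tests: List[QATest], allowed_bad_events: Dict[str, int],
           keep_threshold: float = 0.05) -> Dict[str, str]:
    """Label each test by the error budget impact it could prevent."""
    decisions = {}
    for test in tests:
        if test.sli is None:
            decisions[test.name] = "review: no SLI"
        elif budget_share_at_risk(test, allowed_bad_events) >= keep_threshold:
            decisions[test.name] = "keep"
        else:
            decisions[test.name] = "consider removing"
    return decisions
```

A test whose worst case would burn 40% of the budget is clearly worth its cost, while one that maps to no SLI at all is a prompt to look for a more focused test rather than an automatic deletion.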
Error budgets also allow you to design new tests. Look at major bugs that significantly depleted your error budget. Review incident retrospectives to see exactly where the bug originated. Consider what types of tests would have caught the bug before production. Better yet, given the impossibility of perfectly reproducing production scenarios in staging environments, build practices that enable you to safely test in production.
As you adopt SRE best practices, the actual function of testing is often adopted by the engineering team. QA then becomes responsible for the overall design and direction of testing. By using SLOs and SLIs, the goals of development and QA become more aligned and tests become more efficient.
Another way that SRE reduces the toil of QA is through automation. The SRE mentality is to automate wherever possible. QA teams have also always advocated for automated testing, but SRE elevates these practices in several ways.
An automated runbook is an SRE tool that provides a list of checks and steps for different circumstances. SREs automate their runbooks step by step, reducing the cognitive load on engineering. QA testing can also be formatted as a runbook or playbook. Instead of having each test be a standalone object, each step can be isolated and standardized. This library of steps can then be combined into new tests. As you automate, the steps become useful in a variety of situations.
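As a rough illustration of this runbook model, the sketch below treats each step as a small function that reads and writes a shared context, then combines the same steps into different tests. The step names, the simulated traffic numbers, and the `run_test` helper are all hypothetical; real steps would call your deployment and monitoring tooling.

```python
from typing import Any, Callable, Dict, List, Optional

Context = Dict[str, Any]
Step = Callable[[Context], None]


def start_canary(ctx: Context) -> None:
    # Placeholder for rolling the build out to a small slice of traffic.
    ctx["canary_started"] = True


def replay_traffic(ctx: Context) -> None:
    # Placeholder for replaying recorded requests; the numbers are simulated.
    ctx["requests"] = 1000
    ctx["errors"] = 2


def check_error_rate(ctx: Context) -> None:
    # Fail the test if the observed error rate exceeds the allowed maximum.
    rate = ctx["errors"] / ctx["requests"]
    if rate > ctx.get("max_error_rate", 0.01):
        raise AssertionError(f"error rate {rate:.3%} is too high")


def run_test(steps: List[Step], ctx: Optional[Context] = None) -> List[str]:
    """Run the steps in order and return the names of those that completed."""
    ctx = {} if ctx is None else ctx
    completed = []
    for step in steps:
        step(ctx)
        completed.append(step.__name__)
    return completed


# The same library of steps is recombined into two different tests.
PRESUBMIT_TEST = [replay_traffic, check_error_rate]
CANARY_TEST = [start_canary, replay_traffic, check_error_rate]
```

Because each step is isolated, automating one of them (for example, replacing the simulated replay with a real one) improves every test that uses it.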
In order to use this runbook model of testing, QA must be integrated into many areas within development and operations. The QA function shouldn’t be a siloed, black-boxed area of your organization. There must be more communication between teams than code going in and test results coming out. QA needs to work alongside development to understand their goals throughout the process.
By adopting SRE best practices, teams will develop this more strategic, integrative relationship. As engineering teams begin testing their own code, they’ll collaborate with QA to build testing runbooks. These runbooks will be able to draw from the necessary contexts and perform the necessary actions to automate fragile, manual processes.
As Alex points out in *Implementing Service Level Objectives*, QA teams may be concerned about losing their place within an organization. Alex emphasizes that QA skills and experience will be even more important.
Instead of executing tests in a silo, QA engineers will become involved in designing and directing testing through development.
- They’ll be empowered to communicate how engineers should implement testing in their processes.
- In developing policy around error budgets, they’ll come to the table with unique and valuable perspectives.
- By drawing on their experience of discovering bugs, they’ll be able to anticipate and build strategies to prevent potential sources of error budget depletion.
Alex also describes a cultural shift that occurs as QA is integrated into engineering. He says that “QA teams are often seen by engineers as ‘no’ teams or ‘roadblocks.’” QA is often “caught in the middle of the friction between engineering and operations.” But with SRE adoption, QA is elevated “from second-class roadblock to first-class partner.” Amy Tobey echoes this sentiment in a panel with Blameless. She suggests that SRE can “uplift” traditional QA teams by empowering them in “owning and nurturing the test spectrum, but extending that all the way out into production.”
The cultural lessons of SRE are centered around empathy and blamelessness. Instead of blaming individuals, incidents are viewed as opportunities, and people are encouraged to collaborate in addressing socio-technical challenges and improving resilience. A similar mentality applies to QA and engineering. Rather than testing specific pieces of code, QA can work with engineering on a systemic level to promote blamelessness.
Blameless can help you transform QA with our SLO and error budgeting tools. We take a vendor-agnostic approach and focus on the process of operationalizing SLOs in the context of other key reliability practices such as incident resolution, incident retrospectives, and more. To see how, check out our webinar on SLOs.