Originally published on Failure is Inevitable.
Downtime costs more than dollars. It also costs customer happiness and trust. So how do teams maximize for reliability while scaling? Tooling, communication, observability, and more all play into a complete reliability strategy.
In a recent industry leaders’ roundtable hosted by Blameless, top experts discussed best practices for responding to incidents, scaling for reliability, and how to engineer with the customer in mind. The panel members included:
- John Fiedler, VP Engineering, Blameless
- Salman Bhatti, VP Engineering, Citrix
- Kelly Dodd, SRE, Greenlight Financial
- Renato Nascimento, SRE, Incognia
Below are a few key insights from their conversation.
- Scaling for reliability and security: The ability to fail over and the use of blue-green deployments are key. Additionally, shifting reliability and security left in the SDLC plays a big part.
- Balancing speed and reliability: Achieve this with a four-pronged approach focusing on platform stability, fast and small releases, observability, and service ownership.
- Leveraging IaC and focusing on building trust: Using IaC to standardize gives developers a clean field to build on. It’s also important to be transparent in both external and internal communications to build trust.
- Communication, postmortems, and action item tracking: As tech stacks evolve, it’s important that tooling helps you maximize for communication, learning, and prevention.
- Creating a single source for SLOs: It’s important to align on a tool to serve as the central location for SLIs and SLOs. SLOs help teams keep an eye on metrics. They also drive conversations around prioritization.
- Developing roles and runbooks: It’s important to harden roles and responsibilities to keep each teammate focused on the task at hand. These tasks should be described in comprehensive runbooks.
- Interpret your metrics and challenge assumptions: Metrics on their own don’t communicate customer needs. Teams must interpret these metrics to challenge assumptions.
What are some of the key initiatives that your teams are investing in for a high-scale, rapid product delivery lifecycle?
Scaling for reliability and security
As companies grow, it’s important to scale with demand. Salman Bhatti speaks on his experience with this at Citrix. “We're constantly investing in improving our ability to scale as our customers grow. The biggest thing right now is managing scale and reliability at the same time. How do you support a growing number of customers, especially during this timeframe where we've seen an explosive growth in the use of our products due to COVID-19?”
Citrix is achieving this by creating incident response procedures as well as blue-green deployments, a technique of running two identical production environments. “When we have problems, we don't have time to dig into the details and figure out what went wrong. We need to get back up and running as fast as possible. Being able to fail over and having a blue-green deployment are important. You don't have the luxury of time. We never had it before, but it's magnified now.”
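To make the blue-green idea concrete, here is a minimal sketch, not Citrix's actual setup. It assumes two identical environments and a router that sends all traffic to one of them. The names `Environment`, `BlueGreenRouter`, and the `healthy` probe are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Environment:
    name: str                    # "blue" or "green"
    version: str                 # release currently deployed there
    healthy: Callable[[], bool]  # hypothetical health probe


class BlueGreenRouter:
    """Sends all traffic to one of two identical environments."""

    def __init__(self, blue: Environment, green: Environment) -> None:
        self.envs: Dict[str, Environment] = {"blue": blue, "green": green}
        self.live = "blue"  # assumption: blue serves traffic first

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def cut_over(self) -> bool:
        """Switch traffic to the idle environment if it passes its health check."""
        if not self.envs[self.idle].healthy():
            return False  # keep serving from the current environment
        self.live = self.idle
        return True


router = BlueGreenRouter(
    blue=Environment("blue", "v1", healthy=lambda: True),
    green=Environment("green", "v2", healthy=lambda: True),
)
router.cut_over()  # release: traffic moves from blue (v1) to green (v2)
router.cut_over()  # failover or rollback: traffic moves back to blue
```

Because the previous environment stays intact after a cut-over, getting back up and running is a single switch rather than a redeploy, which matches the "no luxury of time" constraint Salman describes.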
Additionally, security is becoming an increasingly important initiative as Citrix grows. According to Salman, “Security becomes first and foremost when you've got people using a variety of endpoints to access their systems, your systems, and our customers’ systems.”
His team is focusing on shifting security left in the software development lifecycle. “Shifting left is now part of our development psychology. When you're building a system, is it inherently secure? That's something that we're thinking about now more so than ever before.”
Balancing speed and reliability
As companies grow, they also need to balance speed and reliability. Especially in a startup environment, delivering value to customers at a fast pace is key. Kelly Dodd of Greenlight Financial is highly familiar with these pressures.
“Anyone who's been at a startup that's growing quickly understands that you have to deliver features fast. Sometimes this is top of mind and you might have to come back and address reliability problems later. It can be difficult to look at what you've been building and say, ‘How do I make it reliable? How do I start shipping code safely?’”
Her team is working hard at answering these questions. This is especially top-of-mind as her company becomes larger. “Everything we do as an SRE team is in pursuit of fast and reliable releases. We've had a period of intense growth over the past year at Greenlight Financial. Our engineering team tripled in size. That can give you an idea of how our product is growing and needs to scale.”
Kelly described the four main initiatives her team is focusing on to meet these scaling demands:
- Platform stability: Make sure that all services are running in similar environments. If you're using Kubernetes, everything's in Kubernetes. Terraform as much as you can or codify in another preferred way; the important thing is to prioritize consistency. This uniformity helps de-risk the process around spinning up new environments.
- Fast and small releases: Invest in automated testing, as it is crucial for enabling continuous delivery. You can't manually test if you're shipping out every change to prod.
- Observability: Implement distributed tracing and other observability efforts, then check whether they are working by asking novel questions of your system and analyzing the accuracy of the results.
- Service ownership: Plug all this into on-call. When something breaks, make sure that someone who understands the product can tackle the problem — ideally the person who built the service.
Leveraging IaC and focusing on building trust
To achieve fast and reliable releases, it’s crucial to ensure environments are uniform. Renato Nascimento of Incognia noted this as well. “At Incognia, we’ve been focusing on Infrastructure as Code (IaC) to make sure our environments are uniform. They should have the same configuration, so developers are not playing on top of a flaky field.”
This is an important internal initiative; in parallel, Renato and his team at Incognia are also working on an equally important external initiative. The team is focused on making sure that external communication during incidents helps build and maintain trust with the customers.
“When customers know what's going on, they will feel informed and safe. We make sure that communication is clear. We also focus on making sure customers know that they can rely on us in terms of availability.”
John Fiedler, VP Engineering at Blameless, also agrees that communication is key. “You can never over-communicate. Internally, it’s important to learn from postmortems. Externally, having that engagement and communication with your customers is a cornerstone of any business.”
How has your technology stack evolved over time to tackle key challenges as you scale? How does Blameless fit into your team's charters?
Communication, postmortems, and action item tracking
MTTR is a common metric teams are focused on reducing. Kelly notes that her team is homing in on processes to reduce MTTR. “We always quickly address problems in production. Working in FinTech, it's really important. You can't just shrug off downtime,” she said.
The Blameless platform plays a key role in helping Greenlight get systems up and running faster. “The biggest change Blameless solves for us is communicating to other parts of the company. Other stakeholders in engineering all know where to go to see whether there's an incident happening, what's going on in the incident, and what they can do. We find that Blameless helps us convey key pieces of information during the incident to other team members. That helps us reach a resolution a lot faster.”
After an incident is resolved, it’s important to feed the learnings back into the software development lifecycle. Bugs that need to be fixed should be prioritized and addressed in a timely fashion. Reliability-related issues need to be accounted for in sprint planning. Kelly says Blameless helps her team with this as well.
“The postmortems and tracking action items in Blameless have been very useful for us. We look closely at whether we are completing the things we say that we will. The fact that Blameless ties into our JIRA ticketing system has been very useful for us for tracking action items and looking at the categories of incidents over time. This helps us understand the key problems that we're facing. You think you have an intuition about what the flakiest parts of your system are, but you need to look at the numbers to actually find out. Blameless has been great for that.”
Knowing which work to prioritize
One of the most challenging issues in software development is understanding when to ship new features versus shore up reliability. SLOs are essential to providing clarity in this decision process. They serve as a data-driven conversation starter. Renato spoke about the importance of SLOs to his organization.
“SLOs are a great fit for our stack because it’s not only about having the metrics, but also having conversations around the metrics and continuously testing and evaluating them. This tells us exactly where to focus our time. Sometimes your team imagines that a given metric impacts the customer, but once you look at the metric and evaluate your targets, you learn that that goal wasn't important at all. We can free that time and move to something else. We’re making sure that we are not investing time in things that don’t make a difference to our customers.”
John Fiedler also noted how SLOs have been helpful for driving conversations in his past experiences. “I spent the last two years working with [Salesforce] Einstein in machine learning trying to define SLOs. SLOs were an extremely powerful conversation starter with product and engineering teams.”
In addition to SLOs, Renato and team have also been using Blameless to prioritize compliance work. “We've always been a company that enjoys experimenting. But lately, we've been trying to get to know our tools and platforms more deeply to gain the level of compliance that we need. We are hitting markets with higher regulatory standards and levels of compliance. And the industry is moving towards a place where compliance is more important. That need drove us to Blameless.”
Helping large teams get on the same page
For larger organizations, tech stacks can often be highly fragmented, with disparate teams using not just different tooling but also different processes. Important data and context are often spread across these tools, making them difficult to aggregate. At Citrix, the team is working to make the tech stack more tightly integrated and unified.
“We've had an evolution that is partly due to acquisitions, partly due to employing different strategies, and partly to being early in the cloud. We want to drive centralization of these tech stacks.”
Salman describes two main benefits to standardization:
- Consolidating licensing: This drives down software licensing costs, which helps gross margins and improves the ROI of existing and future tooling investments.
- Lower cognitive toil: Team members can move from one area of the business to another with less cognitive toil. As the tooling for most teams is consistent, new team members can get onboarded faster.
Blameless is a key tool in accelerating Citrix’s tech stack centralization efforts. Citrix began its journey with Blameless to help standardize incident response procedures.
“[Before Blameless], incident response was done differently in each of the product lines. Some people use Slack for coordination. Other teams would use GoToMeeting. We used PagerDuty for paging. Previously, you'd have to hunt for the channel you're looking for and every team had a different way to do things. When you had an incident that hit multiple product lines or dependent platforms, you'd be lost.”
With Blameless, the team has been able to streamline roles and responsibilities, and build resilience within the system.
“With Blameless, we set roles and responsibilities, and instituted a culture of blameless postmortems where everyone comes in and understands that something went wrong but it's nobody's fault. We just need to understand where it failed and how we can build resilience in the right areas to improve the overall system.”
John also noted the importance of learning how to get all team members on the same page during an incident. “I used to learn from firefighters how to do incident command. It was like a new word, and now it's really become a standardization. We have an amazing product built around this. It's really changed chaos to calmness.”
Salman agrees. “[Now with Blameless], when people are on a call, there's no more, ‘who's doing what? What's my role?’ That's really helped us get organized. We've saved that time, and that directly relates to our downtime. So there's a direct business value impact beyond just getting organized.”
Additionally, there's customer value in resolving incidents faster. Salman explains the importance of this. “We want to have really small outages. We don't want to have big bang outages with 20-30 minutes or even hours of downtime. But if you have 10 or 12 sub one-minute outages, it's not a big deal. That's a metric that we're looking to measure and drive. Blameless provides the upfront work streams that are going to help get us there.”
Integrating SLOs with incident response
Our panelists also discussed next steps with Blameless, and what they’re excited to see in the future.
Renato discussed the importance of connecting his team’s incident response process with SLOs. He’s looking forward to being able to create an incident from an SLO alert. Additionally, he’d like to see his team begin setting SLOs for other valuable operational metrics such as incident completion. “When you have the metric, that's good. That's a source for conversation. But how you're going to operationalize that metric is where we’d like to take SLOs in the future,” Renato said.
John agrees that SLOs are top of mind for him and other software executives he has spoken with. “We have had great internal conversations. I love how the concept of SLO really goes outside and beyond the SRE team. You're pulling in product and you're pulling in business executives and it becomes a business use case.”
Creating a single source for SLOs
Salman noted that with Citrix’s fragmented tech stack, it can be difficult to unite on ideas and key business metrics across teams. Additionally, with a large number of observability and dashboarding tools available, it’s difficult to get visibility into all the SLOs for a service.
“You can build SLI/SLO dashboards in many different places. You can build them in New Relic, you can build them in Splunk, and teams have done that. But what we need is a single pane of glass where all our services and how we're doing are measured the same way. This helps make sure that the goalpost for success is the same for all teams, as we don't want to hold teams at different bars of availability. This also gives the teams the autonomy to do their downstream work to help improve those SLOs.”
As such, Blameless’ SLO tool is also central to Citrix’s SLO strategy. “Having SLOs operationalized, consistent, and available in a centralized location to support multiple tech stacks is really key for us. We want to drive that partnership with Blameless to be able to give feedback on what we see and how we're using it day in, day out.”
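As a rough illustration of measuring every service the same way, the sketch below applies one availability target to a set of SLIs. It is not the Blameless SLO tool or its API. The `ServiceSLI` type, the `slo_report` function, the service names, and the 99.9% target are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ServiceSLI:
    good_events: int   # e.g. successful requests in the measurement window
    total_events: int  # all requests in the same window


def slo_report(slis: Dict[str, ServiceSLI], target: float) -> Dict[str, bool]:
    """Hold every service to the same target and report whether it was met."""
    report = {}
    for name, sli in slis.items():
        # A service with no traffic in the window is treated as meeting the SLO.
        attained = sli.good_events / sli.total_events if sli.total_events else 1.0
        report[name] = attained >= target
    return report


# Hypothetical services and counts, for illustration only.
report = slo_report(
    {
        "login": ServiceSLI(good_events=99_950, total_events=100_000),
        "file-sync": ServiceSLI(good_events=99_100, total_events=100_000),
    },
    target=0.999,
)
print(report)  # {'login': True, 'file-sync': False}
```

Keeping the target as a single argument, rather than a per-service setting, mirrors the point about not holding teams to different bars of availability.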
Developing roles and runbooks
While Kelly and her team are also excited about SLOs, they’re more focused on incident management at this time. She discussed key areas of team focus that are critical to Greenlight’s operations strategy. “We need to work on hardening our incident roles. There's a lot of value in having, for example, a commander who's trained in that role, who knows that their role is to orchestrate and not necessarily to investigate.”
To help with this, runbooks are an important part of the strategy, one that Kelly and her team are looking forward to. “Runbooks are important for taking another step towards taking a human decision out of the incident process. You have written down what you need to do. You can see it in Blameless. Maybe it's tied into the type of incident that it is. Maybe even it's just a button that you click that does some kind of remediation. We're really excited about that next step.”
What metrics are most important for helping improve incident response and how do you know which map best to customer happiness?
Mapping to user journeys
It can be difficult to know exactly when an incident starts, and this can affect your MTTR metrics. One way to eliminate this confusion is to mark the start of the incident as the start of customer impact. That being said, understanding which metrics to measure for gauging the success of incident response can be difficult. Salman describes this journey.
“When we started doing MTTR and these types of metrics, we went a little too granular. We looked at a particular service and we started measuring how long it took for that service to come back and be up and running. But what we realized is that didn't necessarily mean that things were up and running for the customer. So instead, we switched to mapping critical user journeys and having those automated. That customer lens really helped. You need to have metrics or KPIs that are directly related to your customer experience. You're always looking to refine those and get better at it.”
He also notes that the worst way to have your incidents “start” is when your customers are calling you. Salman has been on teams that dealt with this challenge. “We would have some services where we didn't know our service was down until customer support started lining up and saying, ‘Hey, we're experiencing an issue here.’ That obviously identifies a gap in monitoring and alerting.”
To focus on alleviating customer pain, his teams at Citrix created a term called a “warm issue.” This is an issue where something's going wrong, and you receive alerts. It's not an outage yet. These issues veer away from the customer lens. They can also steer teams towards being a little bit too critical of their services. These “warm issues” never really turn into anything that a customer notices as far as degradation is concerned.
To avoid this, Salman recommends syncing the beginning of each incident with customer pain. “As soon as your customers start feeling pain, that's when you start the clock because that's when it matters to you the most, that's the right metric to measure. It's not obscured by anything else.”
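One way to apply this is to compute MTTR from the window of customer impact rather than from the first alert. The sketch below assumes each incident record carries those timestamps. The `Incident` fields and the sample times are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    first_alert: datetime            # when monitoring first fired (not used for MTTR)
    customer_impact_start: datetime  # when customers started feeling pain
    customer_impact_end: datetime    # when the issue was resolved for customers


def mean_time_to_resolve(incidents: List[Incident]) -> timedelta:
    """Average length of customer impact across incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (i.customer_impact_end - i.customer_impact_start for i in incidents),
        timedelta(),
    )
    return total / len(incidents)


incidents = [
    # Alert and customer impact begin together; impact lasts 30 minutes.
    Incident(
        first_alert=datetime(2021, 3, 1, 12, 0),
        customer_impact_start=datetime(2021, 3, 1, 12, 0),
        customer_impact_end=datetime(2021, 3, 1, 12, 30),
    ),
    # Alerts fire at 09:00, but customers only feel pain from 09:05 to 09:15.
    Incident(
        first_alert=datetime(2021, 3, 2, 9, 0),
        customer_impact_start=datetime(2021, 3, 2, 9, 5),
        customer_impact_end=datetime(2021, 3, 2, 9, 15),
    ),
]
print(mean_time_to_resolve(incidents))  # 0:20:00
```

In the second record, the five minutes between the first alert and the start of customer pain are left out of the clock, which is the distinction the "warm issue" discussion draws.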
Interpret your metrics and challenge assumptions
Renato and his team use metrics to challenge their assumptions about their system. “When we started to measure MTTR, we didn't have a grasp of what our timing was before. When you measure, you'll start to see things a lot clearer. Sometimes you have assumptions that aren't true. We thought our MTTR was lower than it actually was when we measured it.”
While measuring metrics like MTTR can be very helpful for creating a baseline, they aren’t perfect. Metrics are meaningless without humans to interpret them. Renato discussed his team’s views on this.
“We love metrics. It speaks to our souls. But metrics, for me at least, don't tell the whole story. So there are the quantitative aspects of it, but there are also the human aspects and interpretations. Metrics don't tell much if you don't interpret them.”
Kelly and her team also take a qualitative approach to metrics. In her experience, teams understand issues without needing metrics to confirm them. The most valuable insights into improvements actually come from postmortems. “Most real insights about what's hurting our incident response are brought up in this forum,” Kelly said. “The MTTR that we're measuring just serves as a way to prove to ourselves that we're improving, but I'm not sure it really drives it for us. It's more the other way round.”
Like Salman and Renato, she also vouches for focusing on customer impact as the basis of incidents. “This measurement is always going to be as close as you can get it. Is it one customer feeling it? Or is it 10% of your customers? Where do your alerts kick off? How much work do you put into determining that? Because what the customer is experiencing is what defines whether you're having an incident or not, that's when the MTTR starts.”
Salman, Kelly, Renato, and John’s discussion on reliability, scaling, and incident response surfaced many valuable insights and clarified four key takeaways:
- Reliability is always feature No. 1. Without it, all other features are rendered useless.
- Metrics are important. The caveat is that metrics are only meaningful when people work together to interpret them.
- Incidents begin and end with the customer. MTTR is measured from the moment customers start experiencing an issue to the moment the issue is resolved for them.
- Blameless can help. Blameless plays a key role in incident response, postmortems, and most importantly, conversations within all these teams.
If your team is looking to tackle similar challenges as Citrix, Greenlight Financial, and Incognia in resolving incidents faster, gaining insights into your metrics, and delighting customers, Blameless is here for you. Reach out to us for a demo or start your free trial today.