Originally published on Failure is Inevitable.
With complex architectures, gaining visibility into systems is becoming more difficult. Additionally, with the move to remote work, it’s more important than ever before to adapt to new modes of work such as asynchronous collaboration. So how do we adjust to these changing times?
In a CIO panel hosted by Lightspeed Venture Partners, industry experts came together to discuss these questions. Panel members included:
- Ashar Rizqi, CEO and Co-founder at Blameless
- Raj Dutt, CEO and Co-founder at Grafana Labs
- Kelsey Waters, Senior Director of Operations at Packet
Below are key insights from their conversation.
Shifting the concept of technical debt
Companies fear technical debt, and often view tackling technical debt as a major project that impedes other work. But in a healthy system, tech debt management should be less of a root canal, and more like a routine cleaning. “Part of the vision of modern software is that technical debt isn’t seen as an event, but rather as an ongoing non-event like burning down bugs,” Ashar said.
Kelsey agreed. She noted, “The concept of debt is a total misnomer. If you think of it as debt, then you'll think of it as an enemy to velocity. The reality is, if you view it as a regular course of operational business, it supports velocity.”
By taking the fear out of tech debt, companies can use it as an opportunity to support innovation while still optimizing for a reliable foundation.
Building a culture of resiliency
Resiliency isn’t only built into systems. It’s built into culture as well. It’s important to make sure your teams are safe psychologically and able to learn from mistakes.
“The next outage or major incident is right around the corner, and that's okay. That's the mindset that we need to be in,” Ashar said.
COVID-19 has accelerated the urgency of digitization, which has exacerbated the cost of digital disruptions such as incidents. This creates a tremendous amount of pressure and stress on socio-technical systems, including the people and teams who support digital services. This stress can lead to blame language. However, blame should be directed at the systems — such as guardrails and availability of data — rather than people.
As Ashar puts it, “Blame Jarvis, don't blame Tony Stark.”
Visibility empowers psychological safety. With observability, teams can gain deep insight into distributed systems and better understand gaps in tools and processes, moving away from the blame game.
Focusing observability efforts early on
Observability allows you to see into issues and learn from them in depth. Without an observability strategy, teams are left in the dark. To prevent this, Raj encourages organizations to focus on observability early and revise often. He reiterated the importance of finding the right tooling (open source or otherwise).
“With so many tools, vendors, and places where the right piece of data may be hiding, you want to create a seamless experience that allows people to switch between metrics, logs, and traces to get to the root of an issue.”
He also recommends that teams avoid vendor lock-in. “Own your observability strategy. Make the vendor and open source choices that fit your business needs. Do not let a single vendor dictate what that is to you.”
To begin implementing observability, Ashar also had some words of advice. He encourages teams to identify critical user journeys, then break them down into P1, P2, and P3 user flows. Teams must recognize the most crucial flows and implement observability based on these. For teams looking for a place to start, consulting with QA can help. It’s likely that many of these user flows are already documented.
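As a rough illustration of that prioritization, the sketch below ranks user flows by priority and lists the ones that still lack observability, most critical first. Everything in it is hypothetical: the `UserFlow` fields, the `next_to_instrument` helper, and the sample journeys are assumptions made for the example, not something the panelists described.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical priority tiers; a lower value means more critical."""
    P1 = 1
    P2 = 2
    P3 = 3


@dataclass(frozen=True)
class UserFlow:
    journey: str                # the critical user journey this flow belongs to
    name: str                   # the individual flow within that journey
    priority: Priority
    instrumented: bool = False  # True once observability is in place


def next_to_instrument(flows):
    """Return flows that still lack observability, most critical first."""
    pending = [f for f in flows if not f.instrumented]
    return sorted(pending, key=lambda f: (f.priority, f.journey, f.name))


# Sample data, invented for illustration (e.g. pulled from existing QA docs).
flows = [
    UserFlow("checkout", "apply coupon", Priority.P3),
    UserFlow("checkout", "submit payment", Priority.P1),
    UserFlow("onboarding", "verify email", Priority.P2),
    UserFlow("login", "password sign-in", Priority.P1, instrumented=True),
]

for flow in next_to_instrument(flows):
    print(f"{flow.priority.name}: {flow.journey} / {flow.name}")
```

Keeping the list in a simple, sortable structure like this makes it easy to revisit as flows are added or reprioritized.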
But what if you’re trying to implement observability later on? Kelsey had some insight on this. She said, “Think about the decisions you need to make and what data you need to drive those decisions. Then find a source for that data and display it in a tool that's fit for your purpose.”
She also warned people of the pitfall of being data rich and information poor. To combat this, she suggested that teams regularly assess their data and their observability approach, and embrace refactoring when needed.
As she said, “Regardless of an early versus corporate environment, try to make it easy for people to refactor and change. Support them in doing that. This will result in a reliable and observable environment with high resilience.”
Documenting and communicating
As recent events have pulled forward remote work, Kelsey described the need to support seamless collaboration through process and tooling. “In a very remote world, the way we communicate, interact, document, and make information available is important.”
With many people working asynchronously, documentation serves as much-needed context. It is essential to write things down, and even more so when team members can no longer simply meet and hash things out in a conference room.
Yet, even with documentation, teams still need tooling to help with collaboration. Selecting the right tools for your team is essential. Apps like Slack, Zoom, Blameless, and more can help teams stay on the same page even if they are no longer in the same office.
What are emerging trends you're seeing in the DevOps and SRE space? How are they impacting investments in technology resources and practices?
Open source observability
One emerging trend Raj mentioned was open source software within the observability space. As he noted, “It used to be that open source tooling was the cheap alternative to commercial and SaaS vendors. Now, companies are choosing open source not for cost savings, but because that's where the cutting-edge capabilities exist.”
This also helps companies avoid vendor lock-in.
Adaptability and avoiding vendor lock-in
As times change, so does technology. Companies need the freedom and flexibility to adapt. One way to make sure you’re evolving with the times is to reassess how well your process and tooling meet changing needs.
As Kelsey stated, “There's no one tool, no panacea that I can choose and walk away… I need something that can be agnostic and handle complexity. I think operating in this complex world means taking frequent and regular stock of how data is aggregated, how information is presented, and how our process and tools work together.”
Ashar added that we should take the same approach in thinking of our architectures. In his experience, it's rare to see a truly pure microservices implementation. As a company grows, the architecture may start taking on some monolithic characteristics. He believes that this is normal, and we should design and plan around it rather than fight against it.
Eliminating cognitive toil
As Ashar said, “What we don't capture is the impact of cognitive toil. Whether it's on the developers or the operator, it’s not going away. What you're likely seeing is the cognitive burden of operating software shifting. Sometimes it'll shift to an ops team or from the ops team to the dev team. Part of your strategy needs to include an aspect of what you’re doing to improve and reduce that cognitive toil.”
As system complexity increases, so does cognitive toil. Teams are beginning to scratch the surface of cognitive toil costs. Ashar believes that teams should focus on important business problems rather than toil. One way to do this is to focus on automation strategies, such as runbooks, chatbots, or other automation tooling.
AIOps for the future
Kelsey also acknowledged the emerging trend of AIOps. “The investment to get closer to enabling AIOps is valuable. There's incremental value along that path. If you want to do AIOps, what you're doing now is investing in monitoring, dashboarding, collating that information, and working on presenting it in an actionable way. Then, when someone launches effective AIOps, you can connect those things.”
But AIOps won’t be a blanket fix. “While AIOps is interesting, and it's something that we're investing in, its place is to enhance the capabilities of SRE. It's not a panacea,” Raj said.
What have you learned in the last few months, and how will it continue to affect you?
Heightened risk for burnout
With new strategies and trends, it only makes sense that there will be new lessons learned along the way. Panelists also discussed learnings from COVID-19 on keeping systems up and running. One thing at the top of everyone’s mind was burnout.
As Ashar said, “Teams are working longer hours with fewer breaks and there's a collaboration toil that isn't accounted for. We're seeing people hitting that burnout point. One of the things that we've done is to make sure that we account for burnout relief.”
He suggests leaders make sure that it’s not always about deadlines and pushing hard, or else teams might lose sight of the most important thing: the team itself. He advises teams to build burnout relief into their planning. Team members need to get the time and space away from the keyboard that they need to be productive.
Remote work and increased flexibility
We’ve learned many lessons about remote work and the power of adaptability in the last six months. One thing that stood out to panelists as a valuable lesson learned was about hiring.
Ashar noted, “What we saw is that the talent pool and the richness of talent that became available just shot through the roof when we no longer had geographical constraints... We optimized our hiring processes. Our hiring funnel and pipeline efforts are more expansive because this is the new norm.”
Additionally, it’s important to allow for flexibility. With schedules upended and teams across time zones, it’s near-impossible for everyone to work the same hours, and teams and managers must adjust accordingly.
“There's a revolution for asynchronous work, for finding what time works for people to be productive and effective. That demand is coming into every single workday. Figuring out how to manage that as a leader is super important,” Kelsey said.
Turning off and tuning out
With this flexibility comes challenges. One such challenge is leaving work, when now many of us essentially live at work. As team members tackle changing business priorities, log extra hours and postpone vacations, many people are spending more time in the office despite not actually being in the office.
Kelsey discussed the importance of stepping away from work, even for a day. “It's become increasingly important for us to encourage people to take a day. Go to your backyard, go to your parents' house, find somewhere where you feel safe and get a little bit of downtime.”
Learning to thrive in this post-COVID world is an act of resilience in all of us. With the wisdom shared by these three panelists, we can learn how to make our systems and people more capable of withstanding all the bumps along the way. By adopting observability early and revising often, and focusing on documentation, communication, and (above all else) people, we can enter this new era stronger than before.