I was reading a paper Google published on research they did in 2019 into how their engineers debug production issues.
Some interesting points:
- SREs (Site Reliability Engineers) and SWEs (Software Engineers) approach incident debugging differently. SWEs are more likely to dig into logs, whereas SREs follow a more generic incident-response approach, looking for common failure patterns across service health metrics.
- Depending on their experience level (@Google), people used different tools. Newer engineers tended to reach for the newer, fancier tools, while engineers who had been around for a long time stuck to the legacy tools they trusted.
- The underlying causes of incidents (excluding security and data-correctness issues) resonate with what I have seen at Booking:
    - Code changes
    - Configuration changes
    - Dependency issues
    - Infrastructure issues
    - External traffic issues
- The paper also breaks down the building blocks of a typical debugging journey.
- They mention that mitigations often make things worse. Many of the incidents I was part of at Booking were cases where a local mitigation inadvertently caused a much bigger outage, so this resonates strongly with my experience. They also mention that most investigations are breadth-first searches, both across the systems (and underlying systems) that could be causing the issue and into what exactly is wrong. They go on to describe some incident stories: one close to the best-case scenario, and another where the tooling failed the user. Key lesson: do not roll out changes while there is an ongoing incident.
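The breadth-first investigation pattern described above can be sketched as a BFS over a service dependency graph: check every direct dependency of the affected service before descending deeper into the stack. This is a minimal illustration; the service names and graph are made up, not from the paper.

```python
from collections import deque

# Hypothetical service dependency graph: each service maps to the
# services it depends on. All names here are illustrative.
DEPENDENCIES = {
    "frontend": ["search", "checkout"],
    "search": ["index", "cache"],
    "checkout": ["payments", "cache"],
    "index": [],
    "cache": [],
    "payments": [],
}

def bfs_investigation(start):
    """Visit dependencies level by level, mirroring a breadth-first
    incident investigation: every direct dependency is examined
    before going one layer deeper."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        service = queue.popleft()
        order.append(service)
        for dep in DEPENDENCIES.get(service, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return order

print(bfs_investigation("frontend"))
# ['frontend', 'search', 'checkout', 'index', 'cache', 'payments']
```

Note that `cache` appears only once even though two services depend on it, which is exactly why you want to track what you have already ruled out during an incident.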
- Key principles to mitigate service problems faster
- Establish SLOs and accurate monitoring
- Triage effectively to find the blast radius and who you need to communicate with.
- Mitigate early.
    - Apply established strategies for common issues (errors, performance, capacity)
    - Know your dependencies.
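To make the first principle concrete, here is a minimal sketch of an SLO error-budget check, assuming a simple success-rate SLO over a request count. The 99.9% target and the request numbers are made-up illustrative values, not from the paper.

```python
# Assumed success-rate SLO: 99.9% of requests in the window succeed.
SLO_TARGET = 0.999

def error_budget_remaining(total_requests, failed_requests):
    """Return the fraction of the error budget still unspent
    for the window (0.0 means the budget is exhausted)."""
    allowed_failures = total_requests * (1 - SLO_TARGET)
    if allowed_failures == 0:
        return 0.0
    return max(0.0, 1 - failed_requests / allowed_failures)

# 1,000,000 requests allow ~1,000 failures at 99.9%; 400 failures
# leaves roughly 60% of the budget.
print(error_budget_remaining(1_000_000, 400))  # ~0.6
```

Monitoring that alerts on the budget burn rate, rather than on raw error counts, is one common way to make "accurate monitoring" actionable.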
- Debugging microservices requires adequate service architecture documentation and the ability to traverse the stack quickly.
By the way, you should follow me on Twitter.