Hello there, my name is Pavel Pritchin, and I’m CTO at Dodo Engineering, part of Dodo Brands. My previous role was Head of SRE, and since then, the reliability of our IT system, Dodo IS, has been one of my responsibilities. Today I’d like to share the practices that help us ensure the stability of our system, along with some templates that anyone can use at their company.
Dodo Brands is a franchise business, and the IT system developed by the Dodo Engineering team is provided as software as a service to partners. The head company covers the cost of development and maintains the stability of the Dodo information system (Dodo IS). To ensure system reliability, we introduced the Service Level concept and set Service Level Objectives (SLOs), backed by processes to maintain them. Our main SLOs are:
- an SLO for application errors (success rate),
- an SLO for the number of bugs per product,
- an SLO for the crash-free rate of mobile applications.
Different teams can add their custom stability goals. For example, someone may have a target value for release frequency.
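As an illustration of the first SLO above, here is a minimal sketch of checking a success rate against a target. The function names and the 99.9% default are assumptions for the example, not Dodo's actual monitoring code.

```python
# Hypothetical sketch: checking a success-rate SLO from request counts.
# The 99.9% default target is illustrative, not Dodo's actual configuration.

def success_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of successful requests over a monitoring window."""
    if total_requests == 0:
        return 1.0  # no traffic -> treat the window as healthy
    return (total_requests - failed_requests) / total_requests

def meets_slo(total: int, failed: int, target: float = 0.999) -> bool:
    """True if the window's success rate is at or above the SLO target."""
    return success_rate(total, failed) >= target

# 1,000,000 requests with 800 failures is 99.92%: within a 99.9% SLO.
print(meets_slo(1_000_000, 800))    # True
# 1,500 failures is 99.85%: the SLO is violated.
print(meets_slo(1_000_000, 1_500))  # False
```

In a real setup these counts would come from a monitoring system rather than being passed in by hand.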
Setting an objective is not enough; it must also be maintained. For example, the error-rate SLO has a system-wide target of 99.9%.
Critical services have their own established SLOs, typically 99.99%. Each service has an owner team whose task is to monitor it and fix stability issues. The screenshot shows a summary dashboard where the owner team records actions to fix problems with their service. Even fixing minor deviations from the target value helps maintain the overall stability of Dodo IS.
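A useful way to reason about targets like 99.9% and 99.99% is to translate them into an error budget. The sketch below does that arithmetic; the 30-day window is an assumption for illustration.

```python
# Illustrative sketch: translating an SLO target into an error budget,
# i.e. how much downtime the target allows over a window.

def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given SLO target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - target)

# A 99.9% system-wide target allows about 43.2 minutes of downtime per 30 days;
# a 99.99% target for a critical service allows only about 4.32 minutes.
print(round(error_budget_minutes(0.999), 2))   # 43.2
print(round(error_budget_minutes(0.9999), 2))  # 4.32
```

The tenfold difference between the two budgets is why critical services get stricter on-call coverage than the rest of the system.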
To maintain the service level, it's not enough to analyze problems after the fact; we also need to respond quickly to failures. This is where the on-call process comes in.
Every development team is responsible for the services of its domain. Duty rotations include separate shifts during work hours, nonworking hours, and on weekends or holidays. For critical services, we implemented 24/7 duty shifts.
All system services are divided into three levels of criticality, called pools: A, B, and C. Outside of business hours, an escalation system covers services in pools A and B; pool C contains all remaining services. Each pool has its own target MTTA, SLO, and compensation coefficient. We watch every service and are ready to fix any issue.
Each on-call engineer goes through workshops and other training and has deep knowledge of the services they work with every day. In case of a service failure, the on-call engineer must start investigating within 5 minutes, and the incident management pipeline handles the rest. On-call engineers receive compensation for working on incidents and for being on call for critical services.
Let's see how it works at night or on weekends (the right part of the scheme). A monitoring signal comes to the first support line, which decides whether the failure is critical or just minor fluctuations or false alerts. If the issue is severe, the first line escalates it to one of the eight on-call engineers in pools A and B. The on-call engineer can call others on duty at that moment or reach people in another pool if related systems and services are affected. For cases where no one answers, we developed an escalation system, so we can still find engineers to fix the issue as quickly as possible.
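The escalation flow described above can be sketched as a small routing policy. Everything here is a simplified assumption: the class names, the chain of roles, and the rule that pool C is not escalated out of hours are illustrative, not Dodo's actual tooling.

```python
# Hypothetical sketch of the out-of-hours escalation flow: the first line
# pages the pool's on-call chain in order until someone acknowledges.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Service:
    name: str
    pool: str  # "A", "B", or "C" (assumed labels, as in the article)

@dataclass
class EscalationPolicy:
    # Ordered chain of engineers to page; roles here are made up.
    chain: list = field(default_factory=list)

    def page(self, service: Service, acknowledged: set) -> Optional[str]:
        """Page engineers in order; return the first one who acknowledges."""
        if service.pool not in ("A", "B"):
            return None  # pool C is handled during business hours only
        for engineer in self.chain:
            if engineer in acknowledged:
                return engineer
        return None  # no one answered: escalate further by other means

policy = EscalationPolicy(chain=["primary", "secondary", "team_lead"])
svc = Service("payments-api", pool="A")
# The primary did not acknowledge, so the page lands on the secondary.
print(policy.page(svc, acknowledged={"secondary"}))  # secondary
```

A real escalation system would add timeouts between pages and cross-pool lookups; the point here is only the ordered-chain idea.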
But once again: it's not enough to find a capable engineer, determine the cause of the incident, and fix the problem. If we don't analyze it and don't create and follow a plan to eliminate the root cause, it may happen again and again, creating new problems for our business. That's why we use postmortems.
After incidents, a "postmortem" review always takes place. The practice of postmortems is used to identify the root cause of the problem. It's essential to identify systemic issues in the architecture and design of services and fix them. Without this, we cannot maintain the Service Level because the number of problems will increase over time.
One of the main difficulties in working with postmortems is conducting a high-quality analysis of what happened. A well-structured template is essential here: it should lead us from describing the facts to making concrete decisions. Guiding questions like "What helped us during the incident?", "What went wrong?" or "What went like clockwork?" should push for deep insights. You also need general information common to all failures: date, downtime duration, business loss in money, and affected services. This general information enables meta-analysis, letting you spot trends and tendencies.
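To make the meta-analysis concrete, the "general information" fields from each postmortem can be kept as structured records. The field names below are assumptions for illustration and are not the actual Dodo template:

```python
# Illustrative sketch: storing each postmortem's general information in a
# structured record so trends across incidents can be analyzed later.
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class PostmortemRecord:
    incident_date: date
    downtime_minutes: float
    business_loss: float          # money lost during the incident
    affected_services: List[str]
    root_cause: str

# Made-up example data, not real incidents.
records = [
    PostmortemRecord(date(2023, 5, 1), 42.0, 1500.0, ["tracking"], "db failover"),
    PostmortemRecord(date(2023, 6, 3), 15.0, 400.0, ["menu"], "bad deploy"),
]

# Meta-analysis example: total downtime across all recorded incidents.
total_downtime = sum(r.downtime_minutes for r in records)
print(total_downtime)  # 57.0
```

With records like these, trend questions ("is downtime growing quarter over quarter?", "which root causes repeat?") become simple aggregations.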
Here is the postmortem template we use at Dodo after every incident: https://www.notion.so/dodobrands/Delivery-driver-app-doesn-t-work-Network-failure-2c315f993e324dddb9c37cd41ae1d291?pvs=4
You can also learn postmortem best practices from the authors of the practice in Google's SRE book, just as we did: https://sre.google/sre-book/postmortem-culture/
As a result, stability is supported by a combination of processes:
- Maintaining the Service Level of each service and fixing minor service issues.
- Running on-call duty and the incident management process, during which we mitigate failures.
- Performing high-quality analysis of the causes of failures and eliminating them; sometimes this requires a lot of time and significant system changes.
Thus, we can guarantee the specified level of stability for the Dodo IS.
If you have any questions on how we work with SLOs, on-call, and postmortems, feel free to contact me in the comments or directly: we at Dodo Engineering are always happy to share our experience!