Andreas Wittig

Posted on May 16, 2019 • Originally published at marbot.io

Are you the lonely DevOps engineer doing 24/7 on-call? Change it!

#devops #oncall #sre

Are you the only one on your team who takes responsibility for the production system? Do you carry your laptop with you even in your free time so you can fix issues in production? Are you unofficially on-call 24/7?

I've been in the same situation. Being the lonely DevOps engineer - even if you are part of a bigger team - can be a burden.

But how do you move from a one-person show to a textbook on-call team effort? Here are some ideas for changing your situation.

Pair Programming and Pair Debugging

Team up with one team member when programming or debugging.

Share your screen and explain to your colleague what you are doing. Ask your colleague for help to avoid mistakes and to find better solutions.
Guide your colleague through making changes to the infrastructure from his/her machine. Don't forget to discuss the "why".
Watch your colleague and let her/him explain to you what she/he is doing. Give valuable feedback, but only from time to time.

Repeat the process with all of your team.

Safe Learning Environment

Learning how to operate a complex cloud infrastructure is scary for the rest of your team. It is critical to take away your colleagues' fear of breaking production. Make sure to grant the whole team access to a safe learning environment. For example, an AWS account that is used only for experimenting. Even better, provide a separate AWS account for each colleague.

Infrastructure and Operations Documentation

Invest in creating and updating documentation of your cloud infrastructure and operations. Doing so may not be your favorite job, but it's necessary. Observe the questions from your colleagues and improve the documentation accordingly.

Illustrate the high-level architecture with a diagram. Lucidchart and Cloudcraft are my favorite tools to create architecture diagrams.
Illustrate the network topology with a figure.
Describe the different parts of your architecture.
Describe your backup and recovery strategy.
Explain where to find monitoring metrics, alarms, and logs.

Show & Tell

Are you planning a significant change in production? Did you improve monitoring or logging? Spread the knowledge and organize a Show & Tell meeting. Thirty minutes should be plenty. Don't forget to reserve 10 minutes for questions from your colleagues.

Runbook

Being on-call for a production system leaves your team with a queasy feeling. It takes some time to build the confidence to fix any problem. Support your colleagues by providing runbooks that guide them through locating and fixing common issues.

A runbook should answer the following questions:

How to categorize the severity of the incident? For example, by pointing to relevant metrics or logs.
How to locate the root cause of the failure?
How to fix the root cause of the incident?

Check out our "ALB UnHealthyHostCount" runbook as an example.
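The three questions above can be captured in a small template so that every runbook has the same shape and gaps are easy to spot. The following is only a minimal sketch: the `Runbook` class, its field names, and the sample entries are assumptions made for illustration and are not taken from the runbook mentioned above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Runbook:
    """Hypothetical runbook skeleton covering the three questions."""

    alarm: str
    severity_hints: list[str] = field(default_factory=list)   # how to categorize severity
    diagnosis_steps: list[str] = field(default_factory=list)  # how to locate the root cause
    fix_steps: list[str] = field(default_factory=list)        # how to fix the root cause

    def missing_sections(self) -> list[str]:
        """Return the names of the sections that are still empty."""
        sections = {
            "severity_hints": self.severity_hints,
            "diagnosis_steps": self.diagnosis_steps,
            "fix_steps": self.fix_steps,
        }
        return [name for name, items in sections.items() if not items]

    def render(self) -> str:
        """Render the runbook as plain text with numbered steps."""
        lines = [f"Runbook: {self.alarm}"]
        for title, items in (
            ("Severity", self.severity_hints),
            ("Locate the root cause", self.diagnosis_steps),
            ("Fix", self.fix_steps),
        ):
            lines.append(f"{title}:")
            lines.extend(f"  {i}. {item}" for i, item in enumerate(items, start=1))
        return "\n".join(lines)


# Illustrative entries only; replace them with steps that match your system.
runbook = Runbook(
    alarm="ALB UnHealthyHostCount",
    severity_hints=["Compare UnHealthyHostCount with HealthyHostCount for the target group."],
    diagnosis_steps=["Check the application logs of the unhealthy targets."],
)
print(runbook.missing_sections())  # ['fix_steps']
```

A check like `missing_sections()` could run in a review step so that no runbook ships without a fix section.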

Blameless Postmortems

When you hand over responsibility for production to your team, the number of incidents caused by human error will increase. Set a good example. Don't blame people for human error. Organize blameless postmortems instead. Help your team to learn from failure. Don't forget to bring your management on board as well.

Praise On-Call

Appreciate colleagues who are doing on-call and take responsibility for production.

Praise the colleague who completed her/his first weekend or night on-call shift.
Praise the colleague who takes over an extra on-call turn from a sick colleague.
Award the "on-call engineer of the month" based on the number of fixed incidents.
Provide a day off for colleagues who excelled during their on-call shifts.

Or think of other forms of gamification that fit your team spirit. Make sure to get support from management for appreciating colleagues doing on-call shifts.

Summary

Are you the lonely DevOps engineer doing 24/7 on-call? Change it! There is no one-size-fits-all solution. But no one besides you will drive the change.

Are you a lonely DevOps engineer? I want to connect. Please contact me!

Top comments (1)

Tony Metzidis • May 17 '19

A good process is to create an ownership matrix listing out each service & software component, its SLA and the owner for the component.

You can establish internal SLAs for error rates and latency and use those as your health indicators.

Service     SLA                  Owner
User Reg    P50 < 500ms          Bob
Search      P50 < 250ms          Mary
Backup      error rate < 0.1%    Paul

Then use an alerting system like PagerDuty to route those alerts to the service owner.
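The matrix and the routing idea can be sketched in a few lines of Python. This is a rough illustration under assumptions, not a real integration: the `ServiceSla` class, the metric names, the `route_alerts` function, and the sample measurements are all hypothetical, and the sketch only returns who should be notified instead of calling an alerting system.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSla:
    service: str
    metric: str       # hypothetical metric name, e.g. "p50_latency_ms"
    threshold: float  # the SLA holds while the measured value stays below this
    owner: str


# The ownership matrix from the table above; 0.1% is written as the fraction 0.001.
OWNERSHIP_MATRIX = [
    ServiceSla("User Reg", "p50_latency_ms", 500, "Bob"),
    ServiceSla("Search", "p50_latency_ms", 250, "Mary"),
    ServiceSla("Backup", "error_rate", 0.001, "Paul"),
]


def route_alerts(measurements: dict[str, float]) -> list[tuple[str, str]]:
    """Return (owner, message) pairs for every service whose SLA is breached."""
    alerts = []
    for entry in OWNERSHIP_MATRIX:
        value = measurements.get(entry.service)
        if value is not None and value >= entry.threshold:
            message = (
                f"{entry.service}: {entry.metric}={value} "
                f"breaches SLA < {entry.threshold}"
            )
            alerts.append((entry.owner, message))
    return alerts


# Made-up measurements: only User Reg is above its threshold.
for owner, message in route_alerts({"User Reg": 620, "Search": 120, "Backup": 0.0005}):
    print(f"notify {owner}: {message}")
```

Keeping the matrix as data makes it easy to review who owns what, and a missing measurement is simply skipped rather than treated as a breach.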

Eventually you will want to have the service owner be first responder so they become accountable for outages.

It's important to do this gradually and work closely with the team as you transition into delegating this responsibility. Communicating the long term plan and transition milestones is helpful here. You will get pushback.

Make sure that the service owner has the appropriate authority to remediate -- e.g. access to logs and permission to terminate instances, debug, etc.

Then review the past months' tickets to make sure that the rate of delegation is improving -- and ideally the overall ticket rate is going down.
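One way to make that review concrete is to compute, per month, the total ticket count and the share of tickets handled by the service owner. The sketch below assumes each ticket can be reduced to a month and a flag saying whether the owner handled it; the function name, the input format, and the sample data are made up for illustration.

```python
from collections import defaultdict


def delegation_rate_by_month(tickets):
    """Map each month to (ticket count, share of tickets handled by the owner).

    `tickets` is an iterable of (month, handled_by_owner) pairs,
    where month is a sortable string such as "2019-03".
    """
    totals = defaultdict(int)
    delegated = defaultdict(int)
    for month, handled_by_owner in tickets:
        totals[month] += 1
        if handled_by_owner:
            delegated[month] += 1
    return {
        month: (totals[month], delegated[month] / totals[month])
        for month in sorted(totals)
    }


# Hypothetical sample data, not real tickets.
sample = [
    ("2019-03", False), ("2019-03", False), ("2019-03", True), ("2019-03", False),
    ("2019-04", True), ("2019-04", True), ("2019-04", False),
]
for month, (count, rate) in delegation_rate_by_month(sample).items():
    print(f"{month}: {count} tickets, {rate:.0%} handled by the owner")
```

If the share is not rising or the count is not falling from month to month, that is a signal to look into it.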

If things aren't moving in the right direction, set up a committee with the leads and start doing RCA reviews of recurring issues. Ideally, fires should be unforeseen issues, not recurring failures.