DEV Community

Discussion on: Are you the lonely DevOps engineer doing 24/7 on-call? Change it!

Tony Metzidis

A good process is to create an ownership matrix listing each service and software component, its SLA, and its owner.

You can establish internal SLAs for error rates and latency, then use those as your health indicators.

| Service  | SLA               | Owner |
| -------- | ----------------- | ----- |
| User Reg | P50 < 500ms       | Bob   |
| Search   | P50 < 250ms       | Mary  |
| Backup   | error rate < 0.1% | Paul  |
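
As a rough sketch, the matrix above could live in code so that health checks and alert routing read from one place. The `Sla` class, the metric names (`p50_latency_ms`, `error_rate`), and the `breaches` helper are my own assumptions for illustration, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sla:
    service: str
    metric: str       # hypothetical metric name, e.g. "p50_latency_ms"
    threshold: float  # the measured value must stay strictly below this
    owner: str


# The three rows from the matrix above.
OWNERSHIP_MATRIX = [
    Sla("User Reg", "p50_latency_ms", 500.0, "Bob"),
    Sla("Search", "p50_latency_ms", 250.0, "Mary"),
    Sla("Backup", "error_rate", 0.001, "Paul"),  # 0.1% written as a fraction
]


def breaches(measurements):
    """Return the SLA rows whose measured value is at or above the threshold.

    `measurements` maps (service, metric) to the latest measured value.
    Rows with no measurement are skipped rather than treated as breaches.
    """
    found = []
    for sla in OWNERSHIP_MATRIX:
        value = measurements.get((sla.service, sla.metric))
        if value is not None and value >= sla.threshold:
            found.append(sla)
    return found
```

For example, passing `{("Search", "p50_latency_ms"): 300.0}` returns only the Search row, whose `owner` field says who should hear about it.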

Then use an alerting system like PagerDuty to route those alerts to the service owner.
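
Here is a minimal sketch of owner-based routing. It only builds the alert body and does not send anything. The field names are shaped like a PagerDuty Events API v2 trigger event as I understand it, so check the current PagerDuty docs before relying on them. The routing keys and the `build_alert` helper are hypothetical placeholders.

```python
# Service-to-owner mapping taken from the ownership matrix.
SERVICE_OWNERS = {"User Reg": "Bob", "Search": "Mary", "Backup": "Paul"}

# Hypothetical placeholders: real routing keys would come from each
# owner's integration in the alerting system.
ROUTING_KEYS = {
    "Bob": "ROUTING_KEY_BOB",
    "Mary": "ROUTING_KEY_MARY",
    "Paul": "ROUTING_KEY_PAUL",
}


def build_alert(service, summary, severity="error"):
    """Build an alert body addressed to the owner of `service`.

    Raises KeyError if the service has no owner, which is a useful
    signal that the ownership matrix has a gap.
    """
    owner = SERVICE_OWNERS[service]
    return {
        "routing_key": ROUTING_KEYS[owner],
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": service,
            "severity": severity,
        },
    }
```

Failing loudly on an unowned service is deliberate here: an alert with nowhere to go usually ends up back with the one on-call person.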

Eventually you will want to have the service owner be first responder so they become accountable for outages.

It's important to do this gradually and work closely with the team as you transition into delegating this responsibility. Communicating the long-term plan and transition milestones is helpful here. You will get pushback.

Make sure that the service owner has appropriate authority to remediate -- e.g. access to logs, access to terminate instances, debug, etc.

Then review the past month's tickets to make sure that the rate of delegation is improving -- and ideally that the overall ticket rate is going down.
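
One way to track this, sketched under assumptions: treat a ticket as "delegated" when the first responder was the service owner, then compute the ticket count and delegated fraction per month. The ticket fields (`month`, `owner`, `first_responder`) are hypothetical and would need mapping from whatever your ticketing system exports.

```python
from collections import defaultdict


def monthly_stats(tickets):
    """Return {month: (ticket_count, delegation_rate)} in month order.

    A ticket counts as delegated when its first responder was the
    owner of the affected service.
    """
    totals = defaultdict(int)
    delegated = defaultdict(int)
    for ticket in tickets:
        month = ticket["month"]
        totals[month] += 1
        if ticket["first_responder"] == ticket["owner"]:
            delegated[month] += 1
    return {m: (totals[m], delegated[m] / totals[m]) for m in sorted(totals)}
```

You would want the second number in each pair trending up and the first trending down.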

If things aren't moving in the right direction, set up a committee with the leads and start doing RCA (root cause analysis) reviews of recurring issues. Ideally, fires should be unforeseen issues, not recurring failures.
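
To pick what the committee reviews first, a simple count of repeated causes may be enough. This is only a sketch: the `service` and `cause` ticket fields are assumed, and in practice the cause label usually has to be assigned by hand during triage.

```python
from collections import Counter


def recurring_issues(tickets, min_count=2):
    """Return ((service, cause), count) pairs seen at least `min_count`
    times, most frequent first."""
    counts = Counter((t["service"], t["cause"]) for t in tickets)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]
```

Anything this returns is a recurring failure by definition, so it is a candidate for an RCA review rather than another one-off fix.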