Discussion on: Are you the lonely DevOps engineer doing 24/7 on-call? Change it!

A good process is to create an ownership matrix that lists each service and software component, its SLA, and its owner.

You can establish internal SLAs for error rates and latency, and use those as your health indicators.

Service	SLA	Owner
User Reg	P50 < 500ms	Bob
Search	P50 < 250ms	Mary
Backup	error rate < 0.1%	Paul

Then use an alerting system like PagerDuty to route those alerts to the service owner.
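As a rough sketch of the matrix and routing idea, here is how it could look in Python. The names `Sla`, `OWNERSHIP`, and `route_alert` are made up for illustration, and the thresholds are just the ones from the table above. It only decides who should be paged; actually sending the page through your alerting system is left out.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional, Sequence


@dataclass(frozen=True)
class Sla:
    service: str
    metric: str       # "p50_ms" or "error_rate" (hypothetical metric names)
    threshold: float  # SLA is breached when the observed value >= threshold
    owner: str


# The ownership matrix from the table, expressed as data.
OWNERSHIP = [
    Sla("User Reg", "p50_ms", 500, "Bob"),
    Sla("Search", "p50_ms", 250, "Mary"),
    Sla("Backup", "error_rate", 0.001, "Paul"),  # 0.1%
]


def p50(latencies_ms: Sequence[float]) -> float:
    """Median latency of a sample window."""
    return median(latencies_ms)


def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def route_alert(service: str, observed: float) -> Optional[str]:
    """Return the owner to page if the SLA is breached, else None."""
    for sla in OWNERSHIP:
        if sla.service == service and observed >= sla.threshold:
            return sla.owner
    return None
```

For example, `route_alert("Search", p50([100, 300, 400]))` returns `"Mary"` because the median of 300 ms is over the 250 ms target, while a service that is within its SLA returns `None`.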

Eventually you will want to have the service owner be first responder so they become accountable for outages.

It's important to do this gradually and work closely with the team as you transition into delegating this responsibility. Communicating the long-term plan and transition milestones is helpful here. You will get pushback.

Make sure that the service owner has the appropriate authority to remediate, e.g. access to logs, permission to terminate instances, debug access, etc.

Then review the past month's tickets to make sure that the rate of delegation is improving, and ideally that the overall ticket rate is going down.
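If your ticketing system can export tickets, the monthly review can be a few lines of Python. This is only a sketch: the `(month, service, responder)` tuple shape, the `OWNERS` mapping, and the `delegation_report` name are assumptions for illustration, not something any particular ticketing tool provides.

```python
from collections import defaultdict

# Hypothetical mapping of service -> owner, taken from the ownership matrix.
OWNERS = {"User Reg": "Bob", "Search": "Mary", "Backup": "Paul"}


def delegation_report(tickets):
    """Summarize tickets per month.

    tickets: iterable of (month, service, responder) tuples.
    Returns {month: (ticket_count, delegation_rate)}, where delegation_rate
    is the share of tickets handled by the service's own owner.
    """
    counts = defaultdict(lambda: [0, 0])  # month -> [total, handled_by_owner]
    for month, service, responder in tickets:
        counts[month][0] += 1
        if OWNERS.get(service) == responder:
            counts[month][1] += 1
    return {
        month: (total, owned / total)
        for month, (total, owned) in sorted(counts.items())
    }
```

You want the second number in each pair to climb month over month and the first to shrink.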

If things aren't moving in the right direction, set up a committee with the leads and start doing root cause analysis (RCA) reviews of recurring issues. Ideally, fires should be unforeseen issues, not recurring failures.