A good process is to create an ownership matrix that lists each service or software component, its SLA, and its owner.
You can establish internal SLAs for error rates and latency, and use those as your health indicators.
Service     SLA                  Owner
User Reg    P50 < 500ms          Bob
Search      P50 < 250ms          Mary
Backup      error rate < 0.1%    Paul
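As a rough sketch, the matrix above could live in code so the same table drives your health checks. The class name, metric names and the `breaches` helper are made up for illustration, and the measured values are assumed to come from whatever monitoring you already have.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class ServiceSLA:
    service: str
    metric: str       # hypothetical metric name: "p50_latency_ms" or "error_rate"
    threshold: float  # the SLA is met while the measured value stays below this
    owner: str


# Rows copied from the ownership matrix; 0.1% is written as the fraction 0.001.
OWNERSHIP_MATRIX = [
    ServiceSLA("User Reg", "p50_latency_ms", 500.0, "Bob"),
    ServiceSLA("Search", "p50_latency_ms", 250.0, "Mary"),
    ServiceSLA("Backup", "error_rate", 0.001, "Paul"),
]


def p50(latencies_ms):
    """Median latency of a list of samples, in milliseconds."""
    return median(latencies_ms)


def error_rate(errors, total):
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def breaches(measurements):
    """Return the matrix rows whose measured value is not below the threshold.

    `measurements` maps a service name to its current value for that
    service's metric. Services without a measurement are skipped.
    """
    return [
        row
        for row in OWNERSHIP_MATRIX
        if row.service in measurements
        and measurements[row.service] >= row.threshold
    ]
```

Calling `breaches({"Search": p50([200, 260, 300])})` returns the Search row, which tells you both what broke and who owns it.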
Then use an alerting system like PagerDuty to route alerts on those SLAs to the service owner.
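Here is a minimal sketch of the routing step, assuming one PagerDuty integration key per owner. The keys and the `build_alert` helper are placeholders of mine. The event fields (`routing_key`, `event_action`, and a `payload` with `summary`, `source`, `severity`) follow the PagerDuty Events API v2 as I understand it, so check them against the current docs before relying on this. The sketch only builds the event and does not send it.

```python
# Owner per service, taken from the ownership matrix.
SERVICE_OWNERS = {"User Reg": "Bob", "Search": "Mary", "Backup": "Paul"}

# Placeholder integration keys, one per owner (assumption: real keys would
# come from your alerting tool's configuration, not from source code).
ROUTING_KEYS = {
    "Bob": "BOB_ROUTING_KEY",
    "Mary": "MARY_ROUTING_KEY",
    "Paul": "PAUL_ROUTING_KEY",
}


def build_alert(service, summary, severity="error"):
    """Build an alert event addressed to the owner of `service`.

    Raises KeyError if the service has no owner in the matrix, which is
    itself worth knowing about.
    """
    owner = SERVICE_OWNERS[service]
    return {
        "routing_key": ROUTING_KEYS[owner],
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": service,
            "severity": severity,
        },
    }
```

Keeping the lookup in one place means that changing an owner in the matrix changes who gets paged, with no alert rules to edit.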
Eventually you will want the service owner to be the first responder, so they become accountable for outages.
It's important to do this gradually and to work closely with the team as you transition into delegating this responsibility. Communicating the long-term plan and transition milestones is helpful here. You will get pushback.
Make sure the service owner has the authority and access needed to remediate, e.g. access to logs, permission to terminate instances, and the ability to debug.
Then review the past month's tickets to make sure the rate of delegation is improving and, ideally, the overall ticket rate is going down.
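One way to run that review, sketched under the assumption that you can export each ticket as a (month, first responder, service owner) tuple. That tuple layout and the function name are hypothetical, so adapt them to whatever your ticketing system gives you. A ticket counts as delegated when the service owner was the first responder.

```python
from collections import defaultdict


def monthly_delegation(tickets):
    """Summarise tickets per month.

    `tickets` is an iterable of (month, first_responder, service_owner)
    tuples. Returns {month: (ticket_count, delegation_rate)}, where
    delegation_rate is the share of tickets first handled by the owner.
    """
    totals = defaultdict(int)
    delegated = defaultdict(int)
    for month, responder, owner in tickets:
        totals[month] += 1
        if responder == owner:
            delegated[month] += 1
    return {
        month: (totals[month], delegated[month] / totals[month])
        for month in sorted(totals)
    }
```

You want the second number in each pair to climb month over month and the first to fall.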
If things aren't moving in the right direction, set up a committee with the leads and start doing root cause analysis (RCA) reviews of recurring issues. Ideally, fires should be unforeseen issues, not recurring failures.
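To pick what goes on the RCA review agenda, you could count how often the same root cause shows up for the same service. This is only a sketch: it assumes each incident record carries a `service` and a `root_cause` field, which are names I made up, and it treats anything seen at least twice as recurring.

```python
from collections import Counter


def recurring_issues(incidents, min_count=2):
    """Return ((service, root_cause), count) pairs seen at least `min_count`
    times, most frequent first.

    `incidents` is an iterable of dicts with "service" and "root_cause" keys.
    """
    counts = Counter((i["service"], i["root_cause"]) for i in incidents)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]
```

Anything this returns is, by definition, not an unforeseen fire, so it is a good candidate for the committee.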