Our friends at Pulumi have a great post: Read Every Single Error. Instead of fancy tools, they just route HTTP 500 server errors to a Slack channel, and the on-call SRE spends an hour triaging them! Incredibly simple.
And effective: their error rate is 17x lower than in the previous year!
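To make the idea concrete, here's a minimal sketch of the "route server errors to Slack" step. This is not Pulumi's actual code; their post is the authority on how they do it. It assumes a Slack incoming webhook (which accepts a JSON body with a `text` field), and the function names, the message format, and the choice to treat every 5xx status as a server error are all made up for illustration.

```python
import json
import urllib.request


def is_server_error(status: int) -> bool:
    """True for HTTP 5xx responses (assumption: all 5xx count, not only 500)."""
    return 500 <= status <= 599


def build_slack_payload(method: str, path: str, status: int, detail: str) -> dict:
    """Build the JSON body for a Slack incoming webhook: {"text": ...}."""
    return {"text": f"{status} on {method} {path}: {detail}"}


def report_error(webhook_url, method, path, status, detail, send=None) -> bool:
    """Post one server error to Slack. Returns True if a message was sent."""
    if not is_server_error(status):
        return False  # client errors and successes stay out of the channel

    payload = build_slack_payload(method, path, status, detail)
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # `send` is injectable so this can be exercised without network access.
    sender = send or urllib.request.urlopen
    sender(request)
    return True
```

You'd call `report_error(...)` from wherever your web framework handles failed requests, passing the webhook URL Slack gives you for the channel. The point is how little there is: one filter, one message, one POST. The real work is the human reading the channel.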
Go read the above article. They're very clear about:
- how limited Site Reliability Engineer (SRE) time is, on both a daily and a yearly basis
- how triaging errors lets them scale the company exponentially
In my practice I always recommend focusing on Quality first, rather than always trying to increase Speed. Quality discussions let Engineers and the Business verify that everyone's moving towards the same goal. Sometimes the Business doesn't care about a bunch of hard technical work, so Engineers don't need to bother with a test suite. By discussing Quality, everyone can agree on the Business Consequences of technical actions, and on the tolerance for errors.
Pulumi decided that "reduce server errors" was a valuable business/tech focus, and they're enjoying the business consequences of that investment.