TL;DR notes from articles I read today.
- Incident retrospectives are an integral part of any good engineering culture.
- Often, too much focus is placed on the trigger of the incident. The retrospective should instead review the incident timeline, identify remediation items, and find owners for those items.
- Retrospectives should be used as an opportunity for deeper analysis into systems (both people and technical) and assumptions that underlie these systems.
- Finding remediation items should be decoupled from the retrospective process. This frees participants to conduct a deeper investigation, since they are not pressured to settle quickly on shallow explanations.
- It’s good practice to lighten up the retrospective template you are using, because any template will be ill-equipped to capture the unique characteristics of varied incidents. Also, sticking rigidly to a template limits the open-ended questions that can be quite useful in evolving your systems in the right direction.
Full post here, 6 mins read
- Myth #1 is that you will experience fewer incidents if you implement an observability strategy. Implementing a strategy has no impact on how many incidents occur; what it gives you is enough telemetry data to resolve a problem quickly when it does arise.
- Myth #2 is that getting an observability tool is a good strategy. Having an observability platform is not sufficient on its own. Unless observability becomes core to your engineering efforts and your company culture, no tool can help.
- Myth #3 is that implementing observability is cheap. As observability is a core part of any modern tech infrastructure, you should think of your observability budget as a percentage of your overall infrastructure budget. The value derived from a good observability program in terms of efficiency, speed, and customer satisfaction surpasses the costs it incurs.
Full post here, 4 mins read
- You can use sampling APIs by way of instrumentation libraries that let you set sampling strategies or rates. For example, Go’s runtime.SetCPUProfileRate lets you set the CPU profiling rate.
- Subcomponents of a system may need different sampling strategies, and the decision can be quite subjective: for a low-traffic background job, you might sample every task but for a handler with low latency tolerance, you may need to aggressively downsample if traffic is high, or you might sample only when certain conditions are met.
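Per-component strategies like the ones above can be expressed behind a common interface. A minimal sketch in Go, with hypothetical type names (not from any particular library): an always-on sampler for the low-traffic job and a 1-in-N counter sampler for the hot handler.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Sampler decides whether a given unit of work should be recorded.
type Sampler interface {
	Sample() bool
}

// AlwaysSampler records everything, e.g. for a low-traffic background job.
type AlwaysSampler struct{}

func (AlwaysSampler) Sample() bool { return true }

// RateSampler keeps one in every n events, e.g. for a high-traffic,
// latency-sensitive handler that must be aggressively downsampled.
type RateSampler struct {
	n     uint64
	count atomic.Uint64
}

func (s *RateSampler) Sample() bool {
	return s.count.Add(1)%s.n == 0
}

func main() {
	var job Sampler = AlwaysSampler{}
	handler := &RateSampler{n: 100}

	fmt.Println(job.Sample()) // background job: always true
	kept := 0
	for i := 0; i < 1000; i++ {
		if handler.Sample() {
			kept++
		}
	}
	fmt.Println(kept) // hot handler: keeps 10 of 1000 events
}
```

Conditional sampling (the "only when certain conditions are met" case) would be another implementation of the same interface.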
- Consider making the sampling strategy dynamically configurable, as this can be useful for troubleshooting.
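One way to make the rate dynamically configurable is to hold it in an atomic so it can be changed at runtime, e.g. from an admin endpoint or a signal handler, without restarting the process. A sketch under those assumptions (function names are hypothetical):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// rate holds the current 1-in-n sampling rate; 0 disables sampling.
// Stored atomically so it can be changed while the process is serving.
var rate atomic.Uint64

var counter atomic.Uint64

// SetRate changes the sampling rate at runtime.
func SetRate(n uint64) { rate.Store(n) }

// ShouldSample keeps one in every rate events.
func ShouldSample() bool {
	n := rate.Load()
	if n == 0 {
		return false
	}
	return counter.Add(1)%n == 0
}

func main() {
	SetRate(1) // troubleshooting: temporarily capture everything
	fmt.Println(ShouldSample()) // prints true
	SetRate(1000) // back to steady state: keep 1 in 1000
}
```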
- If collected data tracks a system end to end and the collection spans more than one process, like distributed traces or events, you might want to propagate the sampling decision from parent to child process through the header passed down.
- If collecting data is inexpensive but transferring or storing it is not, you can collect 100% of the data and apply a filter later to minimize volume while preserving diversity in the sample, specifically retaining edge cases for debugging.
- Never trust a sampling decision propagated from an external source; an attacker could use it to force oversampling and mount a DoS attack.
Full post here, 4 mins read