DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

The 60-Second Break-Glass Protocol: Hot-Patching Live Production Outages via Local Tunnels

11 min read
Instrumenting Legacy Code Without Rewriting It

2 min read
I Let Claude Design 4 Chaos Experiments via MCP. The 4th Took Down Staging and Found a 6-Month-Old Bug.

11 min read
System Design - Availability & Reliability: What "99.9% Uptime" Really Means (And Why It's Not Enough)

6 min read
Configure Audit Logging in Kubernetes

4 min read
The 54-point production deployment checklist that saves you from 3am rollbacks

3 min read
What DDIA taught me about reliability

1 min read
The Case for a Dedicated Reliability Engineer

2 min read
Observability Telemetry and Predictive AIOps

8 min read
The 10 Commandments of Working in Production

7 min read
Why I treat API timeouts as "unknown", not failures

1 min read
The Prometheus label that blew our monitoring bill out 6x

4 min read
How to Optimize MongoDB on Bare Metal Servers: SRE Playbook

5 min read
API Rate Limiting: Patterns That Scale

2 min read
Kiln Crisis Management: Controlling Irregular Raw Meal in CCR Using Python

3 min read