DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Alert Fatigue is Breaking DevOps: Here is the Math

2 min read
Chapter 6 — Sagas & Compensating Transactions: Building “Retryable Conversations”

7 min read
Telemetry Debt Is Not “Missing Logs” — It’s Missing Proof

6 min read
The Old Guard vs. The New Way: Traditional Infrastructure Management vs. Modern DevOps

5 min read
How to Design a DevOps Monitoring Strategy That Actually Works

3 min read
PORT VS SOCKET

3 min read
Why your developers hate your internal tooling (and how to fix it)

2 min read
Your Identity System Is Your Biggest Single Point of Failure

5 min read
Why Nobody Completes Postmortem Action Items (and How to Fix It)

1 min read
Your AI Agent Is Available, Fast, and Making Terrible Decisions

6 min read
Quiet Failures: Why Modern Systems Drift Into Outages (and How to Catch Them Early)

5 min read
Hosted control plane: when it simplifies operations and when it adds complexity

11 min read
Chaos by Design: Production Maintenance Drills on Kubernetes

5 min read
OpenTelemetry: the one instrumentation standard to rule them all

2 min read
Trust Is an Engineering Output: How Teams Earn Credibility When Systems Break

5 min read