DEV Community

Site Reliability Engineering

Posts

[es] Retry Pattern: Handling Transient Failures in Distributed Systems

Comments
3 min read
Retry Pattern: Handling Transient Failures in Distributed Systems

Comments
3 min read
[pt-BR] Postmortem: The Importance of Structured Incident Analysis in SRE

Comments
4 min read
Rely.io October 2024 Product Update Roundup

Comments
4 min read
Internal Developer Portals: Autonomy, Governance and the Golden Path

1
Comments
15 min read
SRE Culture Embedding Reliability into Engineering Teams

Comments
3 min read
Procedures as a Solid Foundation for Developer Experience Before Automation

6
Comments
2 min read
SRE Deployment Engineer Managing Reliable & Automated Deployments

3
Comments
4 min read
7 Kubernetes Security Best Practices in 2024

5
Comments
3 min read
SRE vs DevOps: What’s the Difference and Why Does It Matter? 🤓

Comments
1 min read
Rely.io September 2024 Product Update Roundup

1
Comments
4 min read
Best Practices for Choosing a Status Page Provider

Comments
5 min read
Why would I use this instead of Traefik for zero-downtime deployment?

3
Comments
6 min read
Designing a fault-tolerant etcd cluster

7
Comments 1
5 min read
🚀 Day 8: Mastering Shell Scripting in DevOps | Bash Challenge

10
Comments 1
2 min read
[pt-BR] How I Expanded the Storage of My /home Folder with Block Storage

Comments
4 min read
How to Set up Disk and Bandwidth Limits in Docker

3
Comments
2 min read
K8s Plugins For Solid Security

Comments
2 min read
What are Kata Containers?

Comments
2 min read
Zero-Downtime Blue-Green Deployment with a Simple 'git pull & bash run.sh' Command

1
Comments
1 min read
DynamoDB: Query vs. Scan! Stop Burning Money by Using Scan in Production

38
Comments 6
4 min read
How to Fix Kubernetes Node Disk Pressure

Comments
2 min read
Some of the less-known ping types you should know

6
Comments 1
1 min read
How a Pod is Deleted - Behind the Scenes Breakdown

8
Comments 2
2 min read
How To Fix OOMKilled

1
Comments
2 min read
Creating an Efficient IT Incident Management Plan: A Guide to Templates and Best Practices

Comments
7 min read
The “R” in MTTR: Repair or Recover? What’s the difference?

Comments
5 min read
SLOs and Customer Experience: Uniting Engineering Excellence with Customer Satisfaction

Comments
5 min read
SRE and the Enterprise: Building a Culture of Reliability at Scale

Comments
4 min read
DevOps vs. SRE Understanding the Differences and Benefits

Comments
2 min read
How to Define Engineering Standards (with Backstage)

Comments
10 min read
The Pillars of Site Reliability Engineering Building Resilient Systems

Comments
2 min read
Introducing Botkube Fuse: The Platform Engineer’s Copilot

6
Comments
4 min read
DevOps

1
Comments
1 min read
Accelerating Business Growth with a Platform Engineering Team

Comments
5 min read
When Alerts Don’t Mean Downtime - Preventing SRE Fatigue

Comments
2 min read
System Reliability Metrics: A Comparative Guide to MTTR, MTBF, MTTD, and MTTF

Comments
10 min read
The Pulse Of Technology: Why IT Monitoring Is Non-Negotiable In 2024

Comments
13 min read
How to improve DORA metrics as a release engineer

5
Comments
10 min read
The Critical Role of Application and Infrastructure Monitoring

1
Comments
1 min read
SRE and the Enterprise: Building a Culture of Reliability at Scale

Comments
4 min read
Understanding the 0.6-Second Detection Time for Full Outages

8
Comments
3 min read
How To Reduce The Alert Noise For Optimal On-Call Performance

Comments
10 min read
The Cornerstones of SRE: SLI, SLO and SLA

Comments
4 min read
Datadog: how to filter metrics on tag "team"

1
Comments
3 min read
Do You Need All That Support Levels After All?

3
Comments
7 min read
AWS Observability Maturity Model - V2

9
Comments
5 min read
Context is all you need.

1
Comments
1 min read
Enhance Your System Reliability with These Top Log Monitoring Tools

Comments 1
2 min read
CrowdStrike Incident: 5 Key Lessons for DevOps & IT Teams

1
Comments
5 min read
Implementing SLOs in Microservices: A Comprehensive Guide to Reliability and Performance

1
Comments
9 min read
Cold Storage: A Deep Dive into the Frozen Vaults of Data

2
Comments
11 min read
Configuring Terraform to Work Correctly with LocalStack

Comments
3 min read
Implementing SLO Error Budget Monitoring with AWS Services Only

3
Comments 2
5 min read
Synchronize Files between your servers

Comments
3 min read
Static Site Generation

Comments
4 min read
Advanced Incident Management Strategies for Engineers

Comments
11 min read
Role of Human Oversight in AI-Driven Incident Management and SRE

Comments
10 min read
14 Monitoring Tools for Full-Stack Developers

1
Comments
7 min read
The Benefits of a Single Incident Management System

Comments
2 min read