DEV Community

Site Reliability Engineering

Talking a little bit about Ansible's loops

5 reactions · 4 min read

How to track your product's SLO/ErrorBudget: A simple tool to keep track of things!

6 reactions · 3 min read

Understanding the ABCs of CD

2 reactions · 3 min read

How to Analyze Contributing Factors Blamelessly

2 reactions · 5 min read

How to keep a Jenkins build running when a stage fails

6 reactions · 4 min read

A different approach working with Ansible variables

5 reactions · 2 min read

Having On-call Nightmares? Runbooks can Help you Wake Up.

7 reactions · 5 min read

How to Build an SRE Team with a Growth Mindset

2 reactions · 6 min read

Helm - Add some dynamism to your K8s deployment

6 reactions · 2 min read

Episode 3: To Boldly Debug

3 reactions · 1 min read

SRE2AUX: How Flight Controllers were the first SREs

2 reactions · 20 min read

Kubernetes Health Checks - 2 Ways to Improve Stability in Your Production Applications

9 reactions · 10 min read

How to: Pingdom super powered status page

2 reactions · 3 min read

Infracost diff - "git diff" but for cloud costs

7 reactions · 2 min read

It's all Chaos! And it Makes for Resilience at Scale

4 reactions · 4 min read

How We Built and Use Runbook Documentation at Blameless

15 reactions · 2 comments · 5 min read

SigNoz : Open-source alternative to DataDog

20 reactions · 2 comments · 3 min read

Lessons from Slack, GCP and Snowflake outages

4 reactions · 3 min read

Deep Dive into Docker Internals - Union Filesystem

24 reactions · 10 min read

How They SRE

7 reactions · 1 comment · 1 min read

Introduce Chaos Platform 2.0 for Azure

7 reactions · 2 min read

What Is Nix and Why You Should Use It

4 reactions · 7 min read

How do you wrap your head around observability?

49 reactions · 13 comments · 1 min read

Top Reliability and Scaling Practices from Experts at Citrix, Greenlight Financial, and Incognia

2 reactions · 14 min read

Reliability as an Inseparable Part of Software Engineering

3 reactions · 5 min read

Getting Started as an SRE? Here are 3 Things You Need to Know.

4 reactions · 5 min read

Istio - Your next K8s must-have tool

5 reactions · 2 min read

The Key Differences between SLI, SLO, and SLA in SRE

6 reactions · 9 min read

How to Backup your Applications Data to S3 with Walrus

6 reactions · 2 min read

Splunk - Calculate duration between two events

4 reactions · 1 min read

What is the right AWS Kubernetes distribution for you?

4 reactions · 5 min read

The True Cost of Building your Own Incident Management System (IMS)

2 reactions · 5 min read

Communication Tool Down? Here are 3 Ways to Handle it

3 reactions · 5 min read

GCP DevOps Certification - Day Ten

4 reactions · 3 min read

Quick Survey: IT on-call experience in an "Always-On" world

5 reactions · 2 comments · 1 min read

Azure Front Door: An Overview

3 reactions · 3 min read

Managing health checks at scale

6 reactions · 5 min read

"I'm Just Doing my Job," An SRE Myth

3 reactions · 5 min read

Running the AWS CLI across multiple accounts the easy way

6 reactions · 3 min read

What is a microservice catalog?

2 reactions · 1 comment · 5 min read

Top Observability tools for DevOps Engineers and SREs

12 reactions · 7 min read

Kubernetes gone bust. Now what?

6 reactions · 4 min read

Localizer: An adventure in creating a reverse tunnel/tunnel manager for Kubernetes

5 reactions · 8 min read

How Kyverno helps with policy management

2 reactions · 3 min read

Argo CD

6 reactions · 2 min read

AWS Project: Deploying a Static Website to AWS

4 reactions · 1 min read

The Engineer's Guide to Preparing for Black Friday 2020

2 reactions · 8 min read

Choosing SLOs that users need, not the ones you want to provide

6 reactions · 6 min read

Blameless Book Club: Implementing Service Level Objectives, Part 1

6 reactions · 7 min read

Debugging incidents in Google's Distributed Systems

1 reaction · 2 min read

Resilience Engineering and Life

3 reactions · 4 min read

Testing ML incident detection using a cloud native microservices app

11 reactions · 10 min read

Operational Readiness Review Template

6 reactions · 7 min read

Google Down worldwide | Why is Google Down? Let's break it down

15 reactions · 4 min read

SREview Issue #7 November 2020

4 reactions · 2 min read

Making Instrumentation Extensible

5 reactions · 7 min read

SREview Issue #8 December 2020

4 reactions · 2 min read

Challenges with Implementing SLOs

3 reactions · 11 min read

How to SRE without an SRE on your team

3 reactions · 10 min read

Honeycomb SLO Now Generally Available: Success, Defined.

6 reactions · 7 min read