DevOps and SRE
DevOps is a set of practices that combine software development and operations. DevOps influences the application lifecycle throughout its plan, develop, deliver, and operate phases. Site Reliability Engineering (SRE) is a practical way to implement DevOps practices and principles.
SRE implements DevOps practices via SLIs, SLOs, SLAs, and error budgets.
A Service Level Indicator (SLI) is a quantitative measure of the level of service provided over a period. SLIs are the metrics defined by the user journey for a service, for example availability, latency, and throughput.
A Service Level Objective (SLO) is a numerical target that defines the reliability of a system. SLOs are measured using SLIs.
A Service Level Agreement (SLA) is a commitment that the availability and reliability of the service will meet a certain level of expectation.
The error budget tells us how unreliable our service is allowed to be. Error budget = 100% - SLO. For example, a 99.9% SLO leaves a 0.1% error budget.
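To make the relationship between SLI, SLO, and error budget concrete, here is a minimal Python sketch. It assumes availability is measured as the fraction of successful requests; the names `SloReport`, `availability_report`, and `allowed_downtime_minutes` are made up for illustration and do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float              # measured availability, as a fraction (0..1)
    slo: float              # target availability, as a fraction (0..1)
    error_budget: float     # allowed fraction of failures: 1 - SLO
    budget_consumed: float  # share of the error budget already spent


def availability_report(good_requests: int, total_requests: int, slo: float) -> SloReport:
    """Compute an availability SLI and compare it against an SLO."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")

    sli = good_requests / total_requests   # SLI: fraction of good requests
    error_budget = 1.0 - slo               # error budget = 100% - SLO
    failure_rate = 1.0 - sli               # observed unreliability
    budget_consumed = failure_rate / error_budget
    return SloReport(sli, slo, error_budget, budget_consumed)


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Translate an availability SLO into minutes of allowed downtime."""
    return (1.0 - slo) * window_days * 24 * 60


if __name__ == "__main__":
    report = availability_report(good_requests=999_500, total_requests=1_000_000, slo=0.999)
    print(f"SLI: {report.sli:.4%}, budget consumed: {report.budget_consumed:.0%}")
    print(f"Allowed downtime (30 days): {allowed_downtime_minutes(0.999):.1f} minutes")
```

With a 99.9% SLO, the 30-day error budget works out to roughly 43.2 minutes of downtime. In this example, 500 failed requests out of 1,000,000 would use up about half of the budget.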
The DevOps lifecycle:
CI/CD is a key DevOps practice.
Continuous Integration:
A software development practice where all developers merge code changes in a central repository multiple times a day. Tools to help are Cloud Source Repositories, Cloud Build, and Artifact Registry.
Continuous Delivery:
The practice of automating the entire software release process.
Tools to help are GKE, GKE On-Prem, and Cloud Run.
What is observability?
Reliability is the most important feature of a service, and setting SLOs allows monitoring systems to capture how the service is performing.
System reliability is tracked by SLOs. SLOs require SLIs or specific metrics to monitor.
Monitoring is the process of collecting, processing, aggregating, and displaying real-time quantitative data about a system.
With monitoring, one can understand trends in application usage patterns, which in turn helps with health checks of the system as well as diagnosing problems when things go wrong.
Key areas of operations include gathering logs, metrics, and traces; building dashboards for visualization; and triggering alerts and error reporting.
These operations are supported by tools such as Cloud Monitoring, Cloud Logging, and Error Reporting, with application performance management handled by tools like Debugger, Profiler, and Trace.
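As a rough illustration of the collect, aggregate, and alert flow described above, here is a small Python sketch. It does not use any Google Cloud API; the `Request` type, the `evaluate` function, and the thresholds are hypothetical and stand in for what a real monitoring system would provide.

```python
import math
from typing import List, NamedTuple, Sequence


class Request(NamedTuple):
    latency_ms: float  # how long the request took
    ok: bool           # whether it succeeded


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(requests: Sequence[Request],
             latency_threshold_ms: float,
             max_error_rate: float) -> List[str]:
    """Aggregate raw request data and return any alerts that should fire."""
    if not requests:
        return []  # nothing collected, nothing to evaluate

    # Aggregate: 95th percentile latency and overall error rate.
    p95 = percentile([r.latency_ms for r in requests], 95)
    error_rate = sum(1 for r in requests if not r.ok) / len(requests)

    # Alert: compare each aggregate against its threshold.
    alerts = []
    if p95 > latency_threshold_ms:
        alerts.append(f"p95 latency {p95:.0f} ms exceeds {latency_threshold_ms:.0f} ms")
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.1%} exceeds {max_error_rate:.1%}")
    return alerts


if __name__ == "__main__":
    samples = [Request(100.0, True)] * 18 + [Request(900.0, False)] * 2
    for alert in evaluate(samples, latency_threshold_ms=500.0, max_error_rate=0.05):
        print("ALERT:", alert)
```

A real setup would pull these numbers from a metrics backend rather than from an in-memory list, but the shape is the same: gather data, reduce it to a few indicators, and alert when an indicator crosses its threshold.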
References:
Google Cloud DevOps certification preparation with A Cloud Guru.