DEV Community

Karthi
Observability, Monitoring and then SRE

Context

Breaking monolithic applications into a microservices architecture offers many benefits, but it also creates complexity. Microservices need to communicate with one another, and each individually created and updated component must work with the others with a minimum of latency. So when managing an application composed of microservices, we also manage a network of interrelated components. Effective management of that network is essential to overall reliability. Microservice environments require significant coordination, insight, and care to monitor and measure the requests passing through the system.

Unknown Unknowns Vs Known Unknowns

Monitoring tells you that something is wrong. Metrics are collected at regular intervals, and alarms are raised when a metric meets a certain condition, such as response times crossing an upper threshold. Observability came to prominence on the heels of monitoring, as distributed systems started to overtake the monolith pattern.
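As a rough illustration of the threshold-style alarm described above, here is a minimal Python sketch. The function names, the nearest-rank percentile, and the 500 ms threshold are assumptions made for this example; they do not come from any particular monitoring product.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank method: 1-based rank, rounded up.
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def check_latency(samples, threshold_ms):
    """Return an alert message if p95 latency crosses the threshold, else None."""
    observed = percentile(samples, 95)
    if observed > threshold_ms:
        return f"ALERT: p95 latency {observed:.0f} ms exceeds {threshold_ms} ms"
    return None
```

Calling `check_latency` on each collection interval's response times gives the "known unknown" behaviour: the check only fires for the condition someone thought to define in advance.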

The issue with the monitoring approach on modern stacks is that it is difficult to track down an error within a microservice architecture when requests traverse multiple (and often ephemeral) services. Monitoring tends to answer the known unknowns: the questions owners have prepared themselves to answer.

Observability, on the other hand, aims to address the unknown unknowns: the questions that service owners haven’t prepared for or haven’t even realized they’d want answered.

In distributed systems, or in any mature, complex application of scale built by good engineers, the majority of production incidents trend towards the unknown unknowns.

Monitoring requires you to know what’s important to monitor in advance. Observability lets you determine what’s important by watching how the system performs over time and asking relevant questions about it.

Observability is the ability to measure the internal states of a system by examining its outputs. A system is considered “observable” if its current state can be estimated using only information from its outputs.
Observability uses three types of data (metrics, logs and traces) to provide deep visibility into distributed systems, allowing teams to get to the root cause of a multitude of issues and improve the system’s performance.
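To make the three data types concrete, the sketch below shows one hypothetical request handler emitting a metric, a log record and a trace span that share a trace ID. Everything here (the `handle_request` name, the in-memory `metrics`, `logs` and `spans` stores, the record fields) is an assumption for illustration; a real system would send this data to an observability backend instead.

```python
import time
import uuid
from collections import Counter

metrics = Counter()  # (metric name, route, status) -> count
logs = []            # structured log records
spans = []           # one span per handled request


def handle_request(route, work):
    """Run `work` and record a metric, a log (on failure) and a span for it."""
    trace_id = uuid.uuid4().hex  # shared key that ties the three signals together
    start = time.perf_counter()
    status = "ok"
    try:
        work()
    except Exception as exc:
        status = "error"
        # Log: a discrete event with detail about what went wrong.
        logs.append({"trace_id": trace_id, "level": "error", "msg": str(exc)})
    duration_s = time.perf_counter() - start
    # Metric: a cheap aggregate that can be graphed and alerted on.
    metrics[("requests_total", route, status)] += 1
    # Trace span: timing for this request, joinable with spans from other services.
    spans.append({"trace_id": trace_id, "name": route,
                  "duration_s": duration_s, "status": status})
    return trace_id
```

The metric shows that errors are rising, the span shows which request was slow or failed, and the log with the same `trace_id` explains why. That join is what lets a team ask questions they did not plan for.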

A viable observability platform offers a number of features that help engineers quickly find the source of incredibly complex problems when they occur and also proactively find ways to improve production systems in highly distributed environments.

Reliability matters

The reliability of your systems, services, and products is crucial to your success and the success of your organization. This reality is something we all know, whether we work in Operations, IT and IT management, DevOps, SRE, or as developers responsible for the creation of software.

Site reliability engineering roles and responsibilities are crucial to the continuous improvement of people, processes and technology within any organization. Whether your team has already taken on a full-blown DevOps culture or you’re still attempting to make the transition, SRE offers numerous benefits to speed and reliability.
Site reliability engineers sit at the crossroads of traditional IT and software development. Basically, SRE teams are made up of software engineers who build and implement software to improve the reliability of their systems.

An SRE’s biggest role is to improve the overall resilience of a system and provide visibility to the health and performance of services across all applications and infrastructure.

The origin of SRE & its components

While most DevOps and IT professionals are constantly focused on improving the development process, a large number of teams don’t focus on their systems in production. But, the vast majority of application and infrastructure costs are incurred after deployment. It stands to reason that development teams need to spend more time supporting current services. In order to reallocate their time without impeding velocity, SRE teams are forming – dedicating developers to the continuous improvement of the resilience of their production systems.

The core responsibilities of SRE teams normally fall into these categories:
1) Availability

Availability is the term for the amount of time a device, service or other piece of IT infrastructure is usable.

2) Performance

As teams gain maturity in SRE and availability becomes less erratic, they can start to focus on improving service performance metrics like latency, page load speed and ETL.

3) Monitoring

In order to identify performance errors and maintain service availability, SRE teams need to see what’s going on in their systems. Naturally, the SRE team is assigned the great task of implementing monitoring solutions.

5) Preparation

The continuous improvement of monitoring, incident response and the optimization of service availability and performance will inherently lead to more resilient systems. At the end of the day, SRE teams build the foundation for a more prepared engineering and IT team. With the monitoring resources provided by the SRE team, the development and IT team can deploy new services quickly and respond to incidents in seconds.
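Availability, the first category above, is commonly expressed as the fraction of a period during which the service was usable. The sketch below works through that arithmetic; the function names and the 30-day period are assumptions chosen for the example.

```python
def availability(total_seconds, downtime_seconds):
    """Fraction of the period during which the service was usable."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return (total_seconds - downtime_seconds) / total_seconds


def allowed_downtime(total_seconds, target):
    """Downtime budget in seconds for a target availability such as 0.999."""
    return total_seconds * (1 - target)


THIRTY_DAYS = 30 * 24 * 60 * 60  # 2,592,000 seconds
```

For a 30-day period, a 99.9% target leaves roughly 2,592 seconds (about 43 minutes) of downtime, which gives a sense of how little room a team has to detect and respond to an incident.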

Golden signals of SRE

Golden signals are critical for ops teams to monitor their systems and identify problems. These signals are especially important as we move to microservices and containers, where functions are spread more thinly across more components, including third parties.

There are many metrics to monitor, but industry experience has converged on a few small sets of them.
From the Google SRE book: Latency, Traffic, Errors, Saturation
Latency: the time it takes to service a request
Traffic: the demand being placed on the system
Errors: the rate of requests that fail
Saturation: how “full” the service is, that is, how much of its overall capacity is in use
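The four golden signals can be derived from a window of request records, as in the sketch below. The `Request` record, the use of the median for latency, and measuring saturation as in-flight requests over a concurrency limit are all assumptions for illustration; real services pick the measures that fit their own bottleneck.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_ms: float  # time taken to service the request
    ok: bool            # False if the request failed


def golden_signals(requests, window_s, in_flight, max_in_flight):
    """Summarize one observation window as the four golden signals."""
    if not requests:
        raise ValueError("requests must not be empty")
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_ms": median(r.duration_ms for r in requests),
        "traffic_rps": len(requests) / window_s,
        "error_rate": failed / len(requests),
        # Assumed saturation measure: share of the concurrency limit in use.
        "saturation": in_flight / max_in_flight,
    }
```

In practice latency is usually tracked as a distribution (for example several percentiles) rather than one number, and often separately for successful and failed requests.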
Other industry standards are:
USE Method (from Brendan Gregg): Utilization, Saturation, Errors
Utilization: the average time that the resource was busy servicing work
Saturation: the degree to which the resource has extra work which it can't service, often queued
Errors: the count of error events
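Where the golden signals describe a service, the USE method is applied per resource (a CPU, a disk, a connection pool). The following sketch computes the three USE values from periodic samples of one resource; the `ResourceSample` fields and the use of average queue length as the saturation measure are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    busy_s: float      # time the resource spent servicing work in the interval
    interval_s: float  # length of the sampling interval
    queued: int        # work items waiting when the sample was taken
    errors: int        # error events seen during the interval


def use_summary(samples):
    """Summarize samples of one resource as Utilization, Saturation, Errors."""
    if not samples:
        raise ValueError("samples must not be empty")
    total_interval = sum(s.interval_s for s in samples)
    return {
        "utilization": sum(s.busy_s for s in samples) / total_interval,
        # Assumed saturation measure: average number of queued work items.
        "saturation": sum(s.queued for s in samples) / len(samples),
        "errors": sum(s.errors for s in samples),
    }
```

A resource that shows high utilization together with a growing queue is a likely bottleneck even before any errors appear.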

RED Method (from Tom Wilkie): Rate, Errors, and Duration
Rate - the number of requests, per second, your services are serving.
Errors - the number of failed requests per second.
Duration - distributions of the amount of time each request takes.

The golden signals also give teams a single pane of glass view into the health of all services – whether they’re maintained by that specific team or not. Instead of disparate monitoring across every feature or service, you can roll all monitoring metrics and logs into a single location.

Effective monitoring will not only lead to improved incident management; it will also improve the entire incident lifecycle over time. Implementing SRE and the four golden signals of monitoring will improve cross-functional visibility and collaboration, bringing IT operations and developers together.

References:
https://www.splunk.com/en_us/data-insider/what-is-observability.html
https://engineering.procore.com/observability-basics/
