Jose Cueto

Posted on Dec 23, 2022 • Edited on Dec 27, 2022

Site-Reliability Engineering - Service Monitoring Fundamentals

#sitereliabilityengineering #softwareengineering #devops #servicereliability

Note: Even though this write-up is prefixed with SRE, it’s relevant to software engineering in general and to anyone in the business of designing and implementing reliable services.

Service monitoring is loosely defined across organizations. That looseness risks overlooking its fundamentals when designing and implementing it, which in turn increases the chance of unnecessary service complexity.

This write-up defines service monitoring at its fundamental level, including the fundamental concepts that serve as its building blocks. Ultimately, it aims to provide organization-agnostic first principles of service monitoring that can be used as primitives for the practical design and implementation of service monitoring strategies.

Disclaimer: This write-up is not in any way sponsored by any of the author’s affiliations. Its contents are based solely on the author’s industry experience and internet references.

What is Service Monitoring?

Before we can define service monitoring, we first have to explain the fundamental concepts around it such as the software service or service, observability, service states, service variables, and the concept of a service-user agnostic service modifier. Otherwise, we will end up with a non-atomic definition of service monitoring which is subject to many interpretations and defeats the purpose of this write-up. Furthermore, we define and discuss a particular type of monitoring called service alerting.

Service

At the fundamental level, a software service can be modeled by a finite-state machine or FSM, which has a set of states, a state transition function, a start state, a set of final states, and an input alphabet. We use an FSM because it can fully describe a service and because it is a widely accepted model, so we do not need to reinvent one. Finally, for this write-up, we will keep the FSM as simple as possible, and therefore we only care about its states for now.

Definition: A service is a finite-state machine.
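
As a minimal sketch of this definition, the cloud storage upload used later in this write-up could be modeled as the FSM below. The Idle start state and the input symbols "start", "finish", and "error" are assumptions made for illustration; only the UploadInProgress, UploadFinished, and UploadFailed states come from this write-up.

```python
from enum import Enum


class State(Enum):
    IDLE = "Idle"  # hypothetical start state, not named in the write-up
    UPLOAD_IN_PROGRESS = "UploadInProgress"
    UPLOAD_FINISHED = "UploadFinished"
    UPLOAD_FAILED = "UploadFailed"


# State transition function as a lookup table:
# (current state, input symbol) -> next state.
# The input alphabet {"start", "finish", "error"} is assumed for illustration.
TRANSITIONS = {
    (State.IDLE, "start"): State.UPLOAD_IN_PROGRESS,
    (State.UPLOAD_IN_PROGRESS, "finish"): State.UPLOAD_FINISHED,
    (State.UPLOAD_IN_PROGRESS, "error"): State.UPLOAD_FAILED,
}

START_STATE = State.IDLE
FINAL_STATES = {State.UPLOAD_FINISHED, State.UPLOAD_FAILED}


def run(inputs):
    """Feed input symbols to the FSM and return every state it visits."""
    state = START_STATE
    visited = [state]
    for symbol in inputs:
        key = (state, symbol)
        if key not in TRANSITIONS:
            raise ValueError(f"no transition from {state.value} on {symbol!r}")
        state = TRANSITIONS[key]
        visited.append(state)
    return visited
```

Calling run(["start", "finish"]) walks the machine from the assumed start state through UploadInProgress to UploadFinished, and an input with no matching transition raises an error instead of silently staying in place.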

Service State

Given a service S, there exists a set of service states that can occur during S’s lifetime.

The boundary between these service states is a logical one that is known and is of interest to a service user. Therefore the process of identifying it is left to the service user or a service modifier which will be defined in one of the succeeding paragraphs.

As an example, in a cloud storage service, a specific type of service user such as an engineer is interested in the UploadInProgress service state because they intend to monitor the upload performance of that service. Such a service state is bounded by the logic behind a “progressing file upload”, which is service-specific. In other words, the lifetime of the UploadInProgress service state is up to the engineer to define; however, there must be best practices in place to avoid highly overlapping service states.

Service State Internals

Obviously, a service state is just a label for a state and behind it is a collection of variables. An UploadInProgress state is just a label for a collection of variables and their values:

uploadStarted = 1
uploadDone = 0
uploadError = 0

These variables are called service variables and are defined in more detail below. The service variables above, for example, are enough to describe an UploadInProgress service state.

Definition: A service state St is a set of service variables that collectively identifies St.
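
One way to read this definition is that each service state has a "signature" of service variable values. The sketch below assumes that reading. Only the UploadInProgress signature comes from this write-up; the signatures for UploadFinished and UploadFailed, and the function name identify_state, are hypothetical.

```python
# Each service state is identified by the values of its service variables.
STATE_SIGNATURES = {
    "UploadInProgress": {"uploadStarted": 1, "uploadDone": 0, "uploadError": 0},
    # The two signatures below are assumptions made for illustration.
    "UploadFinished": {"uploadStarted": 1, "uploadDone": 1, "uploadError": 0},
    "UploadFailed": {"uploadStarted": 1, "uploadDone": 0, "uploadError": 1},
}


def identify_state(variables):
    """Return the label whose signature matches the variables, or None."""
    for label, signature in STATE_SIGNATURES.items():
        if all(variables.get(name) == value for name, value in signature.items()):
            return label
    return None
```

A snapshot that matches no signature returns None, which is a hint that either a state is undefined or the variables are not sufficient to identify it.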

Service Variables

Service variables make up a service state and because they are variables their values can change during their parent service state’s lifetime.

Definition: A service variable is a mutable attribute of a service state.

Definition: A service state’s lifetime is the time span from when it becomes active until it becomes inactive.

Definition: An active service state means it is happening at the moment.

Definition: An inactive service state is one that is not happening at the moment.
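
The three definitions above can be sketched as a small record. The class name StateLifetime and its fields are hypothetical, and timestamps are plain numbers (for example seconds) to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StateLifetime:
    name: str
    became_active_at: float
    became_inactive_at: Optional[float] = None  # None while the state is active

    @property
    def is_active(self) -> bool:
        # Active means the state is happening at the moment.
        return self.became_inactive_at is None

    def duration(self, now: float) -> float:
        """Length of the lifetime so far, or in total once the state is inactive."""
        end = now if self.is_active else self.became_inactive_at
        return end - self.became_active_at
```

While the state is active its lifetime keeps growing with the current time; once it becomes inactive the lifetime is fixed.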

Service Modifier

Definition: A service modifier is any service actor that can cause a service’s state to transition to another service state, or cause a change of value in at least one of the service variables within a service state.

In practice, service actors such as a business customer, an engineer such as an SRE engineer or a software engineer, another service that operates on a service, or the service acting on itself are all considered service modifiers because they can cause a service state to transition to another service state, or cause a change of value in at least one of the service variables in a service state.

Note: It is essential to use a generic service user such as a service modifier because any perception of a service, such as service reliability, depends on the type of service user. Instead of using the term “user”, we use the term “modifier” to imply that it is acting on a stateful service.

Observable, Partially Observable, Hidden

Definition: A service variable can be observable which means it can be observed all the time by a service modifier, partially observable if there are instances where it is impossible to observe, or hidden if it is impossible to observe it in all of its instances in a service state.
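
To make the three categories concrete, the sketch below classifies a service variable from a list of observation attempts, where None marks an attempt in which the value could not be observed. This is an approximation under that assumption: a finite sample can only suggest, not prove, that a variable is observable "all the time". The names Observability and classify are hypothetical.

```python
from enum import Enum


class Observability(Enum):
    OBSERVABLE = "observable"
    PARTIALLY_OBSERVABLE = "partially observable"
    HIDDEN = "hidden"


def classify(samples):
    """Classify a variable from observation attempts (None = not observed)."""
    if not samples:
        raise ValueError("need at least one observation attempt")
    observed = sum(1 for value in samples if value is not None)
    if observed == len(samples):
        return Observability.OBSERVABLE
    if observed == 0:
        return Observability.HIDDEN
    return Observability.PARTIALLY_OBSERVABLE
```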

Monitoring

Google has published definitions of Observability and Monitoring, and as you may know, these are two different concepts. However, these definitions do not draw a clear line between them. For example, the snippet “a technical solution that allows teams to actively debug their system” could describe a monitoring task as well.

To help achieve our initial goal of atomically defining service monitoring, we need to revise Google’s definition into something more practical and well-scoped. This is not to say that Google’s definition is incorrect; it’s just not useful for this write-up’s purpose.

Definition: Monitoring is an automation strategy for inferring and presenting service states given a collection of service variables.

It’s therefore an automation strategy with the following end goals.

Infer service states
Present service states

Definition: Observability is an automation strategy for exposing service variable values to enable monitoring of a service.

It’s therefore an automation strategy with the following end goals.

Expose service variable values
Enable service monitoring

We use the term “automation strategy” here because the goals identified must all be achieved through software automation. In contrast, inferring service states manually, for example by running complex system commands, would require a service modifier to determine which combinations of commands to run, parse their output into something consumable, and finally interpret that output as a service state.

Service monitoring and observability automate all these manual methods so that their service modifiers can focus more on their core intents and ultimately gain value from them.
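
The split between the two definitions can be sketched as follows: observability exposes service variable values, and monitoring infers and presents a service state from them. The UploadService class, its byte counters, the NotStarted label, and the function names are all hypothetical and only serve the cloud storage example.

```python
class UploadService:
    """Hypothetical upload whose internal fields are not read directly."""

    def __init__(self, total_bytes):
        self._total_bytes = total_bytes
        self._sent_bytes = 0
        self._failed = False

    def send(self, byte_count):
        self._sent_bytes = min(self._total_bytes, self._sent_bytes + byte_count)

    def fail(self):
        self._failed = True

    def expose(self):
        # Observability: expose service variable values.
        return {
            "uploadStarted": int(self._sent_bytes > 0),
            "uploadDone": int(self._sent_bytes == self._total_bytes),
            "uploadError": int(self._failed),
        }


def infer_state(variables):
    # Monitoring, goal 1: infer a service state from service variables.
    if variables["uploadError"]:
        return "UploadFailed"
    if variables["uploadDone"]:
        return "UploadFinished"
    if variables["uploadStarted"]:
        return "UploadInProgress"
    return "NotStarted"  # hypothetical label for "nothing sent yet"


def present(state):
    # Monitoring, goal 2: present the service state to a service modifier.
    return f"cloud-storage upload: {state}"
```

With this split, a service modifier never parses raw counters. It only sees the presented state, which is the manual work the two strategies are meant to automate.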

Using the cloud storage service example, a service modifier such as an engineer would need to know that the following service states happened, and that the struck-through one did not, in order to know that there were no errors during the file upload:

UploadInProgress
U̶p̶l̶o̶a̶d̶F̶a̶i̶l̶e̶d̶
UploadFinished
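
The check described by this list can be sketched as a function over the service states that monitoring has inferred, in the order they became active. The function name and the list-based input format are assumptions.

```python
def upload_had_no_errors(observed_states):
    """True if the upload progressed and finished without ever failing."""
    if "UploadFailed" in observed_states:
        return False
    try:
        started = observed_states.index("UploadInProgress")
        finished = observed_states.index("UploadFinished")
    except ValueError:
        # One of the two required states was never observed.
        return False
    return started < finished
```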

If the goals above aren’t guaranteed, then automation is likely not a central purpose of the monitoring or observability implementation. In that case, each service modifier can be subjective about the value it brings.

Think Service States

So far we have defined the fundamentals of service monitoring. At this stage, we can ground ourselves in them when designing and implementing a service monitoring strategy.

For instance, the task of monitoring something in a service is not as simple as “querying that thing”. It must mean that:

We are monitoring a defined service state.
We are monitoring a service state that has enough available service variables to identify it.
We are monitoring a service state that has no hidden service variables.

Additionally, the service state we are monitoring must be:

Accurately presented to a service modifier.
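
The first three conditions can be turned into a simple pre-flight check before building a monitor. This is a sketch only: the function name, its parameters, and the wording of the reported gaps are hypothetical, and it does not attempt to verify that a state is accurately presented.

```python
def monitoring_gaps(state, defined_states, required, exposed, hidden):
    """List the reasons a service state cannot be monitored yet."""
    gaps = []
    if state not in defined_states:
        gaps.append("state is not defined")
    # Required variables that are neither exposed nor known to be hidden.
    missing = set(required) - set(exposed) - set(hidden)
    if missing:
        gaps.append("variables not available: " + ", ".join(sorted(missing)))
    concealed = set(required) & set(hidden)
    if concealed:
        gaps.append("hidden variables: " + ", ".join(sorted(concealed)))
    return gaps
```

An empty list means the state is defined, all of its required service variables are available, and none of them is hidden.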

One important thing to notice is that observability is the bread and butter of monitoring. Therefore it must be reliable and sufficient in order for monitoring to be both cost-effective and cost-efficient.

Why Use Service Monitoring?

There are many specific reasons to use service monitoring; however, for this write-up, we are interested in its core use case. The core intents of monitoring a service can be summarized in the following statements.

We want to monitor a service because we want to:

Inspect its states (e.g. level 1, debugging)
React to interesting states (e.g. level 2, service alerting)
Analyze its states (e.g. level 3, offline or online service analytics)
Automate state analysis and reaction (e.g. level 4, smart monitoring)

As you can see, these service monitoring intents are generic and are arranged in order of increasing complexity. Complexity here means more layers of abstraction and guarantees. For example, level 4 needs guaranteed high accuracy of service states and their compound analyses; otherwise, it cannot be implemented.

Importance of State Presentation

The second goal of monitoring is as essential as its first goal. Presentation is not simply visualizing service states such as displaying fancy graphs. It is akin to designing and implementing exemplary software interfaces in that it achieves the following goals:

Encapsulate service variables from service modifiers
Present service states with service quality level guarantees
Provide a service state data platform

Thank you so much for reaching this part of this write-up, I hope you have learned something from it.

Service Alerting

Service alerting is a monitoring strategy that presents service states to a service modifier through an alert.

Definition: An alert is a message containing service state information that is sent to a service modifier through a notification system.

Service alerting must be as simple as its definition. Because it is a monitoring strategy, it inherits the goals and principles of monitoring we have discussed so far.
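
Taken literally, the definition of an alert needs only two parts: a message carrying service state information, and a notification system that delivers it to a service modifier. The sketch below assumes an in-memory notifier that records what it is asked to send; the Alert fields, the InMemoryNotifier class, and the recipient name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    state: str   # the service state being presented
    detail: str  # supporting service state information


class InMemoryNotifier:
    """Hypothetical notification system that only records sent alerts."""

    def __init__(self):
        self.sent = []

    def send(self, recipient, alert):
        self.sent.append((recipient, alert))


def alert_on_state(notifier, recipient, service, state, detail=""):
    """Build an alert for a service state and hand it to the notifier."""
    alert = Alert(service=service, state=state, detail=detail)
    notifier.send(recipient, alert)
    return alert
```

A real notification system would replace InMemoryNotifier, but the alert itself stays a plain message about a service state.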

Asynchronous and Synchronous Monitoring

Two fundamental methods of monitoring as it is used by service modifiers are:

Definition: Asynchronous monitoring is a method of monitoring where a service modifier has full control over consuming presented service states.

Definition: Synchronous monitoring is a method of monitoring where a service modifier has only partial control over consuming presented service states.

An example of asynchronous monitoring is when an SRE uses a monitoring system to look at service states without a predefined condition, for example because of a need to analyze a certain service state. In this case, the SRE has full control over when to monitor a service state. On the other hand, an example of synchronous monitoring is service alerting. In this case, the SRE needs to monitor a service state presented by an alert and react to it appropriately, and therefore does not have full control over the monitoring condition.
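
One possible way to picture the two methods is as pull versus push. In the sketch below, current() is called whenever the service modifier chooses (asynchronous monitoring), while subscribe() lets the feed decide when the modifier's callback runs (synchronous monitoring, as in alerting). The StateFeed class, its method names, and the Idle state are hypothetical.

```python
class StateFeed:
    """Hypothetical presentation layer for one service's current state."""

    def __init__(self, initial_state):
        self._state = initial_state
        self._subscribers = []

    def current(self):
        # Asynchronous monitoring: the modifier decides when to look.
        return self._state

    def subscribe(self, condition, callback):
        # Synchronous monitoring: the callback runs when the condition holds.
        self._subscribers.append((condition, callback))

    def update(self, new_state):
        self._state = new_state
        for condition, callback in self._subscribers:
            if condition(new_state):
                callback(new_state)
```

For example, an SRE could call feed.current() during an analysis session, and separately subscribe a handler for UploadFailed that fires only when that state appears.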

Purpose of Service Alerting

There may be various reasons for service alerting at the implementation level; however, it has only one fundamental purpose, which is to achieve the following goals.

To automate monitoring of service states using a predefined condition.
To react with a certain level of urgency based on the presence of a service state.

The second goal may sound opinionated but the logic behind it boils down to adhering to the KISS principle.

If a monitored service state doesn’t need to be reacted to, then it only requires asynchronous monitoring, or a monitoring view, which is much simpler to implement than an alert that requires the right monitoring condition.

Definition: A monitoring view is a curated presentation of a service state.

In practice, there is more to alerting than determining the right monitoring condition: there is also determining the urgency level of an alert and the actions that need to be taken for that alert.
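
The three ingredients named here (monitoring condition, urgency level, and actions) can be sketched as a single alert rule. The AlertRule class, the evaluate function, and the shape of the returned payload are assumptions, and the urgency labels and actions would be defined by each team.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class AlertRule:
    state: str                                  # service state the rule alerts on
    condition: Callable[[Dict[str, int]], bool]  # predefined monitoring condition
    urgency: str                                # e.g. "page" or "ticket" (assumed labels)
    actions: List[str]                          # what the service modifier should do


def evaluate(rule: AlertRule, variables: Dict[str, int]) -> Optional[dict]:
    """Return an alert payload if the rule's condition holds, else None."""
    if not rule.condition(variables):
        return None
    return {
        "state": rule.state,
        "urgency": rule.urgency,
        "actions": list(rule.actions),
    }
```

If a rule ends up with no meaningful urgency or actions, that suggests the state may only need a monitoring view rather than an alert.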

That is all for now! Again thank you for reading this write-up and I hope to write more of these in the future. I was going to write more about service alerting guides or facts, however, it might easily diverge from this write-up’s main topic.
