Monitoring vs observability -- is there even a difference and is your monitoring system observable?
Observability has gained a lot of popularity in recent years. Modern DevOps paradigms encourage building robust applications by incorporating automation, Infrastructure as Code, and agile development. To assess the health and "robustness" of IT systems, engineering teams typically use logs, metrics, and traces, which are used by various developer tools to facilitate observability. But what is observability exactly, and how does it differ from monitoring?
"Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs." --- Wikipedia
An observable system allows us to assess how the system works without interfering or even interacting with it. Simply by looking at the outputs of a system (such as logs, metrics, traces), we can assess how this system is performing.
One of the best explanations about monitoring and observability I've seen was provided in an online course, "Building Modern Python Applications on AWS", by Morgan Willis, a Senior Cloud Technologist at AWS.
"Monitoring is the act of collecting data. What types of data we collect, what we do with the data, and if that data is readily analyzed or available is a different story. This is where observability comes into play. Observability is not a verb, it's not something you do. Instead, observability is more of a property of a system." --- Morgan Willis
According to this explanation, tools such as CloudWatch or X-Ray can be viewed as monitoring or tracing tools. They allow us to collect logs and metrics about our system and send alerts about errors and incidents. Therefore, monitoring is an active part of collecting data that will help us assess the health of our system and how its different components work together. Once we establish monitoring that continuously collects logs, system outputs, metrics, and traces, our system becomes observable.
As a data engineer, I like to think of monitoring as the data ingestion part of ETL (extract, transform, load). Meaning, you gather data from multiple sources (logs, traces, metrics) and put them into a data lake. Once all this data is available, a skilled analyst can gain insights from that data and build beautiful dashboards that tell a story that this data conveys. That's the observability part --- gaining insights from the collected data. And observability platforms such as Dashbird play the role of a skilled analyst. They provide you with visualizations and insights about the health of your system.
Monitoring gets you information about your system and lets you know if there's a failure, while observability gives you an easy way to understand where and why that failure happened, and what caused it.
Monitoring is a prerequisite for observability. A system that we don't monitor is not observable.
The ultimate purpose of monitoring is to control a system's health by actively collecting error logs and system metrics and then leveraging those to alert about incidents. This means:
- tracking errors and alerting about them as soon as they happen,
- tracking metrics about CPU utilization or network traffic to later observe whether specific compute resources are healthy or not,
- reacting to outages and security incidents through alerting, alarms, and notifications.
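To make the alerting part concrete, here is a minimal sketch of a threshold check over collected metric datapoints. It is not how CloudWatch alarms are implemented. The `Datapoint` class, the 80% threshold, and the three-period rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Datapoint:
    """One collected metric sample (hypothetical structure)."""
    timestamp: int   # seconds since some reference point
    value: float     # e.g. CPU utilization in percent


def breaching_streak(points: Sequence[Datapoint], threshold: float) -> int:
    """Count how many of the most recent datapoints are above the threshold."""
    streak = 0
    for point in reversed(points):
        if point.value > threshold:
            streak += 1
        else:
            break
    return streak


def should_alert(points: Sequence[Datapoint],
                 threshold: float = 80.0,
                 periods: int = 3) -> bool:
    """Alert only when the last `periods` datapoints all breach the threshold."""
    return breaching_streak(points, threshold) >= periods


cpu = [Datapoint(0, 40.0), Datapoint(60, 85.0),
       Datapoint(120, 90.0), Datapoint(180, 95.0)]
print(should_alert(cpu))
```

Requiring several consecutive breaches, rather than a single one, is a common way to avoid alerting on short spikes.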
Even though monitoring is an active process, AWS takes care of that automatically when we use CloudWatch or X-Ray.
The purpose of observability is to use the system's outputs to gather insights and act on them. Examples:
- identify the percentage of errors across all function or container invocations,
- identify bottlenecks in microservices by observing traces that show latency between individual function calls and transition between components,
- identify patterns of when the errors or bottlenecks occur and use the insights to take action in order to prevent such scenarios in the future,
- measure and assess the performance of an entire application,
- identify *cold starts*,
- identify how much memory your application consumes,
- identify when and how long your code runs,
- identify the costs incurred per specific resource,
- identify outliers, e.g., a specific function invocation that took considerably longer than usual,
- identify how changes to one component affect other parts of the system,
- identify and troubleshoot the flow of traffic through our microservices,
- identify how the system performs over time: how many invocations of each function we see per day, per week, or per month, and how many of them are successful.
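A few of the insights above can be sketched in plain Python once invocation data has been collected. The `Invocation` record, the function name `resize-image`, and the "three times the median" outlier rule are assumptions for illustration, not a description of how any particular platform computes these numbers.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Sequence


@dataclass
class Invocation:
    """One function invocation as collected by monitoring (hypothetical)."""
    function: str
    duration_ms: float
    success: bool
    cold_start: bool = False


def error_rate(invocations: Sequence[Invocation]) -> float:
    """Percentage of invocations that failed."""
    if not invocations:
        return 0.0
    failed = sum(1 for inv in invocations if not inv.success)
    return 100.0 * failed / len(invocations)


def slow_outliers(invocations: Sequence[Invocation],
                  factor: float = 3.0) -> List[Invocation]:
    """Invocations that took more than `factor` times the median duration."""
    if not invocations:
        return []
    typical = median(inv.duration_ms for inv in invocations)
    return [inv for inv in invocations if inv.duration_ms > factor * typical]


def cold_start_count(invocations: Sequence[Invocation]) -> int:
    """Number of invocations flagged as cold starts."""
    return sum(1 for inv in invocations if inv.cold_start)


history = [
    Invocation("resize-image", 120.0, True, cold_start=True),
    Invocation("resize-image", 100.0, True),
    Invocation("resize-image", 110.0, False),
    Invocation("resize-image", 900.0, True),
]
print(error_rate(history), cold_start_count(history))
print([inv.duration_ms for inv in slow_outliers(history)])
```

The median is used instead of the mean so that a single very slow invocation does not shift the baseline it is compared against.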
Although serverless microservices offer a myriad of benefits in terms of decoupling, reducing dependencies between individual components, and overall faster development cycles, the biggest challenge is to ensure that all those small "moving parts" are working well together. It's highly impractical, if not impossible, to track all microservices by manually looking up the logs, metrics, and traces scattered across different cloud services.
When looking at AWS, you would have to go to CloudWatch to see the logs, find your Lambda function's log group, then find the logs you are really interested in. Then, to see the corresponding API traces, you would go either to X-Ray or to CloudTrail and again search across potentially hundreds of components to find the one you want to investigate. As you can see, finding and accessing the logs and traces of every single component is quite time-consuming. Additionally, debugging single parts doesn't give you the "big-picture" view of how those components work together.
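Part of that manual lookup can be scripted. As a rough sketch, the snippet below pulls duration, memory usage, and cold-start information out of the `REPORT` line that Lambda writes at the end of each invocation. The regular expression assumes the commonly seen field order and units; treat that layout as an assumption and check it against your own logs, since extra fields can appear. The request IDs in the example are made up.

```python
import re
from typing import Any, Dict, Optional

# Assumed layout of a Lambda REPORT line; fields are usually tab-separated.
REPORT_PATTERN = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>[\d.]+) ms\s+"
    r"Memory Size: (?P<memory_size_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<max_memory_used_mb>\d+) MB"
    r"(?:\s+Init Duration: (?P<init_ms>[\d.]+) ms)?"
)


def parse_report_line(line: str) -> Optional[Dict[str, Any]]:
    """Return the metrics from a REPORT line, or None for any other log line."""
    match = REPORT_PATTERN.search(line)
    if match is None:
        return None
    init = match.group("init_ms")
    return {
        "request_id": match.group("request_id"),
        "duration_ms": float(match.group("duration_ms")),
        "billed_ms": float(match.group("billed_ms")),
        "memory_size_mb": int(match.group("memory_size_mb")),
        "max_memory_used_mb": int(match.group("max_memory_used_mb")),
        # An Init Duration field is only present on a cold start.
        "cold_start": init is not None,
        "init_ms": float(init) if init is not None else None,
    }


cold_line = ("REPORT RequestId: abc-123\tDuration: 15.74 ms\t"
             "Billed Duration: 16 ms\tMemory Size: 128 MB\t"
             "Max Memory Used: 56 MB\tInit Duration: 130.49 ms")
print(parse_report_line(cold_line))
```

Even with a helper like this, you still have to fetch the lines from every log group and stitch the results together, which is exactly the effort an observability platform is meant to remove.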
To put it simply, you get observability in your application by combining monitoring with alerting and a debugging solution that makes your data easy to interpret. Missing just one of these aspects will leave you at a great disadvantage, chasing your tail trying to figure out what went wrong within your app. It's not enough to be notified every time something breaks down. Nor is it enough to know when something is about to break. You have to be able to pinpoint the issue within your platform efficiently.
With a growing architecture of microservices, we need an easier (automated) way to add observability to the serverless ecosystem.
Here's an example of a service we're all too familiar with: Twitter. As you might imagine, a product like Twitter has a lot of moving parts, and when something breaks down it can be difficult to understand why or what caused the problem. Imagine having 350 million active users who interact with each other through your system, tweeting, liking, DM-ing, retweeting, and so on. That's a lot of information to follow, and if you've ever worked on a platform this size, you can imagine the kind of effort it would take to figure out why a tweet isn't posted or a message takes too long to be delivered.
Before they made the switch from a monolithic application to a distributed system, finding out why something didn't work was, at times, as simple as opening an error log file and seeing what went wrong.
When you have hundreds, maybe thousands, of small services communicating asynchronously with each other, calling the debugging of a simple thing like a tweet not firing "hard" is a complete understatement. Twitter published a really cool post about their migration to microservices in 2013. Read the post here.
With distributed systems (read: microservices), especially at scale, having observability into your platform is more than a necessity; it's a requirement that can't be circumvented by using only alerting or by only looking at logs. You need an environment that provides visibility down to a microscopic level so that you have the right information to act upon.
Twitter's observability system is humongous and took years to develop into the well-oiled machine it is today.
"The Observability Engineering team at Twitter provides full-stack libraries and multiple services to our internal engineering teams to monitor service health, alert on issues, support root cause investigation by providing distributed systems call traces, and support diagnosis by creating a searchable index of aggregated application/system logs." -- Anthony Asta in Observability in Twitter part I
"Our time series metric ingestion service handles more than 2.8 billion write requests per minute, stores 4.5 petabytes of time series data, and handles 25,000 query requests per minute."
Anthony Asta described the scope of Twitter's observability systems in a two-part series, published in 2016, that covers architecture, metrics ingestion, the time series database, and indexing services. Check out part one and part two.
Understandably, not all businesses have the resources and time to build their own observability systems. With a 2-minute setup, you can sign up to Dashbird and add observability to your serverless AWS architecture immediately. Each serverless component in your AWS account, on which you enabled CloudWatch logs and X-Ray or CloudTrail traces, is automatically monitored with those tools. But it's not yet observable until you do something with this collected data.
The true benefit of Dashbird is that it doesn't require any code changes or any effort on your side. It simply uses the data that already exists, i.e., data for which you already enabled monitoring with AWS-native services designed for that purpose.
As a serverless observability platform, Dashbird allows you to accomplish all of the points addressed when discussing examples of insights gathered from an observable system:
- be notified about incidents, cold starts, and errors as they happen via custom alerting,
- observe the percentage of errors across all invocations and identify potential outliers,
- find out how much memory your application consumes, as well as when and how long your code runs,
- identify the costs incurred per specific resource,
- ...and so much more.
Dashbird project view --- image by the author
Monitoring tools allow you to collect application logs, metrics about resource utilization and network traffic, and traces of HTTP requests made to specific services. Observability, by contrast, is a property of a system whose collected data is analyzed and visualized, which allows you to improve your application lifecycle by gathering insights about the underlying system. Furthermore, observability in the serverless space is non-negotiable. You have to have it, and it's not a quantifiable attribute, meaning you can't have some observability or too much of it. You either do or don't.