Observability is a measure of how well one can understand the inner workings of a system by examining its outputs. It gives us much-needed insight into applications running in production. Observable applications are easier to troubleshoot, because one can clearly identify bottlenecks or the root cause of an issue. Observability matters more as systems grow in complexity: without it, monitoring large and complex applications would be a daunting task.
Observability is built on three pillars: metrics, logs, and traces. Together, these are known as telemetry data. The applications being monitored emit this data, which is then sent to an observability platform such as New Relic.
Metrics are numerical measurements of a particular aspect of a system's behaviour, such as CPU usage, memory usage, and request latency. Metrics are typically collected at regular intervals and aggregated over time.
Metrics indicate how well a system is performing. For example, request latency is a metric that measures how quickly a system responds to a request. Measuring it gives insight into how slow or fast the system is.
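To make the idea of collecting and aggregating a metric concrete, here is a minimal sketch of measuring request latency and summarising it. The function names measure_latency_ms and summarize are hypothetical and exist only for illustration, and the nearest-rank 95th percentile is one common choice of aggregate, not the only one. A real application would normally hand these measurements to a metrics library instead of computing them by hand.

```python
import math
import statistics
import time


def measure_latency_ms(handler, *args):
    """Call handler once and return how long the call took, in milliseconds."""
    start = time.perf_counter()
    handler(*args)
    return (time.perf_counter() - start) * 1000.0


def summarize(samples_ms):
    """Aggregate raw latency samples into a few summary metrics."""
    ordered = sorted(samples_ms)
    # Nearest-rank 95th percentile: the smallest sample that has at least
    # 95% of all samples at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "count": len(ordered),
        "mean_ms": statistics.fmean(ordered),
        "p95_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Stand-in for a real request handler: sum the first 1000 integers.
    samples = [measure_latency_ms(sum, range(1000)) for _ in range(100)]
    print(summarize(samples))
```

Reporting a high percentile alongside the mean is a deliberate choice here: a handful of very slow requests can hide behind a healthy-looking average.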
Logs are timestamped records of a wide variety of system activity. Logs are typically used for debugging but carry less contextual information than traces. Logs can contain a wide variety of information, one key piece of which is the log level. The log level indicates the severity of a log entry, with the lowest numerical level indicating an event that requires the most urgent attention. The severity levels defined by RFC 5424 are:
0 - Emergency: system is unusable
1 - Alert: action must be taken immediately
2 - Critical: critical conditions
3 - Error: error conditions
4 - Warning: warning conditions
5 - Notice: normal but significant condition
6 - Informational: informational messages
7 - Debug: debug-level messages
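The severity levels above can be sketched in code as an enumeration, together with a threshold filter. This is an illustrative sketch only: the names Severity, should_emit, and format_record are hypothetical and not part of any logging library. The point it shows is that, because lower numbers are more severe, "at least as severe as the threshold" translates to a less-than-or-equal comparison.

```python
from datetime import datetime, timezone
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels as numbered in RFC 5424 (lower is more severe)."""
    EMERGENCY = 0
    ALERT = 1
    CRITICAL = 2
    ERROR = 3
    WARNING = 4
    NOTICE = 5
    INFORMATIONAL = 6
    DEBUG = 7


def should_emit(severity, threshold):
    """Return True if the message is at least as severe as the threshold."""
    return severity <= threshold


def format_record(severity, message):
    """Build a timestamped log line that carries the severity name."""
    timestamp = datetime.now(timezone.utc).isoformat()
    return f"{timestamp} {severity.name} {message}"


if __name__ == "__main__":
    threshold = Severity.WARNING
    events = [
        (Severity.DEBUG, "cache warmed"),
        (Severity.ERROR, "payment service unreachable"),
    ]
    for severity, message in events:
        # Only the ERROR event passes a WARNING threshold.
        if should_emit(severity, threshold):
            print(format_record(severity, message))
```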
A trace is a collection of spans that represent a complete request or transaction. Traces track the flow of a request through a distributed system, such as a web application or a microservices architecture. A span is a unit of work or operation within a trace. Spans are used to represent the individual operations that make up a trace, such as a web request, a database query or a service call.
Traces are mainly used for debugging and troubleshooting. They are the most detailed of the three pillars, offering an in-depth view into an application.
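The relationship between a trace and its spans can be sketched with a small in-memory model. Everything here (the Span and Trace classes, their fields, and the span names) is a hypothetical illustration and not the API of any tracing library. It assumes a single-threaded request where nested operations become child spans of whichever span is currently open.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One unit of work within a trace."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    start: float
    end: Optional[float] = None

    @property
    def duration_ms(self) -> Optional[float]:
        """Elapsed time in milliseconds, or None while the span is open."""
        if self.end is None:
            return None
        return (self.end - self.start) * 1000.0


class Trace:
    """A collection of spans that share one trace ID."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: List[Span] = []
        self._open: List[Span] = []  # stack of spans currently in progress

    @contextmanager
    def span(self, name: str):
        # The innermost open span, if any, becomes the parent.
        parent_id = self._open[-1].span_id if self._open else None
        current = Span(name, self.trace_id, uuid.uuid4().hex[:16],
                       parent_id, time.perf_counter())
        self.spans.append(current)
        self._open.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._open.pop()


if __name__ == "__main__":
    trace = Trace()
    with trace.span("web request"):
        with trace.span("database query"):
            time.sleep(0.01)
        with trace.span("service call"):
            time.sleep(0.01)
    for s in trace.spans:
        print(s.name, s.parent_id, s.duration_ms)
```

The parent IDs are what let a tracing backend rebuild the tree of operations for a request and show where the time went.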
Observability has many benefits for DevOps teams and SREs, including:
Improved debugging and troubleshooting: Observability can help teams to quickly and easily identify the root cause of problems in production. This can lead to faster resolutions and less downtime.
Improved performance and reliability: Observability can help teams to identify and fix performance bottlenecks and reliability issues. This can lead to a better user experience and reduced costs.
Reduced risk of outages and incidents: Observability can help teams to identify potential problems before they cause outages or incidents. This can help to reduce the risk of service disruptions and data loss.
Observability is a rapidly evolving field, with new tools and technologies emerging all the time. One of the most exciting trends is the OpenTelemetry project, which seeks to provide a uniform way of creating and managing telemetry data. The project is a game changer, as it eases the setup of metrics, logs, and traces through automatic instrumentation.