OpenTelemetry 101: What Is Tracing?

#observability #devops #opentelemetry

OpenTelemetry is an open-source observability framework for generating, capturing, and collecting telemetry data for cloud-native software. In my previous post, I covered what observability means for software, and talked about the three types of telemetry data that comprise observability signals -- tracing, metrics, and logs. Today, I’ll be taking a deeper look at the first of these, tracing.

When we refer to tracing in OpenTelemetry, we’re generally referring to distributed tracing (or distributed request tracing). Traditionally, tracing is a low-level practice used to profile and analyze application code by developers through a combination of specialized debugging tools (such as dtrace on Linux or ETW on Windows) and programming techniques. By contrast, distributed tracing is an application of these techniques to modern, microservice-based architectures.

Microservices introduce significant challenges to tracing a request through an application, thanks to the distributed nature of microservices deployments. Consider a traditional monolithic application: since your code is centralized onto a single host, diagnosing a failure can be as simple as following a stack trace. When your application consists of tens, hundreds, or thousands of services running across many hosts, you can’t rely on a single stack trace -- you need something that represents the entire request as it moves from service to service, component to component. Distributed tracing solves this problem, providing powerful capabilities such as anomaly detection, distributed profiling, workload modeling, and diagnosis of steady-state problems.

Much of the terminology and mental models that we use to describe distributed tracing can trace their origin to systems such as Magpie, X-Trace, and Dapper. Dapper, particularly, has been highly influential to modern distributed tracing, and many of the mental models and terminology that OpenTelemetry uses can trace their origin to there. The goal of these distributed tracing systems was to profile requests that moved across processes and hosts, and generate data about those requests suitable for analysis.

The above diagram represents a sample trace. A trace is a collection of linked spans, which are named and timed operations that represent a unit of work in the request. A span that isn’t the child of any other span is the parent span, or root span, of the trace. The root span, typically, describes the end-to-end latency of the entire trace, with child spans representing sub-operations.

To put this in more concrete terms, let’s consider the request flow of a system that you might encounter in the real world, such as a ride sharing app. When a user requests a ride, multiple actions begin to take place -- information is passed between services in order to authenticate and authorize the user, validate their payment information, locate nearby drivers, and dispatch one of them to pick up the rider. A simplified diagram of this system, and a trace of a request through it, appears in the following figure. As you can see, each operation generates a span to represent the work being done during its execution. These spans have implicit relationships (parent-child) both from the beginning of the entire request at the client, but also from individual services in the trace. Traces are composable in this way, where a valid trace is comprised of valid sub-traces.

Each span in OpenTelemetry encapsulates several pieces of information, such as the name of the operation it represents, a start and end timestamp, events and attributes that occurred during the span, links to other spans, and the status of the operation. In the above diagram, the dashed lines connecting spans represent the context of the trace. The context (or trace context) contains several pieces of information that can be passed between functions inside a process or between processes over an RPC. In addition to the span context, identifiers that represent the parent trace and span, the context can contain other information about the process or request, like custom labels.

One important feature of spans, as mentioned before, is that they encapsulate other information. Much of this information is required -- the operation name, the start and stop timestamps, for example -- but some is optional. OpenTelemetry offers two data types, Attribute and Event which are incredibly valuable as they help to contextualize what happened during the execution measured by a single span. Attributes (known as tags in OpenTracing) are key-value pairs that can be freely added to a span to help in analysis of the trace data. You can think of attributes as data that you’d like to eventually aggregate or use to filter your trace data, like a customer identifier, process hostname, or anything else you can imagine. Events (known as logs in OpenTracing) are time-stamped strings that can be attached to a span, with an optional set of Attributes that further describe it. OpenTelemetry additionally provides a set of semantic conventions of reserved attributes and events for operation or protocol specific information.

Spans in OpenTelemetry are generated by the Tracer, an object that tracks the currently active span and allows you to create (or activate) new spans. Tracers are configured with Propagator objects that support transferring the context across process boundaries. The exact mechanism of creating and registering a tracer is dependent on your implementation and language, but you can generally expect there to be a global Tracer capable of providing a default tracer for your spans, and/or a Tracer provider capable of granting access to the tracer for your component. As spans are created and completed, the tracer dispatches them to the OpenTelemetry SDK’s Exporter, which is responsible for sending your spans to a backend system for analysis.

To recap, let’s summarize:

A span is the basic building block of a trace. A trace is a collection of linked spans.
Spans are objects that represent a unit of work, which is a named operation such as the execution of a microservice or a function call.
A parentless span is known as the root span or parent span of a trace.
Spans contain attributes and events, which describe and contextualize the work being done under a span.
A tracer is used to create and manage spans inside a process, and across process boundaries through propagators.

In my next post in this series, I plan to discuss the OpenTelemetry metrics data source, and how it interacts with the traces. Stay tuned!