I recently had a tough time adding distributed tracing to our services using OpenTelemetry. It doesn't sound hard, but lots of deprecated packages, changed solutions, and out-of-date documentation led me to write this article so other engineers don't fall into the same holes.
I will start with a short definition of tracing, then walk through the solution that worked and point out the holes along the way. This is my own experience (a TechStory), not a tutorial.
What is Tracing
Tracing is the process of capturing and recording information about the execution of software. Traces come along with metrics and logs, and together they give us better #observability over our services.
Traces tell us what happened, when, where, and how long it took.
What is distributed tracing
In the microservices world, it's hard to see the request flow and the integration between services. Distributed tracing is a set of techniques that helps us track a user's request across multiple distributed services.
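To make the idea concrete, here is a minimal stdlib-only sketch of what a trace looks like as data. The `Span` type and the `NewRoot`/`NewChild` helpers are hypothetical names I made up for illustration, not OpenTelemetry's API; the only point is that every span in one request flow shares a trace ID and points back to its parent.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// Span is a hypothetical, simplified record of one unit of work.
type Span struct {
	TraceID  string        // shared by every span in the same request flow
	SpanID   string        // unique per span
	ParentID string        // empty for the root span
	Service  string        // where it happened
	Name     string        // what happened
	Start    time.Time     // when it happened
	Duration time.Duration // how long it took
}

// newID returns n random bytes encoded as a hex string.
func newID(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// NewRoot starts a new trace with a fresh trace ID.
func NewRoot(service, name string) Span {
	return Span{TraceID: newID(16), SpanID: newID(8), Service: service, Name: name, Start: time.Now()}
}

// NewChild continues the parent's trace, possibly in another service.
func NewChild(parent Span, service, name string) Span {
	return Span{
		TraceID:  parent.TraceID,
		SpanID:   newID(8),
		ParentID: parent.SpanID,
		Service:  service,
		Name:     name,
		Start:    time.Now(),
	}
}

func main() {
	root := NewRoot("gateway", "GET /orders")
	child := NewChild(root, "orders", "load orders")
	child.Duration = time.Since(child.Start)
	root.Duration = time.Since(root.Start)
	for _, s := range []Span{root, child} {
		fmt.Printf("trace=%s span=%s parent=%q %s: %s took %s\n",
			s.TraceID, s.SpanID, s.ParentID, s.Service, s.Name, s.Duration)
	}
}
```

Both printed lines carry the same trace ID, which is what lets a UI stitch the two services into one request flow.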
What tools can we use
Tracing and distributed tracing come with a bunch of dependencies (so they may not be a good choice for small code bases with few users).
How you do it, and how many dependencies you need, varies with the tools you use and the architecture and design you follow. There might be:
- Agent (a network daemon that listens for spans sent over UDP)
- Collector (receives traces from the SDKs or Jaeger agents, runs them through a processing pipeline for validation and clean-up/enrichment, and stores them in a storage backend)
- Trace DB (for example, Elasticsearch)
- SDK and Libraries (OpenTracing, OpenTelemetry, agent-specific libraries, etc).
- UI
We won't go through all of them in this article. We use:
- OpenTelemetry
- Jaeger all-in-one Docker image (or a managed Jaeger in the cloud)
Basic Setup
To set up the OpenTelemetry SDK (a handful of function calls), you need to give it a tracer provider whose exporter points at your backend (Jaeger, in our case).
To initialize the provider, you need to write code like this:
// Here trace is go.opentelemetry.io/otel/sdk/trace and resource is go.opentelemetry.io/otel/sdk/resource.
tp := trace.NewTracerProvider(
	// The sampler determines how many requests to trace.
	trace.WithSampler(trace.TraceIDRatioBased(cfg.SamplerParam)),
	// Always be sure to batch in production.
	trace.WithBatcher(exporter),
	// Record information about this application in a Resource.
	trace.WithResource(resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceNameKey.String(cfg.Name),
	)),
)
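A side note on that sampler comment: a ratio-based sampler usually decides from the trace ID itself, so every service that sees the same trace ID makes the same decision. The sketch below is my own simplified, stdlib-only illustration of that idea; `shouldSample` is a hypothetical helper and should not be read as the SDK's exact implementation.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shouldSample is a hypothetical, simplified ratio sampler.
// It treats the low 8 bytes of the trace ID as a 63-bit number and
// keeps the trace when that number falls below ratio * 2^63.
func shouldSample(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true // trace everything
	}
	if ratio <= 0 {
		return false // trace nothing
	}
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	upperBound := uint64(ratio * (1 << 63))
	return x < upperBound
}

func main() {
	var low, high [16]byte
	low[15] = 1 // numerically tiny trace ID
	for i := 8; i < 16; i++ {
		high[i] = 0xff // numerically huge trace ID
	}
	// The decision depends only on the ID and the ratio, never on the service.
	fmt.Println(shouldSample(low, 0.5), shouldSample(high, 0.5)) // prints "true false"
}
```

Because the decision is a pure function of the trace ID, a trace is either kept or dropped as a whole instead of losing random spans in the middle.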
Then, if you run go mod tidy on the provider code above, you may face this error:
go.opentelemetry.io/otel/semconv: module go.opentelemetry.io/otel@latest found (v1.13.0), but does not contain package go.opentelemetry.io/otel/semconv
I wrote an explanation and a solution for that case here.
TL;DR: old, deprecated Jaeger packages cause this error, and there is more than one of them.
There were Jaeger client libraries for OpenTracing, but they are deprecated, and so is OpenTracing itself.
If you want to use OpenTelemetry Jaeger libraries instead:
https://pkg.go.dev/go.opentelemetry.io/otel/exporters/trace/jaeger
https://pkg.go.dev/go.opentelemetry.io/otel/exporters/jaeger
These are also deprecated!
So what should we do? 😰
In fact, all Jaeger libraries are deprecated, and you should use OTLP instead nowadays (OTLP is not deprecated at the time of writing, but the world has gone wild, so I can't guarantee tomorrow 😂).
So what is this OTLP thing?
In simple terms, OTLP (the OpenTelemetry Protocol) is a protocol designed to standardize how telemetry data is transferred between different components. It spares the community the headache of dealing with a vast range of tool-specific formats by offering a common standard instead.
For more information, I recommend you read their design goals.
If you use the Jaeger all-in-one image (jaegertracing/all-in-one), be aware that versions >= 1.35 can receive OTLP directly, so you don't need to run a separate OpenTelemetry Collector container (this official documentation confused me about it).
You also need to expose some ports and set some environment variables. I wrote a docker-compose file here that shows them all.
But if you have mismatched versions, you may get these errors:
traces export: failed to send to http://localhost:14268/api/traces: 400 Bad Request
traces export: failed to send to http://localhost:4318/api/traces: 404 Not Found
traces export: Post "http://localhost:4318/api/traces": read tcp [::1]:60409->[::1]:4318: read: connection reset by peer
traces export: Post "http://localhost:4318/v1/traces": dial tcp [::1]:4318: connect: connection refused
Simply put, you can run an all-in-one version >= 1.35, expose port 4318, and send your traces to http://localhost:4318/v1/traces, which is the default OTLP/HTTP endpoint. (The Jaeger agent used to listen on UDP port 6831, and the old Jaeger collector endpoint was :14268/api/traces, but you don't need those anymore.)
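Since most of the errors above came down to a wrong port or path, a small sanity check would have saved me some time. The following is a stdlib-only sketch; `checkOTLPEndpoint` is a hypothetical helper of mine (not part of any OpenTelemetry package), and it assumes you use the default OTLP/HTTP path rather than a custom one behind a proxy.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// checkOTLPEndpoint is a hypothetical helper that flags the legacy Jaeger
// addresses mentioned in this article before any trace is exported.
// Assumption: the default OTLP/HTTP traces path (/v1/traces) is in use.
func checkOTLPEndpoint(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("invalid URL: %w", err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return errors.New("OTLP/HTTP needs an http or https URL")
	}
	switch u.Port() {
	case "14268":
		return errors.New("port 14268 is the legacy Jaeger collector HTTP port, not OTLP")
	case "6831", "6832":
		return errors.New("ports 6831 and 6832 are legacy Jaeger agent UDP ports, not OTLP")
	}
	if strings.HasSuffix(u.Path, "/api/traces") {
		return errors.New("/api/traces is the legacy Jaeger path, OTLP/HTTP uses /v1/traces")
	}
	if u.Path != "/v1/traces" {
		return fmt.Errorf("unexpected path %q, the default OTLP/HTTP traces path is /v1/traces", u.Path)
	}
	return nil
}

func main() {
	for _, endpoint := range []string{
		"http://localhost:14268/api/traces",
		"http://localhost:4318/api/traces",
		"http://localhost:4318/v1/traces",
	} {
		fmt.Printf("%s -> %v\n", endpoint, checkOTLPEndpoint(endpoint))
	}
}
```

Only the last URL passes. The check says nothing about whether something is actually listening there, so a "connection refused" still means the container or port mapping needs a look.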
Most of these issues happened on localhost. We also had a managed Jaeger for production on our private cloud, and there I got a new error like this:
traces export: context deadline exceeded: retry-able request failure
It happened because I was sending the traces to the Jaeger agent's address in production, when I should have sent them to the collector. If you face the same error, double-check your address.
Propagation
Another hard topic was propagation. What is it?
Propagated traces keep the same identity across multiple services, which lets us see the request flow through a distributed system. This works by attaching some trace-specific headers to your request, and it applies to different scenarios, whether it's a simple HTTP request, a Kafka event, etc.
There are two important notes here:
- Each service should support tracing and have it enabled (services that don't support tracing will not show up in the flow).
- OpenTelemetry supports propagation generically: you choose the header format (the propagator), and it attaches those headers to the request for you.
To implement it, you should first set up the propagator like this:
import (
	jaegerPropagator "go.opentelemetry.io/contrib/propagators/jaeger"
	"go.opentelemetry.io/otel"
)
...
otel.SetTextMapPropagator(jaegerPropagator.Jaeger{})
Then you can Inject the context into the request headers when you send a request, and Extract it on the receiver side. When a service is both a receiver and a sender, it may implement both Inject and Extract.
Inject:
// ctx is the context that carries the trace data.
// propagation is go.opentelemetry.io/otel/propagation.
otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(request.Header))
Extract (happening in a middleware in this case):
ctx := otel.GetTextMapPropagator().Extract(
	request.Context(),
	propagation.HeaderCarrier(request.Header),
)
// Start a span that continues the extracted trace, and end it when the handler returns.
ctx, span := otel.Tracer("middleware").Start(ctx, "middleware")
defer span.End()
Summary
We talked about the challenges I faced while setting up tracing, but the same kinds of challenges apply to other scenarios (even other languages): #mismatched_version, #wrong_addresses, #deprecated_pkgs, etc.
I hope you enjoyed this article.