Introduction
OpenTelemetry is now mature enough to use in production projects, and most cloud providers and telemetry services have integrated it into their products.
In this article I'll briefly describe what OpenTelemetry is, how it can help you build better products and systems, and why you should consider using it now.
In the second half of the article I'll describe how to set up OpenTelemetry in JavaScript applications with the OpenTelemetry Collector and some backends like Zipkin, Prometheus and Grafana.
What is OpenTelemetry?
In a software context, telemetry means the metrics, logs, events and traces generated by an application or a whole distributed system while it runs.
This data is used to improve our applications. Product managers, DevOps engineers and developers can monitor a full distributed system from a customer's perspective. We can detect issues in code early and alert on them. We can find the sources of problems quickly despite the complexity of modern systems.
Without telemetry data, finding the root cause of an issue in Service3 below could be very difficult and time-consuming.
With telemetry available you can correlate calls between services with any logs the developers added. You can use those almost like a call stack to debug your problem in a distributed system.
There have been products and services that do this in the market for decades, but up until now there wasn't a standard, so you typically had to instrument your application and systems with proprietary libraries from the service providers.
The transmission of telemetry data often used custom propagation patterns and data models. These were incompatible with other providers, so it was difficult to build universal tooling for working with telemetry data.
OpenTelemetry standardises how you instrument your system by providing vendor-neutral instrumentation libraries with common terminology and usage.
OpenTelemetry standardises context propagation and gives you vendor-neutral infrastructure to collect telemetry data and forward it on to any provider your organisation supports now or in the future, with no changes to the code in your system!
Why is OpenTelemetry worth investigating now?
OpenTelemetry is a massive project from the Cloud Native Computing Foundation (CNCF) with many components.
CNCF projects are given a status based on their level of industry adoption in the “Crossing the Chasm” chart.
OpenTelemetry is currently in the incubating stage. This is because some SDKs are still being perfected before entering the release candidate stage.
OpenTelemetry uses "Stable" to describe APIs that are fixed and essentially production ready. As of writing this post, tracing is stable, metrics is in the release candidate stage and logging is in the draft stage.
The OpenTelemetry observability framework is currently being adopted by early adopters in production software. All of the major telemetry producers and services are adopting the standard for their offerings. Many already have full support for tracing.
For vendors that don't support OpenTelemetry yet, there are often collectors or exporters available to convert OpenTelemetry data to the vendor's proprietary format until they have implemented support.
The following vendors have excellent support for OpenTelemetry data generation and collection today.
- Honeycomb
- Datadog
- New Relic
- Azure App Insights
- AWS X-Ray (ADOT)
- GCP Cloud Monitoring and Cloud Trace
How OpenTelemetry helps you build better systems
OpenTelemetry standardises the components needed for observability in a software system. These components are typically:
- Instrumentation
- Propagation
- Collection
- Backends
This diagram is a messy hodge-podge! Each colour-coded section is described in detail below, so read on for it to make more sense.
Instrumentation
Automatic instrumentation is already provided by OpenTelemetry libraries that are available for most major software development languages and frameworks like .NET, Java, Go and JavaScript/Node.js. Depending on the framework's standard library there is significant auto-instrumentation. Most frameworks have their HTTP and gRPC clients automatically instrumented, for example.
There is also wide support for common libraries used in each ecosystem. For a Postgres database, for example, the Npgsql library on .NET and the pg library on Node.js both have an existing OpenTelemetry instrumentation library you can just plug in to your code.
If you're using a service mesh sidecar like Dapr, Istio or Linkerd, they usually support an OpenTelemetry exporter, or an OpenCensus exporter in the case of Linkerd. The OpenTelemetry Collector can accept these formats for you.
The instrumentation libraries are extremely "pluggable". They are designed so you can inject the functionality you need and gradually move your existing code to OpenTelemetry.
The pluggable components include instrumentation trace "exporters" for popular formats like Jaeger, Zipkin and OpenTelemetry itself. For example, if you have been using a Jaeger backend, you can keep it and change your instrumentation to OpenTelemetry gradually, as the sketch below shows.
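Here is a minimal sketch of that idea, assuming the standard OpenTelemetry JavaScript packages installed later in this article. An existing Zipkin backend keeps receiving traces while an OTLP exporter to a collector is added alongside it, with no changes to the instrumented application code. The URLs are illustrative.
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base'
import { ZipkinExporter } from '@opentelemetry/exporter-zipkin'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'

const provider = new NodeTracerProvider()

// keep sending traces to the existing Zipkin backend...
provider.addSpanProcessor(
    new BatchSpanProcessor(
        new ZipkinExporter({ url: 'http://zipkin:9411/api/v2/spans' })
    )
)

// ...while also exporting OTLP to a collector. Swapping or adding a backend is a
// change here only - the instrumentation calls in your application stay the same.
provider.addSpanProcessor(
    new BatchSpanProcessor(
        new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' })
    )
)

provider.register()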
Propagation
Propagation is the method by which you trace a request around your stateless systems. Because they are stateless you have to include the trace context identifier in the request somehow.
Because the most popular message transport mechanisms these days are HTTP based, the preferred propagation mechanisms are based around an HTTP capability: headers.
The proprietary observability systems all use different headers at the moment. For example, AWS X-Ray uses X-Amzn-Trace-Id and Dynatrace uses x-dynatrace. OpenTelemetry supports a few different standards like B3, but it has standardised on the W3C propagation standard.
The W3C propagation standard was created in conjunction with all the major cloud providers and will be the recommended standard in the future. It uses the HTTP headers traceparent and tracestate to propagate the trace context in your requests.
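As an illustration, the values below are made up but follow the W3C Trace Context format. With the W3CTraceContextPropagator registered (as in the setup code later in this article), instrumented HTTP clients inject these headers automatically; the manual injection shown here is only needed if you are writing your own transport.
// traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
//              version-traceId(32 hex)-parentSpanId(16 hex)-flags(01 = sampled)
// tracestate:  somevendor=opaque-vendor-specific-data
import { context, propagation } from '@opentelemetry/api'

const headers: Record<string, string> = {}

// inject the active trace context into a plain carrier object (e.g. outgoing
// request headers) using whichever propagator has been registered
propagation.inject(context.active(), headers)

// only populated when there is an active span and a registered propagator
console.log(headers.traceparent)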
The standardisation of these headers means the large cloud vendors can write cloud middleware that automatically supports propagating them. This increases the depth of your tracing, especially in modern cloud-native systems.
Because the headers are now standard, a cloud-native service that originates a request (a timed trigger for a serverless function, for example) can initiate tracing for you, right from the origin of the request.
OpenTelemetry Collection
The OpenTelemetry Collector is a service you run in your system. It receives telemetry data in many different formats, processes it and then exports it to many possible destinations (backends).
The collector service is not required if you only use one backend and that backend is supported by all of your instrumentation sources. But it's highly recommended because it can handle retries, batching, encryption and removing sensitive data.
The collector is designed with a "pluggable" architecture, so you can create a telemetry handler that exactly matches the needs of your system. The components are:
- Receivers - you can receive from multiple sources, e.g. the OTLP format, the AWS X-Ray format (e.g. the Lambda decorators), or directly from Kafka
- Processors - filtering sensitive data, dropping unwanted traces, batching sends to exporters, modifying traces
- Exporters - you can export to multiple destinations - X-Ray, Honeycomb, Zipkin, Jaeger. These are not to be confused with instrumentation trace exporters.
The flexibility of the collector means you can collect from your existing instrumented systems while adding support for OpenTelemetry systems. You can export to your existing backend while adding support for others if you want to try them out. This is all configured easily via YAML.
The collector supports modern transport standards between instrumentation trace exporters and collector receivers, like protobuf over gRPC or HTTP. Support for JSON over HTTP is experimental.
How to integrate OpenTelemetry in a NestJS application
The best way to learn is to set up an application with OpenTelemetry, so let's do that!
The code for this example is available in this repository: https://github.com/darraghoriordan/dapr-inventory
See packages/products-api for the Node.js example.
Install OpenTelemetry dependencies
yarn add @opentelemetry/api @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http @opentelemetry/exporter-zipkin @opentelemetry/sdk-node @opentelemetry/semantic-conventions
// or
npm i @opentelemetry/api @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http @opentelemetry/exporter-zipkin @opentelemetry/sdk-node @opentelemetry/semantic-conventions
Create the OpenTelemetry configuration
Create a new module to hold the configuration. We configure many of the instrumentation components discussed so far. There are inline comments in the code that should be helpful.
The important thing here is getNodeAutoInstrumentations. This automatically detects and instruments many popular Node libraries by monkey patching them. If performance or package size is a concern (e.g. for Lambda functions) then you might only include the instrumentations you actually need, as in the short sketch that follows.
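For example, a minimal sketch of registering only specific instrumentations instead of the full bundle. Note that the HTTP and pg instrumentation packages shown here are published separately on npm and are not in the install command above.
import { registerInstrumentations } from '@opentelemetry/instrumentation'
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http'
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg'

// register only the instrumentations you need rather than getNodeAutoInstrumentations()
registerInstrumentations({
    instrumentations: [new HttpInstrumentation(), new PgInstrumentation()],
})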
The full list of auto-instrumentations is on the npm package page: https://www.npmjs.com/package/@opentelemetry/auto-instrumentations-node
import { W3CTraceContextPropagator } from '@opentelemetry/core'
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base'
import { registerInstrumentations } from '@opentelemetry/instrumentation'
import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api'
import { Resource } from '@opentelemetry/resources'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
// Set an internal logger for open telemetry to report any issues to your console/stdout
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.WARN)
export const initTelemetry = (config: {
    appName: string
    telemetryUrl: string
}): void => {
    // create an exporter that sends traces to the OpenTelemetry collector.
    // We run the collector instance locally using docker compose.
    const exporter = new OTLPTraceExporter({
        url: config.telemetryUrl, // e.g. "http://otel-collector:4318/v1/traces",
    })
    // We add some common metadata to every trace. The service name is important.
    const resource = Resource.default().merge(
        new Resource({
            [SemanticResourceAttributes.SERVICE_NAME]: config.appName,
            application: config.appName,
        })
    )
    // We use the node tracer provider provided by OpenTelemetry
    const provider = new NodeTracerProvider({ resource })
    // The batch span processor is more efficient than the simple span processor.
    // It batches sends to the exporter you have configured.
    provider.addSpanProcessor(new BatchSpanProcessor(exporter))
    // Register the provider and initialize the propagator
    provider.register({
        propagator: new W3CTraceContextPropagator(),
    })
    // Registering instrumentations / plugins
    registerInstrumentations({
        instrumentations: getNodeAutoInstrumentations(),
    })
}
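One optional addition that is not in the repository code: because the BatchSpanProcessor buffers spans, you may want to flush and shut down the provider when the process exits so the final batch isn't lost. A minimal sketch, assuming you return or export the provider from initTelemetry so it is in scope here:
// hypothetical helper - assumes `provider` is the NodeTracerProvider created above
const shutdownTelemetry = async (): Promise<void> => {
    try {
        // shutdown() flushes any buffered spans to the exporter before resolving
        await provider.shutdown()
    } catch (error) {
        console.error('Error shutting down telemetry', error)
    }
}

process.on('SIGTERM', () => {
    void shutdownTelemetry().finally(() => process.exit(0))
})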
Run the telemetry configuration on NestJS startup
Note that it is vital you run the OpenTelemetry configuration before anything else bootstraps in your NestJS application.
You must run the initialisation before even importing any NestJS libraries in your bootstrapping method.
import { initTelemetry } from './core-logger/OpenTelemetry'

// ----- this has to come before the NestJS imports! -------
initTelemetry({
    appName: process.env.OPEN_TELEMETRY_APP_NAME || '',
    telemetryUrl: process.env.OPEN_TELEMETRY_URL || '',
})
console.log('initialised telemetry')
// -------------

// follow with your NestJS imports and bootstrapping....
import { ClassSerializerInterceptor, ValidationPipe } from '@nestjs/common'
import { NestFactory, Reflector } from '@nestjs/core'
// ... etc
const app = await NestFactory.create(MainModule, { cors: true })
// ... etc
Custom instrumentation
That’s it for instrumenting our app. Super simple.
If you need to, you can create custom trace spans.
Here is an example of a method with a custom span.
// requires: import * as opentelemetry from "@opentelemetry/api"
async getAllProductsScan(): Promise<ProductDto[]> {
    // get a tracer
    const tracer = opentelemetry.trace.getTracer("basic");
    // create a span
    const span = tracer.startSpan("getAllProductsScan");
    // do some work
    const products = await this.client.send(
        new ScanCommand({TableName: "products"})
    );
    // add some metadata to the span
    span.setAttribute("thisAttribute", "this is a value set manually");
    span.addEvent("got the data from store", {
        ["manualEventAttribute"]: "this is a value",
    });
    const mappedProducts = (products.Items || []).map((i) => {
        return this.mapOne(i);
    });
    // finalise the span
    span.end();
    return mappedProducts;
}
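If the work inside the span can throw, a variant using startActiveSpan from the OpenTelemetry API is worth considering: it makes the span the active context for anything called inside the callback, and you can record failures before ending it. A minimal sketch; withSpan is a hypothetical helper name, not something from the repository.
import { trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('basic')

// hypothetical helper: run `work` inside a named active span and always end the span
export const withSpan = async <T>(name: string, work: () => Promise<T>): Promise<T> =>
    tracer.startActiveSpan(name, async (span) => {
        try {
            return await work()
        } catch (error) {
            // record the failure so it shows up against this span in your backend
            span.recordException(error as Error)
            span.setStatus({ code: SpanStatusCode.ERROR })
            throw error
        } finally {
            span.end()
        }
    })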
Configuring the OpenTelemetry Collector
The OpenTelemetry Collector is run as a docker container with a configuration file attached as a volume. There are some notable configuration points worth mentioning here.
You have to set CORS origins for anything that connects to the collector over HTTP and requires them. I use the wildcard, but this would be bad practice on a production system unless you're in a completely isolated environment where external calls pass through a gateway and/or firewall. Even then, you should consider setting the origins correctly.
The Zipkin exporter actually pushes traces to Zipkin; it literally exports. But the Prometheus exporter is an endpoint on the collector that the Prometheus server will poll for data.
These two paradigms are quite different! It's important to understand what "exporter" means for each one you configure; this is especially important for port allocation on the collector.
As of writing this, OpenTelemetry log support is still in development, so I log directly to Seq from all of my applications rather than going through the OTel collector.
receivers:
  otlp:
    protocols:
      grpc:
        include_metadata: true
      http:
        cors:
          allowed_origins:
            - '*'
          allowed_headers:
            - '*'
        include_metadata: true
  zipkin:
processors:
  batch:
  attributes:
    actions:
      - key: seq
        action: delete # remove sensitive element
exporters:
  zipkin:
    endpoint: 'http://zipkin:9411/api/v2/spans'
  logging:
    loglevel: debug
  prometheus:
    endpoint: '127.0.0.1:9091' # this is weird because the exporter is actually an endpoint that must be scraped
extensions:
  health_check:
  pprof:
  zpages:
service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp, zipkin]
      processors: [batch]
      exporters: [zipkin]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    # logs:
    #   receivers: [otlp]
    #   processors: [batch]
    #   exporters: [zipkin]
Configuring the infrastructure
I use docker compose to run the infrastructure locally, but in production you might use any infrastructure.
Telemetry docker-compose.yaml - I just use this to set which container images and tags to pull.
version: '3.4'
services:
  seq:
    image: datalust/seq:latest
  zipkin:
    image: openzipkin/zipkin-slim
  otel-collector:
    image: ${REGISTRY:-daprinventory}/otelcollector:${TAG:-latest}
    build:
      context: ./packages/otel-collector
    depends_on:
      - grafana
      - pushgateway
  prometheus:
    image: prom/prometheus:v2.35.0
    restart: unless-stopped
    depends_on:
      - pushgateway
      - alertmanager
  alertmanager:
    image: prom/alertmanager:v0.24.0
    restart: unless-stopped
    depends_on:
      - pushgateway
  pushgateway:
    image: prom/pushgateway:v1.4.3
    restart: unless-stopped
  grafana:
    image: grafana/grafana:9.0.5
    restart: unless-stopped
    depends_on:
      - prometheus
These are the local overrides for development.
Some notable configuration here was the ports! So many ports. Find yourself a numbering system with ranges you can roughly remember for exposing any required ports.
Pay close attention to the Prometheus volumes and Grafana volumes. They are a bit complex in how they're configured.
Alertmanager and the Prometheus push gateway are additions to the Prometheus service. They're not really required in development, especially for my little demo application, but they very likely are in production.
version: '3.4'
services:
  seq:
    environment:
      - ACCEPT_EULA=Y
    ports:
      - '5340:80'
  zipkin:
    ports:
      - '5411:9411'
  prometheus:
    volumes:
      - ./.docker-compose/.persist/prometheus/runtime:/prometheus
      - ./packages/prometheus:/etc/prometheus
    command:
      - '--web.listen-address=0.0.0.0:9090'
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      # - "--web.console.libraries=/etc/prometheus/console_libraries"
      # - "--web.console.templates=/etc/prometheus/consoles"
      # - "--storage.tsdb.retention.time=200h"
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
      - '--web.enable-remote-write-receiver'
      - '--web.page-title=DaprInventoryTimeseries'
      - '--log.level=debug'
    ports:
      - '9090:9090'
  alertmanager:
    volumes:
      - ./packages/alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/config.yml'
      - '--storage.path=/alertmanager'
    restart: unless-stopped
    ports:
      - '9093:9093'
  pushgateway:
    expose:
      - '9091'
    ports:
      - '9091:9091'
  grafana:
    volumes:
      - ./.docker-compose/.persist/grafana:/var/lib/grafana
      - ./packages/grafana:/etc/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=auser
      - GF_SECURITY_ADMIN_PASSWORD=apassword
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_LOG_LEVEL=info
    ports:
      - '9000:3000'
  otel-collector:
    command: ['--config=/etc/otel-collector-config.yaml']
    ports:
      - '1888:1888' # pprof extension
      - '8888:8888' # Prometheus metrics exposed by the collector
      - '8889:8889' # Prometheus exporter metrics
      - '13133:13133' # health_check extension
      - '4317:4317' # OTLP gRPC receiver
      - '4318:4318' # OTLP http receiver
      - '55679:55679' # zpages extension
      - '5414:9411' # zipkin receiver
These are the main OpenTelemetry configurations for the local development setup.
It's incredible that we can have all of this running locally and get full telemetry locally.
We can switch to a managed service like AWS X-Ray in production just by changing some YAML configuration in the OpenTelemetry Collector.
Conclusion
OpenTelemetry is supported by all of the major cloud providers and observability vendors now.
Even though it is still in the incubating stage, you should start moving your telemetry to it, especially for any new distributed applications.
Some of the configuration is tricky, but once it's working it's incredibly powerful. The instrumentation ecosystem is only going to get better as the entire industry converges on OpenTelemetry.
Let me know if you have any questions about OpenTelemetry in NestJS!