Yoav Danieli for Aspecto · Originally published at aspecto.io

How to Instrument AWS Services with OpenTelemetry


In this AWS OpenTelemetry guide, you will learn how to instrument your AWS services with OpenTelemetry. I will demonstrate how to instrument a simple microservice system running on AWS. We will use AWS SQS, DynamoDB, and Lambda.

First, we will create all the resources we need using Terraform. Then, we will instrument our code automatically using the OpenTelemetry Node SDK. Finally, we will deploy the system and visualize the traces it creates.

Let’s begin!

What is OpenTelemetry

OpenTelemetry is a community-driven open-source project and a unified specification for how we collect, generate, and export telemetry data to analyze our software’s behavior.

Sponsored by the CNCF (Cloud Native Computing Foundation), the OpenTelemetry project provides APIs and SDKs per programming language for generating telemetry, a centralized Collector that receives, processes, and exports data, and the OTLP protocol for shipping telemetry data.

In a cloud-native environment, we use OpenTelemetry (OTel for short) to gather data from our system operations and events. In other words, to instrument our distributed services. This data enables us to understand and investigate our software’s behavior and troubleshoot performance issues and errors.

To get a deeper understanding of this technology, watch our free OpenTelemetry Bootcamp video series, which covers OTel from end to end in 6 episodes (it’s binge-worthy, really).

AWS + OpenTelemetry: Creating our application

We will create a simple order-taking application, composed of the following components:

  • An order-api written in Node and deployed as an AWS Lambda
  • An AWS DynamoDB table
  • An AWS SQS queue
  • An external Node service that listens on the SQS queue

The application will receive orders and insert them into the database with status “processing”. It will then publish an “order received” message to SQS. The external service in charge of processing these orders will receive the message and change the order status in the database to “complete”.
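For reference, here is what an order item could look like in the table right after the Lambda writes it (the fields mirror the handler code below; the exact values are illustrative):

{
  "id": "1",
  "name": "Awesome Item",
  "price": 100,
  "status": "processing"
}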

That is it! Let’s start by setting up AWS Lambda.

Setup Disclaimer

There are many ways to interact with the AWS platform, each with its pros and cons. I like using Terraform and other infrastructure-as-code tools so that I can easily change, configure, replicate, or migrate my work. So in this tutorial, I will create all the AWS services and resources I need using Terraform. Explanations of Terraform and AWS permissions and configurations are out of this article’s scope.

For the exact implementation details, you’re welcome to read the code example for this guide.

AWS Lambda Setup

The first component of our system is the order-api. Let’s create a new TypeScript project and add the Lambda code to the index.ts file:

import { SQS, DynamoDB } from "aws-sdk";
import { Context, APIGatewayEvent, APIGatewayProxyResult } from 'aws-lambda';
import { SendMessageRequest } from "aws-sdk/clients/sqs";
const sqs = new SQS();
const dynamo = new DynamoDB.DocumentClient();
const handler = async (event: APIGatewayEvent, context: Context): Promise<APIGatewayProxyResult> => {
   let body;
   let statusCode = 200;
   const routeKey = `${event.httpMethod} ${event.resource}`
   const headers = {
       "Content-Type": "application/json"
   };
   try {
       const tableName = process.env.DDB_TABLE_NAME as string
       const sqsUrl = process.env.SQS_URL as string
       if (!tableName) {
           return { statusCode: 500, body: 'Missing environment variable DDB_TABLE_NAME', headers }
       }
       if (!sqsUrl) {
           return { statusCode: 500, body: 'Missing environment variable SQS_URL', headers }
       }
       const id = event.pathParameters?.id
       switch (routeKey) {
           case "DELETE /items/{id}": // Delete Item...
           case "GET /items/{id}": // get item...
           case "GET /items": // get all items...
           case "PUT /items":
               let requestJSON = JSON.parse(event.body as string);
               await dynamo.put({
                   TableName: tableName,
                   Item: {
                       id: requestJSON.id,
                       price: requestJSON.price,
                       name: requestJSON.name,
                       status: "processing"
                   }
               }).promise();
               const params: SendMessageRequest = {
                   MessageBody: JSON.stringify({ id: requestJSON.id }),
                   QueueUrl: sqsUrl,
               }
               await sqs.sendMessage(params).promise();
               body = `Put item ${requestJSON.id}`;
               break;
           default:
               throw new Error(`Unsupported route: "${routeKey}"`);
       }
   } catch (err: any) {
       statusCode = 400;
       body = err.message;
   } finally {
       body = JSON.stringify(body);
   }
   return {
       statusCode,
       body,
       headers
   };
};
export { handler };

As you can see, the PUT case creates a new item with status “processing” and inserts it into DynamoDB. It then sends the newly created order’s id to SQS as a JSON string.

Let’s compile the code using tsc and ensure a new dist folder was created.
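If you haven’t set one up yet, here is a minimal tsconfig.json sketch, assuming a CommonJS build emitted into dist (pick a target that matches the Node runtime you deploy, as noted below):

{
  "compilerOptions": {
    "target": "ES2019",
    "module": "commonjs",
    "outDir": "dist",
    "esModuleInterop": true,
    "strict": true
  },
  "include": ["*.ts"]
}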

Now add this to your Terraform configuration:

resource "aws_lambda_function" "order_api" {
 function_name = "order_api"
 filename         = data.archive_file.lambdas_code.output_path
 source_code_hash = data.archive_file.lambdas_code.output_base64sha256
 role    = aws_iam_role.iam_for_lambda.arn
 handler = "index.handler"
 runtime = "nodejs12.x"
 timeout = 10
  environment {
   variables = {
     SQS_URL = # TBD
     DDB_TABLE_NAME = # TBD

   }
 }
}
data "archive_file" "lambdas_code" {
 type        = "zip"
 output_path = "${path.module}/dist.zip"
 source_dir  = "${PATH_TO_DIST_FOLDER}/dist"
}

Important Notes

  • Before deploying the Lambda, verify that the TypeScript compiler target is compatible with the Node runtime you specified (check the tsc target/Node version compatibility table).
  • Check that the source dir specified in the archive_file resource contains index.js and node_modules.
  • In the example code for this blog, I also added an AWS API Gateway so we can trigger the Lambda via a public URL. That is out of scope for this post, but you can check it out in the source code.

AWS SQS Setup

Create the SQS queue, then add the created queue’s URL as an environment variable to our Lambda.

resource "aws_sqs_queue" "order_queue" {
 name = "orderQueue"
}

Change SQS_URL:

SQS_URL = aws_sqs_queue.order_queue.id # for aws_sqs_queue, the id attribute is the queue URL

AWS DynamoDB

Create a DynamoDB table and add the table name as an environment variable to our Lambda.

resource "aws_dynamodb_table" "order_table" {
 name           = "Orders"
 billing_mode   = "PROVISIONED"
 read_capacity  = 20
 write_capacity = 20
 hash_key       = "id"
 attribute {
   name = "id"
   type = "S"
 }
}

Change DDB_TABLE_NAME:

DDB_TABLE_NAME = aws_dynamodb_table.order_table.name

OpenTelemetry + AWS: External Service

The external service is a simple Node service. It can run anywhere as long as it has the permissions necessary to read messages from the queue and update items in the DynamoDB table. This example runs the service locally.

To receive messages from the queue, we will use the sqs-consumer library. The service will receive messages describing newly created orders and, after some processing, change the order status in the table to “complete”.

Create sqs-listener.ts:

import { DynamoDB } from "aws-sdk";
import { DocumentClient } from "aws-sdk/clients/dynamodb";
import { Consumer } from "sqs-consumer";
const dynamo = new DynamoDB.DocumentClient({ region: process.env.AWS_REGION as string });
async function dynamoUpdateCompleted(id: string) {
    // DocumentClient works with plain JavaScript values, so neither the key
    // nor the attribute values need the low-level { S: ... } type wrappers
    const params: DocumentClient.UpdateItemInput = {
        TableName: process.env.DDB_TABLE_NAME as string,
        Key: { id },
        ExpressionAttributeNames: {
            "#S": "status"
        },
        ExpressionAttributeValues: {
            ":s": "complete"
        },
        // This expression is what updates the item attribute
        UpdateExpression: "SET #S = :s",
        ReturnValues: "ALL_NEW",
    };
    await dynamo.update(params).promise();
}

const app = Consumer.create({
   queueUrl: process.env.SQS_URL as string,
   handleMessage: async (message) => {
       if (message.Body) {
           const parsedBody = JSON.parse(message.Body);
           // do some processing and then change status to complete
           await dynamoUpdateCompleted(parsedBody.id);
       }
   }
});
app.on('error', (err) => {
   console.error(err.message);
});
app.on('processing_error', (err) => {
   console.error(err.message);
});
app.start();

Important Notes

  • Make sure you have AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY configured on your machine.
  • sqs-consumer deletes the message from the queue once the handleMessage function completes successfully. That is important: deleting the message is what acknowledges to the queue that the message was received and processed.

Instrumenting AWS services with OpenTelemetry and exporting telemetry data

To instrument Node services, we create a file containing the code and the configuration for our tracing.

We then require this file before any other code in our system runs so it can wrap the relevant functions and create spans for the operations our system performs.

On AWS, it is no different. Let’s create a tracing.ts file and add the OpenTelemetry configuration to it.

OpenTelemetry is configured and accessed by the application through a NodeSDK instance, so we will initialize it with our configuration.

First, let’s add the instrumentations. Our Lambda service uses the aws-sdk library and the aws-lambda runtime, so any library providing auto-instrumentations for these operations should be enough. Luckily, when using Node, we can use the @opentelemetry/auto-instrumentations-node package.

It bundles all publicly available Node auto-instrumentations, including the aws-sdk and aws-lambda instrumentations we need here.

Important Configuration Notes

  1. The Lambda instrumentation tries to use the X-Ray context headers by default (even when we’re not using X-Ray), which leaves us with a non-sampled context and a NonRecordingSpan. To fix this, we set the disableAwsContextPropagation flag. More information about this can be found in the instrumentation docs.

  2. The SQS queue we created is configured by default to deliver message attributes inside the payload; this is called ‘Raw Message Delivery’ (you can read more about it in the AWS docs). When this is the case, we need to explicitly tell the instrumentation to extract its context from the payload by setting sqsExtractContextPropagationFromPayload to true. Note that this has a performance cost, since the instrumentation now runs JSON.parse on each message body to get the context.

Let’s process our spans using a BatchSpanProcessor and a custom OTLPTraceExporter. We could export our traces to the console, but then we would have to dig through AWS CloudWatch to see them (which would be a bit messy).
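If you do want a quick local sanity check first, a minimal sketch of a console setup could look like this (using the same @opentelemetry/sdk-trace-base package we import below):

import { ConsoleSpanExporter, SimpleSpanProcessor } from "@opentelemetry/sdk-trace-base";

// Prints each finished span to stdout as soon as it ends -
// handy for debugging, too noisy and slow for production
const consoleProcessor = new SimpleSpanProcessor(new ConsoleSpanExporter());

You would then pass consoleProcessor as the spanProcessor instead of the batch processor used in the configuration below.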

Visualizing OpenTelemetry Data in Aspecto

AWS has good tools for tracing, but in this example, I will use another remote and distributed tracing platform – Aspecto.

To follow along, you can open a new free-forever Aspecto account or log in to your existing one.

Below, make sure to replace the {ASPECTO_API_KEY} with your unique Aspecto token ID – https://app.aspecto.io/app/integration/token (Settings > Integrations > Tokens)

Finally, let’s give our service a name that can be taken from either the SERVICE_NAME or the AWS_LAMBDA_FUNCTION_NAME environment variables.

Putting it all together, it looks something like this:

import { NodeSDK } from "@opentelemetry/sdk-node";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto';
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
const exporter = new OTLPTraceExporter({
   url: 'https://otelcol.aspecto.io/v1/traces',
   headers: {
       // Aspecto API-Key is required
       Authorization: process.env.ASPECTO_API_KEY
   }
})
const sdk = new NodeSDK({
   spanProcessor: new BatchSpanProcessor(exporter),
   serviceName: process.env.SERVICE_NAME || process.env.AWS_LAMBDA_FUNCTION_NAME,
   instrumentations: [
       getNodeAutoInstrumentations({
           "@opentelemetry/instrumentation-aws-sdk": {
               sqsExtractContextPropagationFromPayload: true
           },
           "@opentelemetry/instrumentation-aws-lambda": {
               disableAwsContextPropagation: true
           }
       })
   ]
});
sdk.start()

To use the OpenTelemetry Node SDK, we must load and run it before our application code is loaded. So let’s compile it using tsc so that it ends up inside the dist folder and is packaged together with the Lambda code.

After that, we need to add a NODE_OPTIONS environment variable to our Lambda, telling the Node runtime to require this module before the Lambda code runs. Here’s the final Lambda configuration:

resource "aws_lambda_function" "order_api" {
 function_name = "order_api"
 filename         = data.archive_file.lambdas_code.output_path
 source_code_hash = data.archive_file.lambdas_code.output_base64sha256
 role    = aws_iam_role.iam_for_lambda.arn
 handler = "index.handler"
 runtime = "nodejs12.x"
 timeout = 10
  environment {
   variables = {
     ASPECTO_API_KEY = var.ASPECTO_API_KEY
     SQS_URL = aws_sqs_queue.order_queue.id
     DDB_TABLE_NAME = aws_dynamodb_table.order_table.name
     NODE_OPTIONS =  "--require tracing.js"
   }
 }
}

Important Notes

  • You don’t have to create an OpenTelemetry configuration file like this for each of your Lambdas. In fact, you shouldn’t: in AWS, you can use Lambda Layers. You can define the OpenTelemetry tracing code as a Lambda layer and reuse it in any Lambda you want. Furthermore, the OpenTelemetry community already implemented the opentelemetry-lambda layer for us; all we need to do is use it with our config, as sketched below.
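A rough Terraform sketch of attaching such a layer (the layer ARN is a placeholder you would look up for your region and runtime; AWS_LAMBDA_EXEC_WRAPPER is how the Node layer hooks into the runtime, per the opentelemetry-lambda docs):

resource "aws_lambda_function" "order_api" {
  # ... same configuration as above ...

  # placeholder ARN - use the published layer for your region and runtime
  layers = ["arn:aws:lambda:<region>:<account>:layer:opentelemetry-nodejs:<version>"]

  environment {
    variables = {
      # start the runtime through the layer's wrapper script
      AWS_LAMBDA_EXEC_WRAPPER = "/opt/otel-handler"
    }
  }
}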

Running AWS Services and Visualizing OpenTelemetry Data in Aspecto

When running the SQS-listener service, remember to require the tracing configuration file.

Note that for this service, you can remove the disableAwsContextPropagation flag.

Let’s compile and run the service:

SERVICE_NAME=sqs-listener node --require path/to/tracing.js path/to/sqs-listener.js

Now the service is waiting for messages to be sent to the queue. Let’s deploy our Lambda so we can invoke it and send messages to the queue:

terraform -chdir=path/to/terraform/config apply -var "ASPECTO_API_KEY=${ASPECTO_API_KEY}"

You can invoke the Lambda using the AWS CLI or, if you set up an API Gateway, by making HTTP requests.
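For the CLI route, here is a hedged sketch; since our handler expects an API Gateway proxy event, the payload must mimic that shape (the --cli-binary-format flag applies to AWS CLI v2):

aws lambda invoke \
  --function-name order_api \
  --cli-binary-format raw-in-base64-out \
  --payload '{"httpMethod":"PUT","resource":"/items","body":"{\"id\":\"1\",\"price\":100,\"name\":\"Awesome Item\"}"}' \
  response.json

And via the API Gateway URL: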

curl --header "Content-Type: application/json" \
--request POST \
--data '{"id":"1","price":100, "name": "Awesome Item"}' \
http://your-api-gateway-lambda-invoke-url/items

You should get back a 200 response with the message “Put item 1”. Let’s check out our traces in Aspecto:

I received the following traces:

AWS SQS OpenTelemetry traces in Aspecto

As you can see, our flow is divided into two traces. This happens because SQS messages are received in batches; I will explain more about that below. For now, let’s examine the traces.

Click the “order_api” lambda trace:

We can see 5 spans:

  • The Lambda trigger
  • The AWS SDK DynamoDB operation + its HTTP request
  • The AWS SDK SQS operation + its HTTP request

Clicking the SQS span:

Aspecto OpenTelemetry traces, linking SQS message publisher and consumer

We can see that the message is linked to the trace that received it. Clicking this link redirects us to the second trace:

Aspecto Trace view complete trace

This trace contains 7 spans. By clicking the Node service, I can see a more convenient view of the spans (on the left)

Aspecto spans overview

It seems that 4 spans were created by the AWS SDK instrumentation and the other 3 by the HTTP instrumentation. We can suppress the HTTP spans that the AWS SDK creates internally by setting the suppressInternalInstrumentation flag in the configuration.
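A sketch of where that flag would go in the tracing configuration we wrote earlier:

getNodeAutoInstrumentations({
    "@opentelemetry/instrumentation-aws-sdk": {
        sqsExtractContextPropagationFromPayload: true,
        // drop the internal HTTP child spans created by aws-sdk requests
        suppressInternalInstrumentation: true
    },
    "@opentelemetry/instrumentation-aws-lambda": {
        disableAwsContextPropagation: true
    }
})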

By putting these two traces together, we get a full and clear view of our system.

OpenTelemetry AWS SQS: Batch Receiving

According to the OpenTelemetry specification for messaging systems, when a process receives messages in a batch, it cannot determine a single parent span for the span it is about to create.

Since a span can only have one parent, and the propagated trace and span IDs are not yet known when the receiving span is started, the receiving span has no parent, and the processing spans are correlated with the producing spans via span links.
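The instrumentation creates these links for us, but to make the concept concrete, here is a minimal sketch of linking a processing span to a producer’s span context with the OpenTelemetry API (the processOrder function and its argument are illustrative):

import { trace, SpanContext } from "@opentelemetry/api";

const tracer = trace.getTracer("sqs-listener");

// producerContext would come from the context extracted off the message;
// a link correlates this span with the producing trace without making
// the producer its parent
function processOrder(producerContext: SpanContext) {
    const span = tracer.startSpan("process order", {
        links: [{ context: producerContext }],
    });
    // ... do the work ...
    span.end();
}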

Conclusion

In this guide, you hopefully learned how to create a simple application using AWS Lambda, SQS, and DynamoDB, and how to instrument it with OpenTelemetry. I also pointed out some tips and gotchas I encountered while working on this project.

Please feel free to contact me with any questions, comments, or suggestions.
