
Serverless tracing with AWS X-Ray

Rolf Streefkerk ・ 6 min read

TL;DR

Check out the AWS CloudWatch ServiceLens section to see how effective AWS X-Ray can be at spotting and signalling application errors.

Look at the repository below for the example implementation.

rpstreef / openapi-tf-example

Example of how you can use OpenAPI with AWS API Gateway. Also includes integrations with AWS Lambda, AWS Cognito, AWS SNS and CloudWatch Logs.

Thanks for reading, and till next week!

Topic break-down

Serverless tracing with AWS X-Ray

For this article we're using the OpenAPI series GitHub repository. Please clone this repo to follow along.

In previous articles we've discussed how to properly monitor your serverless applications with CloudWatch Alarms, and how to implement searchable structured logging.

This week I'd like to wrap up the topic by discussing tracing with AWS X-Ray. So what exactly is AWS X-Ray? To quote the AWS documentation:

AWS X-Ray helps developers analyze and debug production, distributed applications, such as those built using a microservices architecture. With X-Ray, you can understand how your application and its underlying services are performing to identify and troubleshoot the root cause of performance issues and errors.

Basically, this service records data about each step of a request as it moves through your serverless application, and makes that information available as traces, service maps, and timeline diagrams. You can then analyze exactly where failures occur within your application.

Enabling AWS X-Ray requires some configuration at the infrastructure level with Terraform, and some at the code level as well.

AWS Lambda

To enable X-Ray tracing on Lambda, we actually don't have to do much on the infrastructure side. These are the two steps in our Terraform implementation:

1) AWS Lambda resource

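In the aws_lambda_function resource, check the mode argument of the tracing_config block, which we expose through the tracing_config_mode variable.
By default it's set to PassThrough which, as the name suggests, passes through the tracing decision made upstream: Lambda only traces an invocation when the caller (in our case the API Gateway API call) has already sampled the request and sent that decision along in the X-Ray trace header. Since our functions are invoked through API Gateway, we can leave this setting as is. If you invoke the Lambda function directly, set it to Active.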

variable "tracing_config_mode" {
  description = "Can be either PassThrough or Active"
  type        = string
  default     = "PassThrough"
}

resource "aws_lambda_function" "_" {
  function_name                  = local.function_name
  role                           = var.lambda_role_arn
  runtime                        = var.lambda_runtime
  filename                       = var.lambda_filename
  handler                        = "handlers/${var.lambda_function_name}/index.handler"
  timeout                        = var.lambda_timeout
  memory_size                    = var.lambda_memory_size
  reserved_concurrent_executions = var.reserved_concurrent_executions

  source_code_hash = filebase64sha256("${var.dist_path}/${var.distribution_file_name}")

  layers = [var.lambda_layer_arn]

  tracing_config {
    mode = var.tracing_config_mode
  }

  environment {
    variables = var.lambda_environment_variables
  }

  tags = {
    Environment = var.namespace
    Name        = var.resource_tag_name
  }
}
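To make the difference between the two modes concrete, here is a small JavaScript model of the decision described above. It's a simplification under stated assumptions, not Lambda's implementation: upstreamSampled and isTraced are hypothetical helpers, and the Active branch assumes Lambda honors an upstream sampling decision when the trace header carries one, and otherwise falls back to its own sampling (passed in as localDecision).

```javascript
// Simplified model (an assumption, not Lambda's actual implementation) of
// whether an invocation gets traced under each tracing_config mode.

// Read the Sampled flag from an X-Ray trace header such as
// "Root=1-...;Parent=...;Sampled=1". Returns true, false, or undefined.
function upstreamSampled(traceHeader) {
  if (!traceHeader) return undefined
  for (const part of traceHeader.split(';')) {
    const [key, value] = part.split('=')
    if (key.trim() === 'Sampled') {
      if (value && value.trim() === '1') return true
      if (value && value.trim() === '0') return false
    }
  }
  return undefined
}

// mode: 'PassThrough' | 'Active'
// localDecision: Lambda's own sampling result, only consulted in Active mode
function isTraced(mode, traceHeader, localDecision) {
  const upstream = upstreamSampled(traceHeader)
  if (mode === 'PassThrough') {
    // Only trace when an upstream service (e.g. API Gateway) sampled the request
    return upstream === true
  }
  if (mode === 'Active') {
    // Honor an upstream decision if present, otherwise use the local one
    return upstream !== undefined ? upstream : Boolean(localDecision)
  }
  throw new Error(`Unknown tracing mode: ${mode}`)
}
```

In other words, with PassThrough and tracing disabled on API Gateway, no sampled header ever arrives and the function is never traced, which is why the API Gateway stage also needs configuring.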

2) AWS IAM Policy

For the AWS Lambda function, we need to add the following statement to the existing IAM policy JSON document of its execution role. It allows the function to write trace data to X-Ray and to read the sampling rules (we'll get to those when we discuss API Gateway):

{
    "Effect": "Allow",
    "Action": [
        "xray:PutTraceSegments",
        "xray:PutTelemetryRecords",
        "xray:GetSamplingRules",
        "xray:GetSamplingTargets",
        "xray:GetSamplingStatisticSummaries"
    ],
    "Resource": [
        "*"
    ]
}
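If the policy document is assembled in code rather than kept as a static JSON file, the same statement can be appended programmatically. A minimal sketch, assuming the policy is a plain JavaScript object; withXRayStatement and XRAY_STATEMENT are hypothetical names, and duplicates are detected by comparing serialized statements, which only catches exact copies.

```javascript
// Hypothetical helper names; the statement itself matches the JSON above.
const XRAY_STATEMENT = {
  Effect: 'Allow',
  Action: [
    'xray:PutTraceSegments',
    'xray:PutTelemetryRecords',
    'xray:GetSamplingRules',
    'xray:GetSamplingTargets',
    'xray:GetSamplingStatisticSummaries'
  ],
  Resource: ['*']
}

// Return a copy of `policy` with the X-Ray statement appended, unless an
// identical statement is already present. The input is not mutated.
function withXRayStatement(policy) {
  // IAM allows Statement to be a single object or an array; normalize to array
  const statements = policy.Statement === undefined ? [] : [].concat(policy.Statement)
  const target = JSON.stringify(XRAY_STATEMENT)
  const alreadyPresent = statements.some((s) => JSON.stringify(s) === target)
  return {
    ...policy,
    Statement: alreadyPresent ? statements : [...statements, XRAY_STATEMENT]
  }
}
```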

That's it! On AWS Lambda there's no need to configure any libraries if you're not going to customize X-Ray segments or add execution metadata.

However, to get better tracing results, it's recommended to use the official X-Ray SDK so that you can see exactly which AWS (and non-AWS) services your function calls.

AWS SNS

To trace calls to SNS (or S3, HTTP(S) endpoints, etc.), you have to use the official X-Ray SDK for Node.js.

Once we've added the aws-xray-sdk package to package.json, we enable tracing on the SNS client as follows:

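// AWS SDK for JavaScript (v2) and the X-Ray SDK for Node.js
const AWS = require('aws-sdk')
const AWSXRay = require('aws-xray-sdk')

// Wrap the SNS client so that every call it makes is recorded in the trace
const sns = AWSXRay.captureAWSClient(new AWS.SNS({ apiVersion: '2010-03-31' }))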

Now every call made through this SNS client is traced automatically. The same works for the other AWS services we consume: wrap their client with captureAWSClient and we're done.
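To illustrate what wrapping a client buys us, here is a toy sketch with no AWS dependencies. captureClient is a hypothetical name: it wraps every method of an object in a Proxy and records the method name, duration, and error message of each call, loosely analogous to the subsegments the X-Ray SDK records for SNS calls. It is not how aws-xray-sdk is implemented, and it assumes the wrapped methods are async (it always returns a promise).

```javascript
// Toy illustration only, NOT the aws-xray-sdk implementation.
// Every method call on the returned proxy pushes one record into `records`.
function captureClient(client, records) {
  return new Proxy(client, {
    get(target, prop, receiver) {
      const value = Reflect.get(target, prop, receiver)
      if (typeof value !== 'function') return value
      return async (...args) => {
        const start = Date.now()
        const entry = { name: String(prop), error: null, durationMs: 0 }
        try {
          return await value.apply(target, args)
        } catch (err) {
          // Keep the error on the record, then rethrow so callers still see it
          entry.error = err.message
          throw err
        } finally {
          entry.durationMs = Date.now() - start
          records.push(entry)
        }
      }
    }
  })
}
```

In the real application, stick with captureAWSClient as shown above; the toy only shows why a single wrapper is enough to trace every call a client makes.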

If you want to customize the traces further with additional (sub)segments and metadata, review the documentation on the SDK's GitHub page, which includes example code showing how to achieve it.

Pro tip: in your AWS Lambda code, capture the X-Ray trace ID from the environment variable process.env._X_AMZN_TRACE_ID and add it to your Logger library, so that you can use it in your log searches.

AWS API Gateway

Again, here are the two steps to configure in our Terraform implementation:

1) API Gateway stage resource

To enable X-Ray tracing for API Gateway, configure the stage resource as follows:

variable "xray_tracing_enabled" {
  description = "Enables the XRay tracing and will create the necessary IAM permissions"
  type        = bool
  default     = false
}

resource "aws_api_gateway_stage" "_" {
  stage_name    = var.namespace
  rest_api_id   = aws_api_gateway_rest_api._.id
  deployment_id = aws_api_gateway_deployment._.id

  xray_tracing_enabled = var.xray_tracing_enabled

  tags = {
    Environment = var.namespace
    Name        = var.resource_tag_name
  }
}
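With tracing enabled on the stage, API Gateway records its own segment and forwards the trace ID to the Lambda integration in the trace header. Trace IDs follow the format documented by X-Ray: the version number 1, the time of the original request as 8 hex digits of Unix epoch seconds, and a 96-bit identifier as 24 hex digits. The sketch below generates and inspects such an ID; newTraceId and traceIdTime are hypothetical helpers, for example for producing realistic IDs in tests of your log correlation.

```javascript
const crypto = require('crypto')

// Build an X-Ray style trace ID: "1-" + 8 hex digits of epoch seconds + "-" + 24 random hex digits
function newTraceId(now = new Date()) {
  const epochHex = Math.floor(now.getTime() / 1000).toString(16).padStart(8, '0')
  const random = crypto.randomBytes(12).toString('hex') // 96 bits → 24 hex chars
  return `1-${epochHex}-${random}`
}

// Recover the request time encoded in a trace ID, or null if it's malformed
function traceIdTime(traceId) {
  const match = /^1-([0-9a-f]{8})-[0-9a-f]{24}$/.exec(traceId)
  return match ? new Date(parseInt(match[1], 16) * 1000) : null
}
```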

2) AWS X-Ray sampling rules

Then, optionally, we can configure sampling rules for our API Gateway endpoints. The AWS X-Ray documentation says the following:

By customizing sampling rules, you can control the amount of data that you record, and modify sampling behavior on the fly without modifying or redeploying your code. Sampling rules tell the X-Ray SDK how many requests to record for a set of criteria. By default, the X-Ray SDK records the first request each second, and five percent of any additional requests. One request per second is the reservoir, which ensures that at least one trace is recorded each second as long as the service is serving requests. Five percent is the rate at which additional requests beyond the reservoir size are sampled.
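The reservoir-plus-rate behavior in that quote can be sketched in a few lines. This is a local, single-process approximation with hypothetical names (createSampler, shouldSample); the real X-Ray SDK coordinates reservoir usage centrally across all hosts, which this ignores. The random source is injectable so the behavior can be checked deterministically.

```javascript
// The first `reservoirSize` requests in each one-second window are always
// sampled; any further request is sampled with probability `fixedRate`.
function createSampler({ reservoirSize = 1, fixedRate = 0.05, random = Math.random } = {}) {
  let currentSecond = -1
  let usedThisSecond = 0
  return function shouldSample(nowMs = Date.now()) {
    const second = Math.floor(nowMs / 1000)
    if (second !== currentSecond) {
      // New one-second window: refill the reservoir
      currentSecond = second
      usedThisSecond = 0
    }
    if (usedThisSecond < reservoirSize) {
      usedThisSecond += 1
      return true
    }
    return random() < fixedRate
  }
}
```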

By default the following sampling rule is used:

[Screenshot: the default X-Ray sampling rule]

We can override this rule, or create additional rules with different priorities, via the aws_xray_sampling_rule resource:

resource "aws_xray_sampling_rule" "example" {
  rule_name      = "example"
  priority       = 9999 # custom rules accept 1-9999; the built-in default rule uses 10000
  version        = 1
  reservoir_size = 1
  fixed_rate     = 0.05
  url_path       = "*"
  host           = "*"
  http_method    = "*"
  service_type   = "*"
  service_name   = "*"
  resource_arn   = "*"

  attributes = {
    Hello = "Tris"
  }
}

The example above mirrors the default rule: it records the first request each second (reservoir_size = 1) and 5% of any additional requests (fixed_rate = 0.05), on all endpoints.
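When several rules exist, X-Ray evaluates them in ascending priority order (a lower number is checked first) and applies the first rule whose criteria match the request; the built-in default rule comes last. The sketch below models that selection with `*` and `?` wildcards. matchRule and globToRegExp are hypothetical names, the field names mirror the aws_xray_sampling_rule arguments, and only a subset of them is checked (service_type, resource_arn, and attributes are ignored).

```javascript
// Convert a sampling-rule pattern into a regex: `*` = any characters, `?` = one character
function globToRegExp(pattern) {
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&')
  return new RegExp('^' + escaped.replace(/\*/g, '.*').replace(/\?/g, '.') + '$')
}

// Subset of rule criteria checked by this sketch
const FIELDS = ['host', 'http_method', 'url_path', 'service_name']

// Return the first matching rule in ascending priority order, or null
function matchRule(rules, request) {
  const ordered = [...rules].sort((a, b) => a.priority - b.priority)
  const rule = ordered.find((r) =>
    FIELDS.every((field) => globToRegExp(r[field] || '*').test(request[field] || ''))
  )
  return rule || null
}
```

The consequence for configuration: an endpoint-specific rule must have a lower priority number than a catch-all rule like the example above, or it will never be reached.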

In a production environment we don't want to trace every single execution, as that would be cost-prohibitive. Instead, we may want different rules for specific endpoints, depending on their traffic, to get the right number of traces for debugging.
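Before tuning rules per endpoint, it helps to estimate how many traces a rule will actually record. Assuming steady traffic on a single host, the reservoir is used first and the fixed rate applies to the rest, so at 100 requests per second the example rule records about 1 + 0.05 × 99 = 5.95 traces per second, roughly 15.4 million over a 30-day month. expectedTracesPerSecond and expectedTracesPerMonth are hypothetical helpers for that arithmetic.

```javascript
// Expected traces recorded per second for one rule at a steady request rate
function expectedTracesPerSecond(requestsPerSecond, { reservoirSize = 1, fixedRate = 0.05 } = {}) {
  const fromReservoir = Math.min(requestsPerSecond, reservoirSize)
  const remainder = Math.max(0, requestsPerSecond - reservoirSize)
  return fromReservoir + fixedRate * remainder
}

// Extrapolate to a 30-day month (86,400 seconds per day)
function expectedTracesPerMonth(requestsPerSecond, rule) {
  return expectedTracesPerSecond(requestsPerSecond, rule) * 86400 * 30
}
```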

With that, we've finished the setup part of AWS X-Ray for our particular example. Next up is the AWS Console part where I'll show you how debugging will work in practice.

AWS CloudWatch ServiceLens

A new addition to CloudWatch, called ServiceLens, neatly integrates AWS X-Ray tracing, CloudWatch Alarms, and CloudWatch Logs. To get to this service, which was introduced in late 2019, open CloudWatch and select it from the menu on the left.

To generate this example data, I called the example API several times with Postman and introduced a few coding errors. This is the result:

[Screenshot: ServiceLens service map]

If we drill down into the execution details, we can see the recorded traces and which ones have failed:

[Screenshot: service map traces]

Let's review the second trace from the top:

[Screenshot: trace timeline]

We can already tell that the error is at the SNS invocation without having looked at any log files yet!

When we click the segment called SNS, the segment details appear right below the timeline, including the actual error that occurred and its stack trace.
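ServiceLens does this drill-down for us, but the same idea can be applied to raw trace data, for example the segment documents returned by the X-Ray BatchGetTraces API (each segment's Document is a JSON string you would JSON.parse first). The sketch below walks a segment document depth-first and lists every (sub)segment flagged with fault or error, together with the first exception message from its cause. findFailures is a hypothetical helper, and the sample document in the usage check is made up for illustration.

```javascript
// Depth-first walk over a segment and its subsegments, collecting every node
// flagged with `fault` (server-side) or `error` (client-side).
function findFailures(segment, path = []) {
  const here = [...path, segment.name]
  const failures = []
  if (segment.fault || segment.error) {
    const exceptions = (segment.cause && segment.cause.exceptions) || []
    failures.push({
      path: here.join(' > '),
      kind: segment.fault ? 'fault' : 'error',
      message: exceptions.length ? exceptions[0].message : null
    })
  }
  for (const child of segment.subsegments || []) {
    failures.push(...findFailures(child, here))
  }
  return failures
}
```

Because failures propagate upwards, the deepest entry in the list is usually the best starting point, just like the SNS segment in the timeline above.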

Conclusion

This example demonstrated how easy it can be to spot errors in a particular segment of your code using X-Ray tracing. Combine this with CloudWatch Alarms monitoring and structured logging on AWS Lambda, and you've got a powerful suite of debugging and monitoring services and tools that will help you resolve errors much more quickly.

I hope this has been useful! Please leave a comment below with your experiences debugging serverless/microservice architectures.

Next week I'd like to discuss how to properly deploy a serverless architecture on AWS, so we'll cover CI/CD with AWS CodePipeline.

Thanks for reading!
