I've written about this topic in my most recent article.
Serverless tracing with AWS X-Ray
Rolf Streefkerk ・ Feb 16 '20
#aws
#devops
#serverless
#terraform
Now I'm keen to know about your experience with debugging on AWS serverless or microservice architectures: best practices, dos and don'ts, tooling you have used, etc.
Top comments (7)
Hi. I was going to comment on your article, which I quickly read yesterday. I wanted to ask you about other approaches and tools available on the market.
I mostly have two observability tools in mind. I've never tried them out, but I've always wondered what exactly they do that can't be done with the AWS solutions. Do you have experience with them? Is it just a matter of cost? AFAIK CloudWatch and X-Ray are not cheap at all.
Sadly I don't have extensive experience with this kind of tooling, and I was hoping for some experience sharing here on dev.to regarding alternatives and how you would implement them with IaC (Terraform), for instance.
Regarding cost, there are knobs for both CloudWatch Logs and X-Ray to keep costs in check.
For logs you should set a proper retention period, and in production you can sample logs to reduce logging costs fairly significantly.
For X-Ray, as you've read, the sampling rules you can set up per endpoint let you tune costs effectively.
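For illustration, here is a minimal boto3 sketch of those two knobs (the log group name, rule name, and rates are made-up values; the same settings could equally be declared in Terraform or another IaC tool):

```python
import boto3

logs = boto3.client("logs")
xray = boto3.client("xray")

# Keep production logs for a bounded period instead of "never expire".
logs.put_retention_policy(
    logGroupName="/aws/lambda/my-function",  # hypothetical log group
    retentionInDays=30,
)

# Trace only a sample of requests for a given endpoint instead of everything.
xray.create_sampling_rule(
    SamplingRule={
        "RuleName": "orders-endpoint",  # hypothetical rule
        "Priority": 100,
        "FixedRate": 0.05,       # trace roughly 5% of matching requests
        "ReservoirSize": 1,      # plus one request per second as a baseline
        "ServiceName": "*",
        "ServiceType": "*",
        "Host": "*",
        "HTTPMethod": "GET",
        "URLPath": "/orders/*",
        "ResourceARN": "*",
        "Version": 1,
    }
)
```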
Yep. I'm also hoping someone shares their experience and opinions here!
(I also have no experience with Terraform; we use the Serverless Framework and recently started with the CDK.)
And to answer your question: so far we have lots of custom metrics and alarms in CloudWatch (with email and Slack notifications), and we use a custom logger to add context and a correlationId to the logs (and pass it down to other components in the stack), but we haven't set up a full-fledged solution for debugging our serverless applications.
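For anyone curious, a simplified sketch of what such a logger could look like in a Python Lambda handler (the class, header name, and log format are illustrative assumptions, not our exact code):

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO)  # the Lambda runtime normally sets up logging itself

# Hypothetical helper: a LoggerAdapter that stamps every record with the
# correlationId so lines from one request can be grouped across functions.
class CorrelationLogger(logging.LoggerAdapter):
    def process(self, msg, kwargs):
        return json.dumps({"correlationId": self.extra["correlationId"], "message": msg}), kwargs

def handler(event, context):
    # Reuse the id sent by the caller if present, otherwise start a new one.
    correlation_id = (event.get("headers") or {}).get("x-correlation-id") or str(uuid.uuid4())
    log = CorrelationLogger(logging.getLogger(__name__), {"correlationId": correlation_id})
    log.info("processing request")
    # Return (or forward) the id so the next component in the stack keeps using it.
    return {"statusCode": 200, "headers": {"x-correlation-id": correlation_id}}
```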
I've enjoyed working with Datadog in the past. We used CloudWatch to gather logs from our serverless functions, then ingested those into Datadog (a rough sketch of that wiring is below). It's very easy to use and supports a lot of other features such as monitoring, but you can just use it for logging as well.
With some combination of archiving logs and configuring a reasonable TTL in CloudWatch and Datadog, the cost can be manageable.
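One way to wire that up is a CloudWatch Logs subscription filter pointing at the Datadog Forwarder Lambda. A rough boto3 sketch, assuming the forwarder already exists and CloudWatch Logs is allowed to invoke it (the log group and ARN are placeholders):

```python
import boto3

logs = boto3.client("logs")

# Stream a function's CloudWatch logs to the Datadog Forwarder Lambda,
# which then ships them to Datadog.
logs.put_subscription_filter(
    logGroupName="/aws/lambda/my-function",  # placeholder log group
    filterName="datadog-forwarder",
    filterPattern="",  # empty pattern = forward every log event
    destinationArn="arn:aws:lambda:us-east-1:123456789012:function:datadog-forwarder",  # placeholder ARN
)
```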
What we did for our microservices was to generate a correlationId (or callId, or requestId...) in the first service of the call chain (at the API Gateway) and pass it on with each resulting request from one microservice to another.
This id is then printed in every stack trace and every log line. Then you just have to know your call chain, and you can look for the id in the different Kubernetes pod consoles.
Once you've identified the faulty microservice, you can reproduce the issue with a unit test by mocking the request responsible for the error.
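A tiny Python sketch of the propagation idea (the header name and helpers are illustrative, and it assumes the services call each other over HTTP with the requests library):

```python
import uuid
import requests  # assumes services call each other over HTTP

CORRELATION_HEADER = "X-Correlation-Id"  # assumed header name

def correlation_id_for(request_headers: dict) -> str:
    # At the edge (e.g. the API gateway) a new id is generated;
    # every service further down the chain reuses the one it received.
    return request_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())

def call_next_service(url: str, payload: dict, correlation_id: str):
    # Forward the id on every outgoing request so the next service's
    # logs and stack traces carry the same value.
    return requests.post(url, json=payload, headers={CORRELATION_HEADER: correlation_id})
```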
I was thinking more along the lines of tooling to visualize and monitor, but yes, correlationIds are definitely very useful for debugging.