DEV Community

Aleksandr Zakharov

Lessons we've learned after burning many thousands thanks to AWS Lambda. Expect no mercy from AWS.

Preface.

A year ago, we decided to move to a serverless architecture. Our management was very excited about it, and that excitement translated into a lot of trial and error for the developers (me included). Then one Monday we started our working day and realized that one of our Lambdas had been going down the rabbit hole the whole weekend. We were astonished, management was displeased, and I was happy to have new material for this article.

Our setup.

Several of our core microservices rely heavily on S3 event notifications. So what happened?
A developer made a mistake: a Lambda that had been triggered by a file in S3 invoked itself again for that same file. These invocations created other S3 files, which triggered other Lambdas... You get the idea.
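
To make the loop concrete, here is a minimal Python sketch of one common guard against it: the handler ignores objects that it produced itself. It assumes, hypothetically, that results are written under a dedicated `processed/` prefix; the prefix and the function names are illustrative and are not our actual code.

```python
from typing import List
from urllib.parse import unquote_plus

# Hypothetical prefix under which the function writes its own results.
OUTPUT_PREFIX = "processed/"


def should_process(key: str, output_prefix: str = OUTPUT_PREFIX) -> bool:
    """Return False for objects the function itself produced."""
    return not key.startswith(output_prefix)


def keys_to_process(event: dict) -> List[str]:
    """Extract object keys from an S3 event notification, skipping our own outputs."""
    keys = []
    for record in event.get("Records", []):
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        if should_process(key):
            keys.append(key)
    return keys


def handler(event, context):
    for key in keys_to_process(event):
        # Real processing would go here. Writing results under OUTPUT_PREFIX
        # means the notification for the result is skipped instead of looping.
        print(f"processing {key}")
```

A guard like this only covers one function talking to itself. It does not help when several Lambdas trigger each other through different files, which is why limits on the platform side still matter.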

The developer wasn't fired or sanctioned in any way, because this is an architectural problem: anyone can make a silly mistake.

How much did we lose? Tens of thousands.

We filed a support ticket afterward and were compensated only 5k, since that was the amount we had spent before the alarm came through.

Precautions we implemented to prevent future incidents.

  1. We set up budget notifications and created alarms that go to the email, Slack channel, and mobile phones of the company's key technical people.

  2. Most of the Lambdas must have reserved concurrency parameters set.

  3. Most of the Lambdas must be invoked via SQS only.

  4. We also implemented an AWS Config rule that checks all our Lambdas for reserved concurrency.
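
As an illustration of the first precaution, a budget with an email notification could be created roughly like this. This is a sketch under assumptions, not our exact setup: the budget name, limit, threshold, and addresses are hypothetical, and the client is passed in so the request can be inspected without touching AWS.

```python
def build_budget_request(account_id, name, monthly_limit_usd, emails,
                         threshold_percent=80.0):
    """Build keyword arguments for the AWS Budgets CreateBudget call."""
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": name,
            "BudgetLimit": {"Amount": str(monthly_limit_usd), "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        "NotificationsWithSubscribers": [
            {
                # Notify when actual spend exceeds threshold_percent of the limit.
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": threshold_percent,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "EMAIL", "Address": email}
                    for email in emails
                ],
            }
        ],
    }


def create_budget(budgets_client, **kwargs):
    """Send the request; budgets_client is expected to be boto3.client("budgets")."""
    return budgets_client.create_budget(**build_budget_request(**kwargs))


# Example call (all values are placeholders):
# import boto3
# create_budget(boto3.client("budgets"), account_id="123456789012",
#               name="lambda-guard", monthly_limit_usd=500,
#               emails=["oncall@example.com"])
```

Keep in mind that billing data is not real time, so a budget notification tells you about a runaway function after the fact. It is a safety net, not a brake.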

With reserved concurrency, we cap how many instances of a function can run at the same time. This essentially throttles it.
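
A rough Python sketch of both halves of this idea: setting the limit on one function, and listing functions that have no limit at all. The helper names are made up for illustration, and the audit function is a simple stand-in written with plain Lambda API calls, not the AWS Config rule itself. The client is passed in and is expected to be `boto3.client("lambda")`.

```python
def set_reserved_concurrency(lambda_client, function_name, limit):
    """Cap the number of concurrent executions for one function."""
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )


def functions_without_reserved_concurrency(lambda_client):
    """Return the names of functions that have no reserved concurrency set."""
    missing = []
    paginator = lambda_client.get_paginator("list_functions")
    for page in paginator.paginate():
        for fn in page["Functions"]:
            name = fn["FunctionName"]
            response = lambda_client.get_function_concurrency(FunctionName=name)
            # The key is absent when no reserved concurrency is configured.
            if "ReservedConcurrentExecutions" not in response:
                missing.append(name)
    return missing


# Example call (function name and limit are placeholders):
# import boto3
# client = boto3.client("lambda")
# set_reserved_concurrency(client, "my-function", 5)
# print(functions_without_reserved_concurrency(client))
```

With a limit of 5, a loop like ours can still run all weekend, but only five copies at a time, which changes the size of the bill considerably.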

SQS helps us prevent data loss. If the function hits its concurrency limit, the remaining messages stay in the queue until Lambda can take the next one.
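
A minimal sketch of the SQS side, assuming the S3 bucket sends its notifications to a queue and the Lambda reads from that queue. The helper names and the batch size are illustrative assumptions; the client is expected to be `boto3.client("lambda")`.

```python
import json
from urllib.parse import unquote_plus


def connect_queue_to_function(lambda_client, queue_arn, function_name, batch_size=10):
    """Have Lambda poll the queue and invoke the function with batches of messages."""
    return lambda_client.create_event_source_mapping(
        EventSourceArn=queue_arn,
        FunctionName=function_name,
        BatchSize=batch_size,
    )


def s3_keys_from_sqs_event(event):
    """Collect S3 object keys from SQS messages whose bodies are S3 notifications."""
    keys = []
    for message in event.get("Records", []):
        body = json.loads(message["body"])
        # S3 sends a test message without "Records" when the notification
        # is first configured, so a missing key is treated as empty.
        for record in body.get("Records", []):
            # Object keys in S3 notifications are URL-encoded.
            keys.append(unquote_plus(record["s3"]["object"]["key"]))
    return keys


def handler(event, context):
    for key in s3_keys_from_sqs_event(event):
        print(f"processing {key}")
```

The queue acts as a buffer between S3 and the function: events pile up in SQS instead of fanning out into more and more parallel invocations.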

Questions to think about.

  1. Why is there no option to kill all AWS activities after reaching some usage threshold?
  2. Is it this complicated to create an intelligent tool to help AWS customers catch this situation and avoid money loss?

Discussion (6)

Davide de Paolis

Could you provide a bit more insight into the architecture of your system and what exactly triggered this crazy bill? Your suggestions are absolutely valid, but it would be useful to understand the context and the causes.

Aleksandr Zakharov Author

Thank you for the suggestion :)
I will provide more details. I just wanted to warn anyone who deals with serverless and Lambda that this situation is not something unheard of.

Ali Khajeh-Hosseini

Sorry to hear that :( FWIW you're not alone, I've heard many cloud cost horror stories over the years.

Is it this complicated to create an intelligent tool to help AWS customers catch this situation and avoid money loss?

Cost estimation can be pretty complicated. With github.com/infracost/infracost we're building an open source tool to help engineers get cost estimates before launching resources. Initially we're focused on Terraform (so unfortunately it couldn't have helped you), but we have plans to go beyond that. It's taken a fairly large community over 6 months to code the price mappings for ~200 cloud resources across AWS/Azure/GCP.

What don't you love about the AWS Config solution you implemented? I'm wondering if we should code-up the precautions from your article into infracost...

Patryk Milewski

If you are using Node.js, then just use this in the future:
github.com/getndazn/dazn-lambda-po...

Aleksandr Zakharov Author

Woof, thank you :)
But unfortunately our choice is Python.

Yehuda Fitterman

We at Lumigo have a solution for that, and it sounds like we can add value in some other areas too.
I would love to schedule an intro call to show you how you can use us :)
Yehuda@lumigo.io