This post is written in English by a Taiwanese author. I'm starting to write in English to reach a larger community. I'm not fluent yet, so please send me feedback and corrections if you find any errors.
I currently work on the payment team at an OTT company, where we handle customers' payment transactions. We try to offer as many payment methods as we can so that customers can pay easily. That means we need to run a lot of payment reconciliations with third-party providers such as PayPal, Adyen, and others.
We built this asynchronous workflow on top of AWS Step Functions. One day, a customer complained that their PayPal account had been charged twice on 12/8, but only one invoice appeared in our backend console.
After checking the AWS Step Functions console, I found this error:
Rate Exceeded. (Service: AWSLambda; Status Code: 429; Error Code: TooManyRequestsException;)
That left me with three questions:
- What is throttled? Why does it happen, and how often?
- How can we avoid Lambda.TooManyRequestsException?
- How do we retry the process in Step Functions?
I started troubleshooting this rate-exceeded exception by following this great article: lambda-troubleshoot-throttling.
The easiest way to check for throttling at the account level is the AWS Lambda dashboard.
It gives an account-level overview. The dashboard showed that many Lambda functions were throttled on 12/8, so I guessed that the throttling was happening at the account level.
According to the application-level ConcurrentExecutions metric, there was no concurrency peak, yet one throttle was still recorded on 12/8. Now the reason for the exception was clear: the account-wide Lambda concurrency limit had been hit, and this application happened to be one of the functions that got throttled. After that day, the SRE team asked AWS to increase the account limit. However, considering how important this payment transaction process is, increasing the account limit alone is not enough.
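The same comparison between account-level and function-level throttling can also be made outside the console. Below is a minimal sketch that queries the CloudWatch Throttles metric in the AWS/Lambda namespace, first without dimensions (all functions in the region) and then for a single function. The client is assumed to be a `boto3.client("cloudwatch")` object passed in by the caller, and the helper names are only for illustration.

```python
from datetime import timedelta


def day_window(day_start):
    """Return (start, end) covering the 24 hours that begin at day_start."""
    return day_start, day_start + timedelta(days=1)


def throttle_query(start, end, function_name=None):
    """Build get_metric_statistics arguments for the Lambda Throttles metric.

    With no function_name, no dimension is set, so CloudWatch returns the
    value aggregated across all functions in the region.
    """
    dimensions = []
    if function_name is not None:
        dimensions.append({"Name": "FunctionName", "Value": function_name})
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": "Throttles",
        "Dimensions": dimensions,
        "StartTime": start,
        "EndTime": end,
        "Period": 3600,  # one datapoint per hour
        "Statistics": ["Sum"],
    }


def total_throttles(cloudwatch, start, end, function_name=None):
    """Sum the throttles reported in the window.

    `cloudwatch` is assumed to behave like boto3.client("cloudwatch").
    """
    response = cloudwatch.get_metric_statistics(
        **throttle_query(start, end, function_name)
    )
    return sum(point["Sum"] for point in response["Datapoints"])
```

If the total across all functions is high while the total for the payment function is only one or two, that would match what the dashboard showed: the function was a bystander of an account-level limit rather than the cause.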
AWS provides two options for allocating dedicated capacity to specific Lambda functions:
- reserved concurrency
- provisioned concurrency
When we set reserved concurrency, AWS sets aside that amount of concurrency exclusively for the function. On the other hand, the function can never scale beyond that number, so it does not work well with auto-scaling. Furthermore, the reserved amount stays subtracted from the account's shared pool even when the function is not using it.
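To make the trade-off concrete, here is a minimal sketch of setting reserved concurrency through the Lambda API. It assumes `lambda_client` behaves like `boto3.client("lambda")`; the function names and the pre-check are my own illustration. The check relies on the rule that Lambda keeps at least 100 units of concurrency unreserved in the account, and for simplicity it ignores any reservation the function may already hold.

```python
MIN_UNRESERVED = 100  # Lambda keeps at least this much concurrency unreserved


def max_reservable(account_settings):
    """How much more concurrency can be reserved, given get_account_settings output."""
    unreserved = account_settings["AccountLimit"]["UnreservedConcurrentExecutions"]
    return max(unreserved - MIN_UNRESERVED, 0)


def reserve_concurrency(lambda_client, function_name, amount):
    """Reserve `amount` concurrent executions for one function.

    `lambda_client` is assumed to behave like boto3.client("lambda").
    The reserved amount is removed from the shared pool for every other
    function in the account, whether or not this function uses it.
    """
    limit = max_reservable(lambda_client.get_account_settings())
    if amount > limit:
        raise ValueError(
            f"cannot reserve {amount}; at most {limit} can be reserved right now"
        )
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=amount,
    )
```

Calling `reserve_concurrency(client, "my-payment-function", 50)` (a hypothetical function name) would guarantee 50 concurrent executions for that function and also cap it at 50.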
Provisioned concurrency is a built-in pre-warming feature that avoids AWS Lambda cold starts. It supports AWS Application Auto Scaling for AWS Lambda. Compared with reserved concurrency, it provides much more flexibility, but it costs extra because we pay to keep the Lambda instances warm.
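As a rough illustration of the auto-scaling side, the sketch below builds the arguments for Application Auto Scaling's `register_scalable_target` and `put_scaling_policy` calls for a Lambda alias. The `autoscaling_client` is assumed to behave like `boto3.client("application-autoscaling")`. The policy name, the 0.7 utilization target, and the capacity bounds are placeholder values of mine, not settings from our production system.

```python
def resource_id(function_name, alias):
    """Application Auto Scaling identifies a Lambda alias as function:NAME:ALIAS."""
    return f"function:{function_name}:{alias}"


def scalable_target(function_name, alias, min_capacity, max_capacity):
    """Arguments for register_scalable_target on provisioned concurrency."""
    return {
        "ServiceNamespace": "lambda",
        "ResourceId": resource_id(function_name, alias),
        "ScalableDimension": "lambda:function:ProvisionedConcurrency",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


def utilization_policy(function_name, alias, target_utilization=0.7):
    """Arguments for put_scaling_policy: track provisioned concurrency utilization."""
    return {
        "PolicyName": f"{function_name}-{alias}-utilization",  # hypothetical name
        "ServiceNamespace": "lambda",
        "ResourceId": resource_id(function_name, alias),
        "ScalableDimension": "lambda:function:ProvisionedConcurrency",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_utilization,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
            },
        },
    }


def enable_autoscaling(autoscaling_client, function_name, alias, min_capacity, max_capacity):
    """Register the alias as a scalable target, then attach the tracking policy.

    `autoscaling_client` is assumed to behave like
    boto3.client("application-autoscaling").
    """
    autoscaling_client.register_scalable_target(
        **scalable_target(function_name, alias, min_capacity, max_capacity)
    )
    autoscaling_client.put_scaling_policy(**utilization_policy(function_name, alias))
```

Provisioned concurrency is configured on a published version or alias rather than on `$LATEST`, which is why the helpers take an alias.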
Both of them can protect our asynchronous transaction flow from Lambda.TooManyRequestsException, but are they the best solution?
Failures happen for a variety of reasons. Retrying is one of the most straightforward strategies when a service fails. However, since most of our Lambda functions are not idempotent, we need to think twice before adding retries as our failover mechanism.
Fortunately, we had already implemented error handling in our asynchronous transaction flow to reconcile invoices. In this case, I just need to retry that step.
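Retrying a single step in Step Functions is done with a Retry field on the Task state. The sketch below builds such a state as a Python dictionary and serializes it to Amazon States Language JSON. The interval, attempt count, and backoff rate are example values, and the function and state names in the usage note are hypothetical. Because a throttled invocation is rejected before the function code runs, retrying only Lambda.TooManyRequestsException should not repeat a side effect, even for a function that is not idempotent.

```python
import json


def throttle_retry(interval_seconds=2, max_attempts=4, backoff_rate=2.0):
    """A retrier that matches only the Lambda throttling error."""
    return {
        "ErrorEquals": ["Lambda.TooManyRequestsException"],
        "IntervalSeconds": interval_seconds,
        "MaxAttempts": max_attempts,
        "BackoffRate": backoff_rate,
    }


def retry_delays(retrier):
    """Seconds waited before each retry: the interval grows by BackoffRate each time."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        for attempt in range(retrier["MaxAttempts"])
    ]


def task_state(function_name, next_state):
    """A Task state that invokes a Lambda function and retries when throttled."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
        "Retry": [throttle_retry()],
        "Next": next_state,
    }


def to_asl_json(state):
    """Serialize a state definition to Amazon States Language JSON."""
    return json.dumps(state, indent=2)
```

For example, `print(to_asl_json(task_state("reconcile-invoice", "Done")))` prints a state that can be pasted into a state machine definition. With the default values, `retry_delays(throttle_retry())` gives waits of 2, 4, 8, and 16 seconds before the step finally fails.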