Discussion on: Understanding the AWS Lambda SQS Integration


Great article. I'm relatively new to this area, but it appears that the SQS -> Lambda integration is far more complex and subtle than it looks to most programmers, who quite often take the defaults :) Everything appears fine until a Lambda fails, and then it is difficult to recover or troubleshoot.
I've come across a Lambda with these settings:
visibility_timeout_seconds = 1440
message_retention_seconds = 345600
timeout = 120

The way I read this is that if this Lambda fails to process a message and keeps failing on every subsequent retry, then the message will be retried 345600/1440 = 240 times, with one retry every 1440 seconds. Is my calculation accurate? Thanks for your time on this.
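As a quick sanity check of that arithmetic, here is a minimal Python sketch. It assumes a failed message only becomes visible again once the visibility timeout expires, and that the retention clock keeps running from the first send. The function name is made up for illustration.

```python
def estimate_receive_attempts(retention_seconds: int, visibility_timeout_seconds: int) -> int:
    """Rough upper bound on how often one failing message is delivered.

    Assumes one delivery per visibility-timeout window until the
    retention period runs out (hypothetical helper, not an AWS API).
    """
    return retention_seconds // visibility_timeout_seconds


# Values from the settings quoted above.
message_retention_seconds = 345600   # 4 days
visibility_timeout_seconds = 1440    # 24 minutes

attempts = estimate_receive_attempts(message_retention_seconds, visibility_timeout_seconds)
retention_days = message_retention_seconds / 86400

print(f"about {attempts} attempts over {retention_days:.0f} days")
```

So with no redrive policy in place, a permanently failing message could be handed to the function roughly every 24 minutes for four days.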

Frank Rosner • May 15 '20

Unfortunately I could not find any documentation on this but I think you are correct. I assume that the message retention timer is not reset after changing the visibility.

The practical implication of this is that you might end up invoking your Lambda function over and over for a failing message, e.g. one that throws an exception in your Lambda code, until you "fix" your code.

However, the item age seems to be reset: when I look at ApproximateAgeOfOldestMessage while one message is constantly being retried, the graph looks like a sawtooth, indicating that the message age is indeed reset when the message becomes visible again.

What you can do, however, to detect a scenario where your messages are being retried all the time is to configure an alarm on ApproximateAgeOfOldestMessage based on the sawtooth pattern.
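To illustrate what "based on the sawtooth pattern" could mean, here is a rough Python sketch that counts sharp drops in a series of age samples. It is only an idea of the detection logic, not a CloudWatch alarm definition. The function name, the 50% drop threshold and the sample values are all made up for illustration.

```python
def count_sawtooth_resets(ages, drop_fraction=0.5):
    """Count how often the age falls sharply between consecutive samples.

    A fall of more than `drop_fraction` of the previous value is treated
    as a reset (assumed heuristic, threshold chosen arbitrarily).
    """
    resets = 0
    for previous, current in zip(ages, ages[1:]):
        if previous > 0 and current < previous * (1 - drop_fraction):
            resets += 1
    return resets


# Invented samples of ApproximateAgeOfOldestMessage in seconds.
samples = [300, 600, 900, 1200, 60, 360, 660, 960, 1260, 120, 420]
resets = count_sawtooth_resets(samples)
print(f"{resets} resets detected")
```

Repeated resets over a short window would hint that the same message keeps coming back, whereas a steadily growing age points more towards a consumer that is simply too slow.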

Does it make sense?

harkinj • May 15 '20

Thanks for the reply and the information. I did some experiments and what I outlined is actually occurring. I think a potential way to handle this is to put a RedrivePolicy in place and control the number of retries via the maxReceiveCount setting. Unfortunately, in the system I've inherited, the suite of Lambda functions is not idempotent, so I may need to set maxReceiveCount to 1 and batch_size to 1 (to rule out partial batch failures) and get the message to the DLQ as soon as possible rather than retrying. lumigo.io/blog/sqs-and-lambda-the-... has some useful info. What we decide to do with messages in the DLQ will be fun :) but at least we will not have lost messages that failed to be processed. Thanks for your time.
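For anyone wanting to try the same thing, here is a minimal Python sketch of how the RedrivePolicy queue attribute might be built. The DLQ ARN is a placeholder and the helper name is made up. The sketch only builds the attribute value and does not call AWS.

```python
import json


def build_redrive_attributes(dead_letter_queue_arn: str, max_receive_count: int) -> dict:
    """Build the Attributes mapping for an SQS queue with a redrive policy.

    Hypothetical helper: SQS expects RedrivePolicy as a JSON string with
    deadLetterTargetArn and maxReceiveCount.
    """
    policy = {
        "deadLetterTargetArn": dead_letter_queue_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    return {"RedrivePolicy": json.dumps(policy)}


# Placeholder ARN, not a real queue.
dlq_arn = "arn:aws:sqs:eu-west-1:123456789012:example-dlq"

# maxReceiveCount of 1: move the message to the DLQ after the first failed receive.
attributes = build_redrive_attributes(dlq_arn, max_receive_count=1)
print(attributes["RedrivePolicy"])
```

With boto3, a mapping like this could be passed as the Attributes argument of the SQS client's set_queue_attributes call, together with the QueueUrl of the source queue. The batch size is a separate setting on the Lambda event source mapping, not on the queue.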