Todor Todorov for AWS Community Builders

Automating Redrive from DLQ for FIFO SQS Queue

The case:

Dead Letter Queues (DLQs) are a powerful feature of Amazon Simple Queue Service (SQS) that allows you to isolate and handle failed messages. When messages cannot be processed successfully, they are sent to the DLQ for further analysis and troubleshooting. AWS provides an automated way to redrive messages from standard queues, but as of this writing there is no such option for FIFO queues.
That said, and because I want more robustness in my day-to-day work as a DevOps Engineer, it's essential to automate redriving these messages back to the original FIFO queue for successful processing once the underlying issues are resolved, avoiding the error-prone approach of a human taking manual action on a message-by-message basis.
In this blog post, we'll explore how to automate the Redrive process using an AWS Lambda function and an infrastructure-as-code (IaC) template.

Solution:

Prerequisites:
Before diving into the automation process, make sure you have resolved the root cause that led to messages ending up in the DLQ. Redriving messages without addressing the underlying issues can result in an endless loop of failures. It's important to thoroughly test and validate your application code to ensure it can successfully process the redriven messages.

Infrastructure as Code (IaC) Template:
We'll use an AWS Serverless Application Model (SAM) template, written in YAML, to deploy the Lambda function as infrastructure as code. The template provisions an AWS Lambda function that reads all the messages from the DLQ and sends them back to the original FIFO queue. Here's the template:

AWSTemplateFormatVersion: 2010-09-09
Description: Automate Redrive from DLQ for FIFO SQS queue

Transform:
- AWS::Serverless-2016-10-31

Resources:

  LambdaFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/lambda_function.handler
      Runtime: python3.10
      Architectures:
        - x86_64
      MemorySize: 128
      Timeout: 300
      Description: A Lambda function that reads all the SQS messages from DLQ and sends them back to the original FIFO queue. NB! Ensure you have resolved the issue which caused messages to end up in DLQ before running this function.
      Policies:
        - AWSLambdaBasicExecutionRole
        - SQSPollerPolicy:
            QueueName: '*'
        - SQSSendMessagePolicy:
            QueueName: '*'
        - Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - sqs:ListQueues
                - sqs:GetQueueUrl
                - sqs:ListDeadLetterSourceQueues
                - sqs:ListQueueTags
              Resource: !Sub arn:aws:sqs:${AWS::Region}:${AWS::AccountId}:*

This CloudFormation template defines a serverless AWS Lambda function that will be responsible for reading messages from the DLQ and sending them back to the original FIFO queue.

To deploy the SAM template, execute the following commands:

sam build
sam deploy --guided

Application Code:
To implement the Lambda function, we'll use Python and the AWS SDK for Python (Boto3). Here's the code:

import boto3

def handler(event, context):
    sqs = boto3.client('sqs')

    # The payload carries queue names; resolve them to queue URLs
    dlq_url = sqs.get_queue_url(QueueName=event['dlq_name'])['QueueUrl']
    target_queue_url = sqs.get_queue_url(QueueName=event['target_queue_name'])['QueueUrl']

    while True:
        # Read messages from the DLQ in batches of up to 10
        response = sqs.receive_message(
            QueueUrl=dlq_url,
            MaxNumberOfMessages=10,
            AttributeNames=['MessageGroupId']
        )

        if 'Messages' not in response:
            print('No messages in DLQ')
            break

        for message in response['Messages']:
            # Send the message back to the original FIFO queue, preserving its
            # message group and using the original MessageId for deduplication
            send_response = sqs.send_message(
                QueueUrl=target_queue_url,
                MessageBody=message['Body'],
                MessageGroupId=message['Attributes']['MessageGroupId'],
                MessageDeduplicationId=message['MessageId']
            )

            if 'MessageId' in send_response:
                # Message redriven successfully, delete it from the DLQ
                delete_response = sqs.delete_message(
                    QueueUrl=dlq_url,
                    ReceiptHandle=message['ReceiptHandle']
                )
                if 'ResponseMetadata' in delete_response and delete_response['ResponseMetadata']['HTTPStatusCode'] != 200:
                    raise Exception('Failed to delete message from DLQ: {}'.format(delete_response))
            else:
                # Handle the error case when redriving a message fails
                print('Failed to redrive message: {}'.format(send_response))

    return {
        'statusCode': 200,
        'body': 'DLQ messages redriven successfully.'
    }

This code defines the Lambda function's handler, which takes the event and context as input parameters. The function uses the Boto3 SQS client to resolve the queue names from the payload into queue URLs, retrieve messages from the DLQ (dlq_url), and send them back to the original FIFO queue (target_queue_url). It also deletes successfully redriven messages from the DLQ and reports any failures.

Deployment and Configuration:
To deploy the solution, use the SAM template provided above. Once deployed, the Lambda function can be invoked on demand to process messages from the DLQ, passing the DLQ name and the target FIFO queue name in the following payload:

{
    "dlq_name": "my-sqs-queue-dlq.fifo",
    "target_queue_name": "my-sqs-queue.fifo"
}

NB! Change the values above as appropriate for your case.
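
For example, the function can be invoked on demand with the AWS CLI. The function name below is an assumption (substitute whatever name SAM assigned to the function in your stack), and the payload mirrors the example above:

aws lambda invoke \
    --function-name redrive-dlq-fifo \
    --cli-binary-format raw-in-base64-out \
    --payload '{"dlq_name": "my-sqs-queue-dlq.fifo", "target_queue_name": "my-sqs-queue.fifo"}' \
    response.json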

To ensure the successful execution of the Lambda function, make sure the IAM role associated with it has the appropriate permissions. The provided template includes the necessary policies for polling and sending to the SQS queues and for basic Lambda execution (writing logs to CloudWatch).

Monitoring and Error Handling:
It's crucial to monitor the execution of the Lambda function and handle any potential errors. You can enable logging for the Lambda function and configure a CloudWatch Log Group to capture logs. Monitor the logs for any failures during the redriving process and take appropriate action to rectify the issues.
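
For instance, assuming you named the stack redrive-dlq-fifo during sam deploy --guided, you could tail the function's logs with the SAM CLI (LambdaFunction is the logical ID from the template above):

sam logs -n LambdaFunction --stack-name redrive-dlq-fifo --tail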

Additionally, consider setting up alarms and notifications using Amazon CloudWatch. Configure alarms to trigger notifications when the Redrive function encounters errors or exceeds certain thresholds, ensuring timely intervention and troubleshooting.
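
As a minimal sketch, an alarm on the function's Errors metric could be added to the Resources section of the same template. The SNS topic referenced in AlarmActions (AlarmNotificationTopic) is an assumption and would need to be defined in the stack or replaced with an existing topic ARN:

  RedriveErrorsAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Alarm when the DLQ redrive Lambda function reports errors
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref LambdaFunction
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - !Ref AlarmNotificationTopic # assumed SNS topic for notifications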

Conclusion:
Automating the Redrive process from the Dead Letter Queue for a FIFO SQS queue is crucial to ensure the successful processing of failed messages. By using AWS Lambda and the provided infrastructure as code template, you can set up a scalable and reliable solution. Remember to address the root cause of message failures before initiating the Redrive process.

By leveraging the power of AWS services and automation, you can enhance the resiliency and reliability of your message-processing workflows. With proper monitoring and error handling, you can ensure the smooth and efficient operation of your FIFO SQS queues.
