Jones Zachariah Noel for AWS Community Builders

Posted on • Originally published at aws.hashnode.com

Handling errors with StepFunctions SNS SDK integration

AWS Step Functions is a service for designing and orchestrating serverless workflows.

When a workflow integrates with different services, a state can fail, and by default that fails the complete execution. As with error handling in any programming language, Step Functions lets us apply error handling techniques to retry the failed state or terminate the execution gracefully.

In this blog post, we will look at how Step Functions error handling techniques could be used with states which have an SNS SDK integration.

Error handling on Step Functions

AWS Step Functions natively supports error handling with the retry and catch definitions on a state.

A state can fail with an exception for various reasons, such as:

  • The state is unable to fetch/read parameters from the event passed from the previous step or invocation event JSON.
  • The state which uses SDK integration could be missing the needed permission to invoke the respective SDK API.
  • The state could exceed its processing time limit and time out.

Predefined error names such as States.DataLimitExceeded, States.Timeout, and States.Permissions indicate why the state failed and point to the steps needed to resolve it. As a workflow designer, you can decide per error name whether to retry the state or handle the failure with a catch.
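To make the matching by error name concrete, here is a minimal JavaScript sketch of how a list of retry or catch entries is evaluated, as I understand the documented rules: entries are checked in order, and States.ALL acts as a wildcard that does not cover States.DataLimitExceeded or States.Runtime. The function findHandler and the Handle Timeout state are hypothetical names used only for illustration; this is a simplified model, not the service's implementation.

```javascript
// Simplified model of how a Retry/Catch list is matched against an error name.
// States.Runtime can never be retried or caught, so it is always unhandled here.
const NOT_COVERED_BY_ALL = new Set(["States.DataLimitExceeded", "States.Runtime"]);

function findHandler(errorName, handlers) {
  if (errorName === "States.Runtime") return null;
  for (const handler of handlers) {
    const names = handler.ErrorEquals;
    const exactMatch = names.includes(errorName);
    const wildcardMatch =
      names.includes("States.ALL") && !NOT_COVERED_BY_ALL.has(errorName);
    // The first entry that matches wins; later entries are ignored.
    if (exactMatch || wildcardMatch) return handler;
  }
  return null; // no entry matched: the error is unhandled and the execution fails
}

// Hypothetical catch list: timeouts get their own state, everything else is generic.
const catchers = [
  { ErrorEquals: ["States.Timeout"], Next: "Handle Timeout" },
  { ErrorEquals: ["States.ALL"], Next: "Notify Error to SNS topic" },
];

console.log(findHandler("States.Timeout", catchers).Next);
console.log(findHandler("S3.S3Exception", catchers).Next);
console.log(findHandler("States.Runtime", catchers));
```

Putting the specific entry before the States.ALL entry matters, because the first matching entry is the one that gets used.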

Example of error handling with Step Functions with retry and catch

You can read more about error handling in Step Functions here.

Understanding the workflow

In this workflow, we will use multiple states that invoke different AWS services. They run in parallel as branches of a Parallel state. If any branch results in an error, the whole parallel state fails and its other branches are stopped.

The catch on the parallel state then executes a state that uses the SNS SDK integration to publish the error information to a specific topic.


Overview of the complete workflow which demonstrates using SNS SDK for error handling.

The parallel state executes three branches: a Lambda function invocation, the S3 SDK integrations for GetBucketAcl followed by ListObjectsV2, and the DynamoDB SDK integration for DescribeTable.

The Parallel state with error catch definition

The catch is defined in the parallel state, and that catch then executes the step, Notify Error to SNS topic. This takes the complete error as input from the parallel state, and it maps the input to SNS SDK Publish API's Message parameter.

Notify Error to SNS topic state with SNS SDK API integration to publish to topic
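The mapping from the caught error to the Publish call can be sketched in JavaScript as below. This is a simplified model under a few assumptions: the error output is an object with Error and Cause fields, only the result paths "$" and "$.field" are handled, and the message is modeled as a JSON string because the emails in this post arrive as JSON. The helper names applyCatchResultPath and buildPublishParams are hypothetical, and the sample error text is illustrative rather than captured output.

```javascript
// Simplified model of the ResultPath on a catch (handles only "$" and "$.field").
function applyCatchResultPath(stateInput, errorOutput, resultPath) {
  if (resultPath === "$") return errorOutput; // the error replaces the whole input
  const field = resultPath.replace(/^\$\./, "");
  return { ...stateInput, [field]: errorOutput }; // the error is added to the input
}

// Simplified model of "Message.$": "$": the whole state input becomes the message.
function buildPublishParams(stateInput, topicArn) {
  return { TopicArn: topicArn, Message: JSON.stringify(stateInput) };
}

// Illustrative values only; the account ID stays elided as in the post.
const executionInput = { orderId: 42 };
const errorOutput = { Error: "S3.S3Exception", Cause: "Access Denied (illustrative)" };
const topicArn = "arn:aws:sns:us-east-1:xxxxxxxx:ErrorNotification";

const replaced = applyCatchResultPath(executionInput, errorOutput, "$");
console.log(buildPublishParams(replaced, topicArn));

// With a nested path, the original input survives next to the error.
console.log(applyCatchResultPath(executionInput, errorOutput, "$.error"));
```

With "$" the notification carries only the error. A nested path such as "$.error" would be one way to keep the original execution input in the message as well.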

Whether the parallel state completes normally or an error routes the execution through Notify Error to SNS topic, the execution then moves to the Success state.

With this workflow, an exception is handled by the SNS SDK integration and the execution still terminates with success. If all is well, every branch of the parallel state completes and the execution also ends in the Success state.

{
  "Comment": "State machine to demonstrate error handling with SNS SDK integration",
  "StartAt": "Parallel",
  "States": {
    "Parallel": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "Lambda Invoke",
          "States": {
            "Lambda Invoke": {
              "Type": "Task",
              "Resource": "arn:aws:states:::lambda:invoke",
              "OutputPath": "$.Payload",
              "Parameters": {
                "Payload.$": "$",
                "FunctionName": "arn:aws:lambda:us-east-1:xxxxxxxx:function:ErrorSNSDemo:$LATEST"
              },
              "Retry": [
                {
                  "ErrorEquals": [
                    "Lambda.ServiceException",
                    "Lambda.AWSLambdaException",
                    "Lambda.SdkClientException"
                  ],
                  "IntervalSeconds": 2,
                  "MaxAttempts": 6,
                  "BackoffRate": 2
                }
              ],
              "End": true
            }
          }
        },
        {
          "StartAt": "GetBucketAcl",
          "States": {
            "GetBucketAcl": {
              "Type": "Task",
              "Parameters": {
                "Bucket": "textract-sample-bucket"
              },
              "Resource": "arn:aws:states:::aws-sdk:s3:getBucketAcl",
              "Next": "ListObjectsV2"
            },
            "ListObjectsV2": {
              "Type": "Task",
              "Parameters": {
                "Bucket": "textract-sample-bucket"
              },
              "Resource": "arn:aws:states:::aws-sdk:s3:listObjectsV2",
              "End": true
            }
          }
        },
        {
          "StartAt": "DescribeTable",
          "States": {
            "DescribeTable": {
              "Type": "Task",
              "Parameters": {
                "TableName": "TextractKeywordsDB"
              },
              "Resource": "arn:aws:states:::aws-sdk:dynamodb:describeTable",
              "End": true
            }
          }
        }
      ],
      "Catch": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "Notify Error to SNS topic",
          "ResultPath": "$"
        }
      ],
      "Next": "Success"
    },
    "Success": {
      "Type": "Succeed"
    },
    "Notify Error to SNS topic": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:xxxxxxxx:ErrorNotification",
        "Message.$": "$"
      },
      "Next": "Success"
    }
  },
  "TimeoutSeconds": 20
}
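The retry block on the Lambda Invoke state is worth a quick calculation. The sketch below assumes the usual exponential backoff rule, where each wait is the previous one multiplied by BackoffRate, and ignores the time the task itself takes. The function retryDelays is a hypothetical helper.

```javascript
// Wait before each retry attempt: IntervalSeconds * BackoffRate^(attempt index).
function retryDelays({ IntervalSeconds, MaxAttempts, BackoffRate }) {
  const delays = [];
  for (let attempt = 0; attempt < MaxAttempts; attempt++) {
    delays.push(IntervalSeconds * Math.pow(BackoffRate, attempt));
  }
  return delays;
}

// Values taken from the Retry block of the Lambda Invoke state.
const delays = retryDelays({ IntervalSeconds: 2, MaxAttempts: 6, BackoffRate: 2 });
const totalWait = delays.reduce((sum, seconds) => sum + seconds, 0);

console.log(delays);    // [ 2, 4, 8, 16, 32, 64 ]
console.log(totalWait); // 126
```

Since the state machine sets TimeoutSeconds to 20, an execution that keeps hitting one of the three listed Lambda errors would most likely time out before all six retries are used. Also note that the retry only covers those three error names, so any other error goes straight to the catch on the parallel state.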

Note: When you create the state machine, an IAM role is created for it, but the auto-generated policies currently don't cover the SDK-based API actions. You have to add those permissions to the IAM role after it's created.
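As a rough guide, the sketch below builds a policy document with the actions this workflow calls and prints it as JSON. It is only an example under some assumptions: the table is assumed to live in us-east-1 like the function and the topic, the account ID stays elided as in the definition above, and ListObjectsV2 is authorized through the s3:ListBucket action. Scope the resources to whatever fits your account.

```javascript
// Assumptions: same region for all resources; account ID elided as in the post.
const REGION = "us-east-1";
const ACCOUNT = "xxxxxxxx";
const functionArn = `arn:aws:lambda:${REGION}:${ACCOUNT}:function:ErrorSNSDemo`;

const policy = {
  Version: "2012-10-17",
  Statement: [
    {
      // Lambda Invoke state (unqualified ARN plus versions such as $LATEST).
      Effect: "Allow",
      Action: ["lambda:InvokeFunction"],
      Resource: [functionArn, `${functionArn}:*`],
    },
    {
      // GetBucketAcl and ListObjectsV2 states; ListObjectsV2 needs s3:ListBucket.
      Effect: "Allow",
      Action: ["s3:GetBucketAcl", "s3:ListBucket"],
      Resource: ["arn:aws:s3:::textract-sample-bucket"],
    },
    {
      // DescribeTable state.
      Effect: "Allow",
      Action: ["dynamodb:DescribeTable"],
      Resource: [`arn:aws:dynamodb:${REGION}:${ACCOUNT}:table/TextractKeywordsDB`],
    },
    {
      // Notify Error to SNS topic state.
      Effect: "Allow",
      Action: ["sns:Publish"],
      Resource: [`arn:aws:sns:${REGION}:${ACCOUNT}:ErrorNotification`],
    },
  ],
};

console.log(JSON.stringify(policy, null, 2));
```

Removing one statement at a time from such a policy is a simple way to reproduce the failing executions described in the next section.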

Different workflow executions

Execution 1 : When IAM role doesn't have dynamodb:DescribeTable permission.


The execution 1 details on AWS Console.

The parallel state starts all three branches. When the DynamoDB branch calls the DescribeTable API, the IAM policy doesn't allow it, which raises the error DynamoDb.DynamoDbException. The parallel state catches it and executes the Notify Error to SNS topic state. The topic has an email subscriber, which receives the following JSON-based email.

JSON based email received with error description.

Execution 2 : When IAM role doesn't have S3 permission.


The execution 2 details on AWS Console.

The parallel state starts all three branches. When the S3 branch calls the GetBucketAcl API, the IAM policy doesn't allow it, which raises the error S3.S3Exception. The parallel state catches it and executes the Notify Error to SNS topic state. The topic has an email subscriber, which receives the following JSON-based email.

JSON based email received with S3 access denied.

Execution 3 : When IAM role doesn't have s3:ListBucket permission (required by ListObjectsV2) but has s3:GetBucketAcl.


The execution 3 details on AWS Console.

The parallel state starts all three branches. The S3 branch successfully executes the GetBucketAcl API and gets a response, but the IAM policy doesn't allow the ListObjectsV2 API call that follows. This raises the error S3.S3Exception, which the parallel state catches before executing the Notify Error to SNS topic state. The DynamoDB operation, which was running in parallel, completed successfully. The topic has an email subscriber, which receives the following JSON-based email.

JSON based email received with S3 access denied.

Execution 4 : All permissions added.


The execution 4 details on AWS Console.

With all the permissions, the states execute successfully, and because there is no exception, Notify Error to SNS topic state doesn't get executed.

Execution 5 : When Lambda function throws an error.

The execution 5 details on AWS Console.

Even with all the permissions in place, programmatic errors thrown by your Lambda function code are also handled by the catch. To demonstrate this, I added a snippet that throws an error to the Lambda function, which uses the Node.js runtime.

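exports.handler = async (event) => {
    const response = {
        statusCode: 200,
        body: JSON.stringify('Hello from Lambda!'),
    };
    // Throw unconditionally to demonstrate the catch path. Because the handler
    // is async, the thrown error makes Lambda report the invocation as failed.
    throw new Error("An Error occurred in Lambda function code!!!");
    // return response; // restore this line (and remove the throw) for the success path
};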

This error is caught and gracefully handled by the Notify Error to SNS topic state, which sends the notification email.

JSON based email received with Lambda function error.
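If a subscriber wants a shorter summary than the raw JSON, the message can be post-processed. The sketch below assumes the message has Error and Cause fields, and that for a Node.js Lambda failure the Cause is itself a JSON string with errorType and errorMessage fields, while SDK integration errors carry plain text. The helper summarizeError and both sample messages are hypothetical and only illustrative, not captured from the executions above.

```javascript
// Turn a caught error object into a one-line summary.
function summarizeError(errorOutput) {
  let cause = errorOutput.Cause;
  try {
    // Lambda failures are assumed to carry a JSON string in Cause.
    const parsed = JSON.parse(errorOutput.Cause);
    if (parsed && typeof parsed === "object" && parsed.errorMessage) {
      cause = `${parsed.errorType}: ${parsed.errorMessage}`;
    }
  } catch (err) {
    // Cause was plain text (for example, from an SDK integration); keep it as-is.
  }
  return `[${errorOutput.Error}] ${cause}`;
}

// Illustrative messages only.
const lambdaFailure = {
  Error: "Error",
  Cause: JSON.stringify({
    errorType: "Error",
    errorMessage: "Something failed (illustrative)",
    trace: [],
  }),
};
const sdkFailure = { Error: "S3.S3Exception", Cause: "Access Denied (illustrative)" };

console.log(summarizeError(lambdaFailure));
console.log(summarizeError(sdkFailure));
```

The try/catch keeps the helper safe for both shapes, so the same subscriber code could process notifications from any branch of the parallel state.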

Conclusion

With the error handling techniques provided by Step Functions, you can handle errors gracefully. Because the AWS SDK integrations support 200+ services, the catch path can call those services as well, for more automated error handling.

Top comments (8)

Sebastian Bille

Love this series, great content Jones! 🙌

Jones Zachariah Noel Author

Thanks @tastefulelk. Anything specific on Step Functions you are expecting to be covered??

Sebastian Bille

Perhaps it's a little too specific but one thing I needed a while ago and found pretty tricky was exposing and initializing a workflow in an API and then getting the status/result of the entire workflow, not just the first step that's exposed in the API.

Jones Zachariah Noel Author

So you mean more like API GW -> StepFunctions and then the response from StepFunctions -> API GW??

Sebastian Bille

Yeah, so again it might be too specific - you decide. But my case was I wanted to kick off a long-ish running job from an API call and then be able to ping a status endpoint to get info on how far the job had actually proceeded through the state machine

Jones Zachariah Noel Author • Edited on

More of a status ping back? But remember that if your complete Step Functions take more than 30s then API Gateway would timeout. Are you looking at only REST APIs for this? Or GraphQL or websocket also works for you?

Sebastian Bille

Oh no, the API responded directly with a 200 saying the workflow kicked off successfully. It might help if I describe the use case:

I had a CLI app from which I wanted to let a user issue a command that executed a pretty complicated workflow. I orchestrated the workflow in StepFunctions and used an APIGW with an endpoint that exposed the first step of the State Machine and which returned a status 200 immediately. But since the workflow takes a minute or so, I wanted to be able to continuously poll for status updates on how far the workflow had proceeded to show the user what step was currently executing.

Jones Zachariah Noel Author

Got it. Let me figure out a way to implement this. Thanks for the awesome inputs!!! 👍👍
