Kasun de Silva for AWS Community Builders

Posted on Aug 1, 2024

Mastering AWS Step Functions Error Handling

#serverless #stepfunctions #lambda #aws

AWS Step Functions is a powerful orchestration service that enables developers to build and coordinate workflows using a series of steps, such as AWS Lambda functions, ECS tasks, or other AWS services. One of the critical aspects of building robust workflows is handling errors effectively. In this blog post, we'll dive into the different error handling scenarios in AWS Step Functions and provide practical examples to illustrate how to manage them.

Why Error Handling is Important

Error handling ensures your workflows can gracefully handle failures and continue processing without manual intervention. This not only improves the reliability of your applications but also enhances user experience by minimizing downtime and reducing the likelihood of data corruption.

Types of Errors in AWS Step Functions

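States.ALL: A wildcard that matches any known error name. It must appear alone in its ErrorEquals array and only in the last Retry or Catch entry. It does not match States.DataLimitExceeded or States.Runtime.
States.Timeout: Raised when a Task state runs longer than its TimeoutSeconds value, or fails to send a heartbeat within HeartbeatSeconds.
States.TaskFailed: Raised when a Task state fails during execution. In a Retry or Catch, it matches any error except States.Timeout.
States.Permissions: Raised when a Task state lacks the IAM privileges needed to run the specified code.
States.ResultPathMatchFailure: Raised when a state's ResultPath cannot be applied to the input the state received.
States.BranchFailed: Raised when a branch of a Parallel state fails.
States.NoChoiceMatched: Raised when no rule in a Choice state matches the input and no Default is defined.
States.ParameterPathFailure: Raised when a path in a Parameters field (a key ending in .$) cannot be resolved against the input.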

Error Handling Strategies

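Retry: Automatically retry a failed state, with a configurable interval, maximum number of attempts, and backoff rate.
Catch: Capture errors that remain after retries are exhausted and redirect execution to a recovery path.
Timeout: Set TimeoutSeconds to cap how long a state may run. Exceeding it raises States.Timeout.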

Example Workflow

Let's create a Step Functions workflow with a few states to illustrate error handling. Our example will include a Lambda function that might fail, and we'll handle errors using retry and catch mechanisms.
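
For reference, the Lambda function behind the Task state could look something like the sketch below. It is only an illustration: the TransientServiceError class and the failure_rate and order_id input fields are hypothetical names, not part of any AWS API. For Python functions, the exception class name becomes the error name that Step Functions sees, so a Retry or Catch entry could match "TransientServiceError" directly as well as States.ALL.

```python
import random


class TransientServiceError(Exception):
    """Hypothetical error type for a failure that is worth retrying."""


def lambda_handler(event, context):
    # "failure_rate" is an assumed input field, used only for this demo.
    failure_rate = float(event.get("failure_rate", 0.5))
    if random.random() < failure_rate:
        # The exception class name is reported as the error name.
        raise TransientServiceError("Simulated downstream failure")
    return {"status": "ok", "order_id": event.get("order_id")}
```

Passing {"failure_rate": 1.0} as the execution input forces a failure on every attempt, which is a convenient way to watch the retry and catch paths run.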

State Machine Graph

Step Function Definition



{
  "Comment": "A simple state machine to demonstrate error handling",
  "StartAt": "Invoke Lambda",
  "States": {
    "Invoke Lambda": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:YOUR_LAMBDA_FUNCTION",
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "Handle Error"
        }
      ],
      "End": true
    },
    "Handle Error": {
      "Type": "Fail",
      "Error": "LambdaFunctionFailed",
      "Cause": "The Lambda function encountered an error."
    }
  }
}

Error Handling Scenarios

1. Retrying Failed States

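The Retry field allows you to retry a failed state. In the example above, the state is retried up to 3 times after the initial attempt. The wait before each retry starts at IntervalSeconds and is multiplied by BackoffRate each time, giving waits of 2, 4, and 8 seconds. If the state still fails after the last retry, the Catch field (if present) takes over.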



"Retry": [
  {
    "ErrorEquals": ["States.ALL"],
    "IntervalSeconds": 2,
    "MaxAttempts": 3,
    "BackoffRate": 2.0
  }
]
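
To see what these numbers mean in practice, here is a small Python sketch that computes the wait before each retry. It is a simplified model that ignores optional settings such as MaxDelaySeconds and JitterStrategy, and the function name is made up for this post.

```python
def retry_wait_schedule(interval_seconds, max_attempts, backoff_rate):
    """Return the wait (in seconds) before each retry attempt."""
    # The n-th retry (counting from 0) waits interval * backoff_rate ** n.
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


waits = retry_wait_schedule(interval_seconds=2, max_attempts=3, backoff_rate=2.0)
print(waits)       # [2.0, 4.0, 8.0]
print(sum(waits))  # 14.0
```

With the settings above, a task that keeps failing spends about 14 seconds waiting between attempts, on top of the run time of the four attempts themselves.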

2. Catching Errors

The Catch field enables you to capture errors and redirect the workflow to a different state, like an error handler or a fallback mechanism.



"Catch": [
  {
    "ErrorEquals": ["States.ALL"],
    "Next": "Handle Error"
  }
]
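
When a catcher fires, the state named in Next receives an error output containing Error and Cause fields. By default this replaces the state input, but a catcher can also set ResultPath (for example "$.error") to keep the original input and attach the error details to it. The Python sketch below models that behaviour for top-level paths only. It is an approximation for illustration, not the Step Functions implementation.

```python
def catcher_output(state_input, error, cause, result_path=None):
    """Model the input handed to the state named in a catcher's Next field.

    Simplified: supports no ResultPath, "$", or a top-level path like "$.error".
    """
    error_output = {"Error": error, "Cause": cause}
    if result_path is None or result_path == "$":
        # Default: the error output replaces the state input entirely.
        return error_output
    if not result_path.startswith("$.") or "." in result_path[2:]:
        raise ValueError("this sketch handles only top-level paths")
    merged = dict(state_input)  # copy so the original input is not modified
    merged[result_path[2:]] = error_output
    return merged


original = {"order_id": "A1"}
print(catcher_output(original, "States.Timeout", "Task timed out"))
print(catcher_output(original, "States.Timeout", "Task timed out", "$.error"))
```

Keeping the original input is useful when the handler state needs to know which item failed, for example to send a notification or write to a dead-letter queue.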

3. Handling Timeouts

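You can set TimeoutSeconds on a Task state to prevent it from running indefinitely. If the task runs longer than this value, the state fails with States.Timeout, which Retry and Catch entries can match like any other error.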



{
  "Type": "Task",
  "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:YOUR_LAMBDA_FUNCTION",
  "TimeoutSeconds": 10
}

Advanced Error Handling

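1. Conditional Error Handling with Multiple Catchers

You can list several catchers in a Catch field to route each error type to its own handler state. Catchers are evaluated in order, and the first one whose ErrorEquals matches the error wins.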



"Catch": [
  {
    "ErrorEquals": ["States.Timeout"],
    "Next": "TimeoutHandler"
  },
  {
    "ErrorEquals": ["States.TaskFailed"],
    "Next": "TaskFailedHandler"
  }
]

Benefits of Conditional Error Handling

Granular Control: Allows you to define different handling strategies for different error types, improving the robustness of your workflow.
Improved Debugging: By routing specific errors to distinct states, you can more easily identify and address issues.
Customised Recovery: Enables tailored recovery actions or notifications based on the nature of the error.
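
The routing performed by a list of catchers can be sketched in a few lines of Python. This is a simplified model based on the documented matching rules (States.ALL matches everything, States.TaskFailed matches everything except States.Timeout). It leaves out special cases such as States.Runtime, and the function names are made up for this post.

```python
def matches(error_equals, error_name):
    """Simplified model of ErrorEquals matching for one catcher."""
    if "States.ALL" in error_equals:
        return True
    # States.TaskFailed acts as a wildcard for everything except timeouts.
    if "States.TaskFailed" in error_equals and error_name != "States.Timeout":
        return True
    return error_name in error_equals


def route_error(catchers, error_name):
    """Return the Next state of the first matching catcher, or None."""
    for catcher in catchers:
        if matches(catcher["ErrorEquals"], error_name):
            return catcher["Next"]
    return None


catchers = [
    {"ErrorEquals": ["States.Timeout"], "Next": "TimeoutHandler"},
    {"ErrorEquals": ["States.TaskFailed"], "Next": "TaskFailedHandler"},
]
print(route_error(catchers, "States.Timeout"))     # TimeoutHandler
print(route_error(catchers, "States.TaskFailed"))  # TaskFailedHandler
```

Because the first match wins, specific catchers belong before general ones.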

State Machine Graph

Step Function Definition



{
  "Comment": "A simple state machine to demonstrate error handling including timeout",
  "StartAt": "Invoke Lambda",
  "States": {
    "Invoke Lambda": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:YOUR_LAMBDA_FUNCTION",
      "TimeoutSeconds": 5,
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.Timeout"],
          "Next": "Handle Timeout"
        },
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "Handle Error"
        }
      ],
      "End": true
    },
    "Handle Timeout": {
      "Type": "Fail",
      "Error": "LambdaTimeoutError",
      "Cause": "The Lambda function timed out."
    },
    "Handle Error": {
      "Type": "Fail",
      "Error": "LambdaFunctionFailed",
      "Cause": "The Lambda function encountered an error."
    }
  }
}
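
Before deploying a definition like this, it can help to run a quick sanity check on its error handling. The sketch below is a hypothetical helper, not an AWS tool. It loads nothing from AWS and only inspects a definition that has already been parsed into a Python dict. It flags Task states without TimeoutSeconds, and Retry or Catch entries where States.ALL is not alone or not last. It does not look inside Parallel branches or handle TimeoutSecondsPath.

```python
def lint_error_handling(definition):
    """Return a list of warnings for a state machine definition (a dict)."""
    warnings = []
    for name, state in definition.get("States", {}).items():
        if state.get("Type") == "Task" and "TimeoutSeconds" not in state:
            warnings.append(f"{name}: Task state has no TimeoutSeconds")
        for field in ("Retry", "Catch"):
            rules = state.get(field, [])
            for index, rule in enumerate(rules):
                errors = rule.get("ErrorEquals", [])
                if "States.ALL" not in errors:
                    continue
                if len(errors) > 1:
                    warnings.append(
                        f"{name}: States.ALL must appear alone in {field}[{index}]"
                    )
                if index != len(rules) - 1:
                    warnings.append(
                        f"{name}: States.ALL must be the last entry in {field}"
                    )
    return warnings


example = {
    "States": {
        "Invoke Lambda": {
            "Type": "Task",
            "Catch": [
                {"ErrorEquals": ["States.ALL"], "Next": "Handle Error"},
                {"ErrorEquals": ["States.Timeout"], "Next": "Handle Timeout"},
            ],
        }
    }
}
for warning in lint_error_handling(example):
    print(warning)
```

In this example the States.Timeout catcher could never be reached, because the States.ALL catcher in front of it matches first.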

2. Parallel State Error Handling

For workflows with parallel states, each branch can have its own error handling strategy.

Parallel Tasks State:
- The Parallel state starts two branches: "Invoke Lambda A" and "Invoke Lambda B".
- Each branch handles retries, timeouts, and failures independently.
Error Handling in Each Branch:
- Retry: Retries the task up to 3 times with exponential backoff if it fails.
- Timeout: If a task still times out after its retries are exhausted, it transitions to a specific error handler.
- Catch: Captures any other errors and transitions to an error handler.
Error Handling for Parallel State:
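- The Catch block in the Parallel state catches errors from any branch and transitions to the "Handle Parallel Failure" state. If any branch fails, including by reaching a Fail state, the whole Parallel state fails and the other branches are stopped.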

State Machine Graph

Step Function Definition



{
  "Comment": "A state machine with parallel tasks and error handling",
  "StartAt": "Parallel Tasks",
  "States": {
    "Parallel Tasks": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "Invoke Lambda A",
          "States": {
            "Invoke Lambda A": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:LambdaFunctionA",
              "TimeoutSeconds": 5,
              "Retry": [
                {
                  "ErrorEquals": [
                    "States.ALL"
                  ],
                  "IntervalSeconds": 2,
                  "MaxAttempts": 3,
                  "BackoffRate": 2.0
                }
              ],
              "Catch": [
                {
                  "ErrorEquals": [
                    "States.Timeout"
                  ],
                  "Next": "Handle Timeout A"
                },
                {
                  "ErrorEquals": [
                    "States.ALL"
                  ],
                  "Next": "Handle Error A"
                }
              ],
              "End": true
            },
            "Handle Timeout A": {
              "Type": "Fail",
              "Error": "LambdaTimeoutErrorA",
              "Cause": "Lambda Function A timed out."
            },
            "Handle Error A": {
              "Type": "Fail",
              "Error": "LambdaFunctionFailedA",
              "Cause": "Lambda Function A failed."
            }
          }
        },
        {
          "StartAt": "Invoke Lambda B",
          "States": {
            "Invoke Lambda B": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:LambdaFunctionB",
              "TimeoutSeconds": 5,
              "Retry": [
                {
                  "ErrorEquals": [
                    "States.ALL"
                  ],
                  "IntervalSeconds": 2,
                  "MaxAttempts": 3,
                  "BackoffRate": 2.0
                }
              ],
              "Catch": [
                {
                  "ErrorEquals": [
                    "States.Timeout"
                  ],
                  "Next": "Handle Timeout B"
                },
                {
                  "ErrorEquals": [
                    "States.ALL"
                  ],
                  "Next": "Handle Error B"
                }
              ],
              "End": true
            },
            "Handle Timeout B": {
              "Type": "Fail",
              "Error": "LambdaTimeoutErrorB",
              "Cause": "Lambda Function B timed out."
            },
            "Handle Error B": {
              "Type": "Fail",
              "Error": "LambdaFunctionFailedB",
              "Cause": "Lambda Function B failed."
            }
          }
        }
      ],
      "Catch": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "Handle Parallel Failure"
        }
      ],
      "End": true
    },
    "Handle Parallel Failure": {
      "Type": "Fail",
      "Error": "ParallelStateFailed",
      "Cause": "One or more parallel tasks failed."
    }
  }
}
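
The two branches above differ only in their names, so one way to avoid copy-paste mistakes is to generate the definition from code. The Python sketch below rebuilds the same state machine with a small helper. The make_branch function is a made-up name for this post, and REGION and ACCOUNT_ID remain placeholders that you would replace.

```python
import json


def make_branch(suffix, function_name):
    """Build one Parallel branch with retry, timeout, and catch handling."""
    task = f"Invoke Lambda {suffix}"
    on_timeout = f"Handle Timeout {suffix}"
    on_error = f"Handle Error {suffix}"
    return {
        "StartAt": task,
        "States": {
            task: {
                "Type": "Task",
                "Resource": f"arn:aws:lambda:REGION:ACCOUNT_ID:function:{function_name}",
                "TimeoutSeconds": 5,
                "Retry": [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 2,
                           "MaxAttempts": 3, "BackoffRate": 2.0}],
                "Catch": [
                    {"ErrorEquals": ["States.Timeout"], "Next": on_timeout},
                    {"ErrorEquals": ["States.ALL"], "Next": on_error},
                ],
                "End": True,
            },
            on_timeout: {"Type": "Fail", "Error": f"LambdaTimeoutError{suffix}",
                         "Cause": f"Lambda Function {suffix} timed out."},
            on_error: {"Type": "Fail", "Error": f"LambdaFunctionFailed{suffix}",
                       "Cause": f"Lambda Function {suffix} failed."},
        },
    }


definition = {
    "Comment": "A state machine with parallel tasks and error handling",
    "StartAt": "Parallel Tasks",
    "States": {
        "Parallel Tasks": {
            "Type": "Parallel",
            "Branches": [make_branch("A", "LambdaFunctionA"),
                         make_branch("B", "LambdaFunctionB")],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Handle Parallel Failure"}],
            "End": True,
        },
        "Handle Parallel Failure": {"Type": "Fail", "Error": "ParallelStateFailed",
                                    "Cause": "One or more parallel tasks failed."},
    },
}

# Print the JSON that can be pasted into the Step Functions console.
print(json.dumps(definition, indent=2))
```

Adding a third branch then becomes a one-line change to the Branches list.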

Effective error handling in AWS Step Functions is crucial for building resilient workflows. By leveraging retry, catch, and timeout strategies, you can ensure your workflows handle failures gracefully and continue processing without manual intervention. With these techniques, you can build robust and reliable applications that can withstand various failure scenarios.

Do you have any questions or additional error handling scenarios you'd like to explore? Let me know in the comments below! Happy coding in AWS!

Top comments (2)

Pranit Raje • Aug 2 '24

Nice article!

Enrico Mazzarella • Aug 7 '24

I would like to deep dive a similar example using AWS Batch job instead of a Lambda function.
