Arpad Toth for AWS Community Builders

Posted on • Originally published at arpadt.com

Customizing error handling in Step Functions

We can customize error handling in Step Functions using the built-in Retry and Catch fields. By transferring the error-handling responsibility to Step Functions, we can write shorter and cleaner code in our applications.

1. Error handling without Step Functions

Say we want to create a workflow that consists of multiple Lambda functions and maybe some other AWS services.

We can choose from several orchestration methods, and some of them suit the job better than others.

Either way, we should handle errors, and we usually do that in the application code by writing something like this:

export async function handler(event) {
  const { taskId } = event;
  // get the task based on its ID from a 3rd party endpoint - pseudo code:
  let task;
  try {
    task = await getTaskById(taskId);
    return task;
  } catch (error) {
    logger.error('error while getting task', { taskId });
    throw error;
  }
}

When we use a queue that the function polls for messages, we usually apply a retry mechanism. After an error has occurred, the function will attempt to get the task from the endpoint again with the same payload or parameters. If we are lucky, the call will succeed on the second or third attempt.

But what if the task is not found and the 3rd party endpoint responds with a 404 error? Or the validation for the event parameters fails?

The function can try again as many times as we want, but the result will always be the same: an error. Does it make sense to try it again with the same input? Probably not.

In this case, we can check if the error has a specific status code or name. A possible approach can be like this:

class NotFoundError extends Error {
  constructor(message) {
    super(message);
    this.name = 'NotFoundError';
  }
}

async function getTaskById(taskId) {
  try {
    const task = await getTaskFromThirdParty(taskId);
    return task;
  } catch (error) {
    if (error.statusCode === 404) {
      throw new NotFoundError(`Task with ID ${taskId} is not found.`)
    }

    // otherwise rethrow the original error
    throw error
  }
}

export async function handler(event) {
  const { taskId } = event;

  let task;
  try {
    task = await getTaskById(taskId);
    return task;
  } catch (error) {
    if (error instanceof NotFoundError) {
      logger.error('task not found, no point trying again');
      return;
    }

    // We'll retry when getting other errors
    logger.error('error while getting task', { taskId });
    throw error;
  }
}

This code gets ugly and hard to read with the if condition in the catch block. Imagine if we have to handle multiple types of errors differently. It might result in even more complicated code.

2. Error handling with Step Functions

If we have to coordinate multiple function calls, we can use AWS Step Functions to orchestrate the workflow. Step Functions integrates with many other AWS services, but here I'll focus on Lambda functions.

2.1. Delegate the error handling

One of the many great features of Step Functions is that we can lift error handling from the code to the state machine.

We can create a Task state from which Step Functions can invoke a Lambda function. The Workflow Studio makes it easy (ok, easier) to construct a state machine with its great drag-and-drop tool.

When we add a Lambda function to the Task state, we'll get a Retry field in the UI and the corresponding ASL definition (Amazon States Language, basically JSON with some specific features). It looks like this:

"Retry": [
  {
    "ErrorEquals": [
      "States.TaskFailed"
    ],
    "BackoffRate": 2,
    "IntervalSeconds": 2,
    "MaxAttempts": 3,
    "Comment": "Some interesting comments about the field"
  }
]

2.2. Retry fields

This is where we can specify how we want Step Functions to process the retries.

The ErrorEquals array contains the error types for which the given retry object is relevant. States.TaskFailed includes almost all errors that can occur while a task is running, for example, exceptions we throw in the application, network or runtime errors.

So when an error occurs while the Lambda function runs, we'll try three more times (MaxAttempts). IntervalSeconds and BackoffRate will set up an exponential backoff pattern. The first retry will occur in 2 seconds (IntervalSeconds), then Step Functions will double the interval between each retry ("BackoffRate": 2) until it reaches the maximum number of attempts.
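To make the timing concrete, here is a minimal sketch of the arithmetic behind these three fields. The `retryDelays` helper is hypothetical (it is not part of any AWS SDK) and ignores any optional retry settings; it only shows how the wait before each retry grows.

```javascript
// Hypothetical helper: computes the wait (in seconds) before each retry attempt.
// The defaults mirror the documented ASL defaults for these fields.
function retryDelays({ IntervalSeconds = 1, BackoffRate = 2, MaxAttempts = 3 }) {
  const delays = [];
  for (let attempt = 0; attempt < MaxAttempts; attempt++) {
    // Each retry waits BackoffRate times longer than the previous one.
    delays.push(IntervalSeconds * BackoffRate ** attempt);
  }
  return delays;
}

console.log(retryDelays({ IntervalSeconds: 2, BackoffRate: 2, MaxAttempts: 3 }));
// → [ 2, 4, 8 ]
```

With the values from the Retry object above, the three retries would start roughly 2, 4 and 8 seconds after the preceding failure.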

2.3. Handling custom errors

As I mentioned above, not all types of errors need a retry. We can add another error-handling object to the Retry array like this:

"Retry": [
  {
    "ErrorEquals": [
      "NotFoundError"
    ],
    "BackoffRate": 1,
    "IntervalSeconds": 0,
    "MaxAttempts": 0,
    "Comment": "No retries on NotFoundError"
  },
  {
    "ErrorEquals": [
      "States.TaskFailed"
    ],
    "BackoffRate": 2,
    "IntervalSeconds": 2,
    "MaxAttempts": 3,
    "Comment": "Some interesting comments about the field"
  }
]

We set IntervalSeconds and MaxAttempts to 0. A MaxAttempts value of 0 means we don't want Step Functions to try invoking the Lambda function again. BackoffRate has a mandatory minimum value of 1, so we can leave it as is.

The Retry element is an array, and Step Functions will evaluate its objects in order and apply the first one that matches. So we should write our custom errors first, and States.TaskFailed, which matches almost any other error (see Further reading for exceptions), will come last.
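The effect of the ordering can be illustrated with a small sketch. The `findRetrier` helper below is hypothetical and deliberately simplified: it treats States.ALL and States.TaskFailed as plain catch-alls, which glosses over the exceptions mentioned in Further reading.

```javascript
// Hypothetical helper: returns the first retrier whose ErrorEquals matches.
// Simplification: States.ALL and States.TaskFailed are treated as catch-alls.
const WILDCARDS = ["States.ALL", "States.TaskFailed"];

function findRetrier(retriers, errorName) {
  return retriers.find(({ ErrorEquals }) =>
    ErrorEquals.some((name) => name === errorName || WILDCARDS.includes(name))
  );
}

const retry = [
  { ErrorEquals: ["NotFoundError"], MaxAttempts: 0 },
  { ErrorEquals: ["States.TaskFailed"], IntervalSeconds: 2, BackoffRate: 2, MaxAttempts: 3 },
];

console.log(findRetrier(retry, "NotFoundError").MaxAttempts); // → 0
console.log(findRetrier(retry, "TypeError").MaxAttempts); // → 3

// With the order reversed, the generic retrier wins and NotFoundError is retried.
console.log(findRetrier([...retry].reverse(), "NotFoundError").MaxAttempts); // → 3
```

The last call shows why the specific error has to come first: once the generic object sits at the top of the array, the NotFoundError object is never reached.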

2.4. Catch fields

If all retries have been exhausted and there is still an error at some point in the workflow, the state machine execution will fail. This can be desirable in many situations, but there might be use cases when we want to handle the exception.

For example, we want to notify the team in case of a NotFoundError. We want to process something that doesn't exist, and we'll need to fix it!

Or, we want to skip the failed step and are happy to continue with the rest of the workflow.

In these cases, we'll have to catch the errors.

One possible solution is to wrap side effects in try/catch blocks in the code as we saw above. But we can delegate this responsibility to Step Functions, which comes with a built-in Catch field.

Catch is at the same level in the state machine definition as Retry. Like its friend Retry, it is also an array.

It can look like this:

"Catch": [
  {
    "ErrorEquals": [
      "NotFoundError"
    ],
    "Comment": "Handling not found errors",
    "Next": "Send email notification",
    "ResultPath": "$.error"
  }
]

The ErrorEquals array is the same as above in Retry, and we specify the errors we want to catch with the custom handler here.

The Next field contains the next state after Step Functions has caught any NotFoundError. In this case, it'll be the Send email notification state, which could be anything reasonable for the use case, for example, invoking another Lambda function or publishing a message to an SNS topic for the notification.

We can add the error object to the error property in the state output (ResultPath). If we do so, the state that throws the NotFoundError exception will have an output similar to this:

{
  "taskId": "TASK_ID",
  // ... other input properties here
  "error": {
    "Error": "NotFoundError",
    "Cause": "{\"errorType\":\"NotFoundError\",\"errorMessage\":\"Task \\
    with ID TASK_ID is not found\",\"trace\":[\"NotFoundError: Task with \\
    ID TASK_ID is not found\",\"    at Runtime.handler \\
    (file:///var/task/index.mjs:12:11)\",\" at Runtime.handleOnceNonStreaming \\
    (file:///var/runtime/index.mjs:1086:29)\"]}"
  }
}

Of course, this JSON object will be the next state's input, in our case, Send email notification. This state is supposed to send an email with the error message to the team members.
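Note that Cause arrives as a JSON string inside the JSON input, so the notification step has to parse it a second time to reach the error message. A minimal sketch, assuming the input shape shown above; `buildNotification` and the `subject`/`message` fields are hypothetical names chosen for illustration:

```javascript
// Hypothetical helper for the notification step: builds a readable message
// from the state input that Catch produced with "ResultPath": "$.error".
function buildNotification(input) {
  const { Error: errorName, Cause } = input.error;
  let details;
  try {
    // For Lambda errors, Cause is a JSON string, so it needs a second parse.
    details = JSON.parse(Cause);
  } catch {
    // Fall back to the raw text if Cause is not valid JSON.
    details = { errorMessage: Cause };
  }
  return {
    subject: `Task ${input.taskId} failed with ${errorName}`,
    message: details.errorMessage,
  };
}

// Sample input modeled on the state output above (stack trace omitted).
const sampleInput = {
  taskId: "TASK_ID",
  error: {
    Error: "NotFoundError",
    Cause: JSON.stringify({
      errorType: "NotFoundError",
      errorMessage: "Task with ID TASK_ID is not found",
    }),
  },
};

console.log(buildNotification(sampleInput));
```

The fallback branch is there because Cause is not guaranteed to be JSON for every error type, so the helper degrades to the raw string instead of throwing.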

We can add more objects to Catch or have just one with States.TaskFailed or States.ALL to run custom logic when (almost) any type of error occurs.

3. Putting everything together

Our super simple state machine looks like this:

State machine with custom error handling

When the first state (Get task from 3rd party) throws a NotFoundError, Step Functions will send an email notification with the error message to the subscribers before the execution fails. Don't forget to allow the SNS:Publish action in the state machine's IAM role if you choose to send a notification.

If no errors occur, Step Functions will Do something with the task in the next state.

If any other error occurs in either step, the execution will fail.
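For reference, the workflow described above could be defined roughly as follows. This is a sketch written as a JavaScript object that serializes to ASL JSON, not the exact definition behind the diagram: the ARNs are placeholders, the "Execution failed" state name is an assumption, and the `danglingTargets` helper is a hypothetical sanity check.

```javascript
// Sketch of the whole state machine. ARNs are placeholders, and the
// "Execution failed" state name is an assumption.
const definition = {
  StartAt: "Get task from 3rd party",
  States: {
    "Get task from 3rd party": {
      Type: "Task",
      Resource: "arn:aws:states:::lambda:invoke",
      Parameters: { FunctionName: "GET_TASK_FUNCTION_ARN", "Payload.$": "$" },
      Retry: [
        { ErrorEquals: ["NotFoundError"], MaxAttempts: 0 },
        { ErrorEquals: ["States.TaskFailed"], IntervalSeconds: 2, BackoffRate: 2, MaxAttempts: 3 },
      ],
      Catch: [
        { ErrorEquals: ["NotFoundError"], Next: "Send email notification", ResultPath: "$.error" },
      ],
      Next: "Do something with the task",
    },
    "Do something with the task": {
      Type: "Task",
      Resource: "arn:aws:states:::lambda:invoke",
      Parameters: { FunctionName: "PROCESS_TASK_FUNCTION_ARN", "Payload.$": "$" },
      End: true,
    },
    "Send email notification": {
      Type: "Task",
      Resource: "arn:aws:states:::sns:publish",
      Parameters: { TopicArn: "NOTIFICATION_TOPIC_ARN", "Message.$": "$.error.Cause" },
      Next: "Execution failed",
    },
    "Execution failed": { Type: "Fail", Error: "NotFoundError" },
  },
};

// Sanity check: every transition must point at a state that exists.
function danglingTargets({ StartAt, States }) {
  const targets = [StartAt];
  for (const state of Object.values(States)) {
    if (state.Next) targets.push(state.Next);
    for (const catcher of state.Catch ?? []) targets.push(catcher.Next);
  }
  return targets.filter((name) => !(name in States));
}

console.log(danglingTargets(definition)); // → []
console.log(JSON.stringify(definition, null, 2)); // the ASL JSON to paste or deploy
```

Only the first state declares Retry and Catch here, so an error in Do something with the task fails the execution directly, as described above.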

A great benefit of using the approach discussed above is that our Lambda function handler becomes very simple:

export async function handler(event) {
  const { taskId } = event;

  const task = await getTaskById(taskId);
  return task;
}

That's it! Step Functions will handle the rest!

4. Summary

We can delegate error handling to Step Functions instead of creating complex architectures and writing hard-to-read code.

The Retry field specifies how we want Step Functions to handle retries if an error occurs in the state.

We can use the Catch field to run some custom logic when specific error types occur in a previous state. The Next property in the object will define which state should run after Step Functions has caught the errors.

5. Further reading

Error handling in Step Functions - States.TaskFailed doesn't cover all errors, here are the exceptions along with all other options

Invoke Lambda with Step Functions - The title says it all

Call Amazon SNS with Step Functions - How to configure Step Functions to publish messages to topics
