CDK Patterns

Learn "the saga stepfunction" pattern today - Single Table DynamoDB, Lambdas, Step Function and API Gateway

Matt Coulter (nideveloper) ・6 min read

YouTube Walkthrough

Getting The Code

This is available on cdkpatterns.com or you can run the following commands:

// Typescript version
npx cdkp init the-saga-stepfunction

// Python version
npx cdkp init the-saga-stepfunction --lang=python

The Saga Step Function

This is a pattern that I found via Yan Cui and his 2017 Blog Post.

After doing research I found some other references:

What Is The Saga Pattern?

Hector Garcia-Molina described it in his paper as follows:

Long lived transactions (LLTs) hold on to database resources for relatively long periods of time, significantly delaying the termination of shorter and more common transactions. To alleviate these problems we propose the notion of a saga.

A LLT is a saga if it can be written as a sequence of transactions that can be interleaved with other transactions. The database management system guarantees that either all the transactions in a saga are successfully completed or compensating transactions are run to amend a partial execution.

You can think of a complete transaction as being made up of a series of smaller tasks. All of these tasks must succeed for us to call the transaction a success.

Caitie uses a holiday booking example to demonstrate this, which Yan elaborated on, so let's continue the trend. If you are booking a holiday, let's say you need at a minimum:

  • To Book Flights
  • To Book A Hotel
  • To Pay

You wouldn't be very happy if you booked a holiday and then found out when you landed that you had a reservation at the hotel, but an error occurred with payment so they gave it away. The saga pattern forces you to have a compensating action for that payment error: either you have some other payment selection process, or you roll back the whole booking and ask the customer to try again.

Every action must have a corresponding reaction for error. Note that the reaction cannot always be equal, as Caitie points out: if one of the actions was to send an email, you cannot undo that send, but you can send a follow-up to say it was an error.

If we assume from this point we will roll back when an error hits then the flow might look something like:

Success

This flows as you might expect - we reserve a room in the hotel, a spot on the plane, take the payment, then confirm the booking with the airline and hotel. Finally we notify the customer that it was a successful booking.

Failure

If our payment fails after reserving the flight and hotel, then we need to release those reservations and notify the customer that the booking failed.

Notice how it peels back the layers rather than doing one massive compensation step. It runs the cancel steps in reverse order until the system is back to the way it was before we started.

If the first ReserveHotel task had failed, the only difference is the number of Cancel tasks that run.
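
The reverse-order rollback described above can be sketched in a few lines of JavaScript. This is a minimal in-memory illustration, not the pattern's actual code: runSaga, makeStep and the step objects are hypothetical names. It assumes every cancel step is safe to run even when its matching action only partly completed, which is why the failed step's own cancel runs first.

```javascript
// Illustrative in-memory saga runner. Step names mirror the article;
// runSaga and makeStep are hypothetical helpers, not the pattern's real code.
async function runSaga(steps, context) {
  const started = [];
  for (const step of steps) {
    started.push(step);
    try {
      await step.action(context);
    } catch (err) {
      // Peel back the layers: run the cancel steps in reverse order,
      // starting with the cancel for the step that just failed.
      for (const undo of started.reverse()) {
        await undo.compensate(context);
      }
      return { status: 'rolled_back', failedStep: step.name };
    }
  }
  return { status: 'succeeded' };
}

// Builds a step that records what it does in context.log.
function makeStep(name, cancelName, shouldFail = false) {
  return {
    name,
    action: async (ctx) => {
      if (shouldFail) throw new Error(`${name} failed`);
      ctx.log.push(name);
    },
    compensate: async (ctx) => {
      ctx.log.push(cancelName);
    },
  };
}
```

Running it with ReserveHotel, ReserveFlight and a failing TakePayment records ReserveHotel, ReserveFlight, CancelPayment, CancelFlightReservation, CancelHotelReservation, in that order. If ReserveHotel is the step that fails, only its own cancel runs.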

What Does The Saga Step Function Look Like?

We have an API Gateway connected to a Lambda through a {proxy+} setup. This Lambda starts a Step Function workflow representing the flows above. Eight Lambdas inside that workflow communicate with one DynamoDB table to complete a travel booking transaction.

Saga Lambda and Step Function Execution

The Saga Lambda is a function that takes its input from the query parameters in the URL and passes them to a Step Function execution. The data passed to the Step Function looks like:

let input = {
        "trip_id": tripID, //taken from queryParams
        "depart": "London",
        "depart_at": "2021-07-10T06:00:00.000Z",
        "arrive": "Dublin",
        "arrive_at": "2021-07-12T08:00:00.000Z",
        "hotel": "holiday inn",
        "check_in": "2021-07-10T12:00:00.000Z",
        "check_out": "2021-07-12T14:00:00.000Z",
        "rental": "Volvo",
        "rental_from": "2021-07-10T00:00:00.000Z",
        "rental_to": "2021-07-12T00:00:00.000Z",
        "run_type": runType //taken from queryParams
    };
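
As a rough sketch of how such a Lambda could turn the query parameters into that input and start the execution: the names buildSagaInput, makeHandler and DEFAULT_TRIP_ID, and both default values, are assumptions made for illustration. The client is passed in so the example runs without AWS; its call shape follows startExecution(...).promise() from the AWS SDK for JavaScript v2. Most of the fixed fields are left out for brevity.

```javascript
// Hypothetical default used when no tripID query parameter is supplied.
const DEFAULT_TRIP_ID = 'example-trip-id';

// Build the state machine input from an API Gateway proxy event.
function buildSagaInput(event) {
  const query = event.queryStringParameters || {};
  return {
    trip_id: query.tripID || DEFAULT_TRIP_ID,
    depart: 'London',
    arrive: 'Dublin',
    hotel: 'holiday inn',
    // ...remaining fixed fields from the listing above omitted for brevity
    run_type: query.runType || 'success', // 'success' is an assumed default
  };
}

// Returns a Lambda handler that starts one execution per request.
function makeHandler(stepFunctionsClient, stateMachineArn) {
  return async function handler(event) {
    const input = buildSagaInput(event);
    const result = await stepFunctionsClient
      .startExecution({ stateMachineArn, input: JSON.stringify(input) })
      .promise();
    return {
      statusCode: 200,
      body: JSON.stringify({ executionArn: result.executionArn }),
    };
  };
}
```

Injecting the client keeps the input-building logic easy to test on its own. In a deployed function you would pass in a real Step Functions client and the state machine ARN from the environment.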

Lambdas Inside Our Step Function

| Function | Description |
| --- | --- |
| Reserve Hotel | Inserts a record into DynamoDB for our hotel booking with a transaction_status of pending |
| Reserve Flight | Inserts a record into DynamoDB for our flight booking with a transaction_status of pending |
| Cancel Hotel Reservation | Deletes the record from DynamoDB for our pending hotel booking |
| Cancel Flight Reservation | Deletes the record from DynamoDB for our pending flight booking |
| Take Payment | Inserts a record into DynamoDB for the payment |
| Cancel Payment | Deletes the record from DynamoDB for the payment |
| Confirm Hotel | Updates the record in DynamoDB for transaction_status to confirmed |
| Confirm Flight | Updates the record in DynamoDB for transaction_status to confirmed |

Error Handling and Retry Logic

If an error occurs in any of the reserve tasks, confirm tasks or the take payment task (either a real error or one you trigger manually through the runType parameter), Step Function catch logic routes the execution to the appropriate cancel task.
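
To make the manual trigger concrete, here is a minimal sketch of what a reserve task might look like. It assumes the task reads the run_type field of the execution input. makeReserveHotel and the booking ID scheme are hypothetical, and the DynamoDB client is passed in (with the putItem(...).promise() call shape of the AWS SDK for JavaScript v2) so the example runs without AWS. The item is trimmed to its key fields.

```javascript
// Sketch of a reserve task; not the pattern's actual Lambda code.
function makeReserveHotel(dynamo, tableName) {
  return async function reserveHotel(event) {
    // Manual failure trigger, driven by the run_type field of the input.
    if (event.run_type === 'failHotelReservation') {
      throw new Error('Failed to reserve the hotel');
    }
    const hotelBookingID = `hotel-${event.trip_id}`; // hypothetical ID scheme
    await dynamo
      .putItem({
        TableName: tableName,
        Item: {
          pk: { S: event.trip_id },
          sk: { S: 'HOTEL#' + hotelBookingID },
          transaction_status: { S: 'pending' },
        },
      })
      .promise();
    return { booking_id: hotelBookingID };
  };
}
```

Throwing from the handler is what surfaces the failure to the state machine, which can then route to the matching cancel task.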

You also need to account for errors in the cancel functions. That is why there is a random fail trigger in each cancel function.

if (Math.random() < 0.4) {
    throw new Error("Internal Server Error");
}

To handle this, each cancel function has a built-in retry policy of 3 attempts as part of the Step Function definition.
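
For readers who have not seen catch and retry rules before, the fragment below sketches how they could look in Amazon States Language, written here as a JavaScript object. It is a cut-down illustration rather than the pattern's real definition: the state names, the placeholder Lambda ARNs and the ResultPath values are assumptions, and only the payment step is shown.

```javascript
// Cut-down state machine definition: one task with a Catch rule and one
// cancel task with a Retry rule. ARNs and state names are placeholders.
const definition = {
  StartAt: 'TakePayment',
  States: {
    TakePayment: {
      Type: 'Task',
      Resource: 'arn:aws:lambda:REGION:ACCOUNT_ID:function:takePayment',
      ResultPath: '$.TakePaymentResult',
      // Any error routes the execution to the compensating task.
      Catch: [
        {
          ErrorEquals: ['States.ALL'],
          ResultPath: '$.TakePaymentError',
          Next: 'CancelPayment',
        },
      ],
      Next: 'BookingSucceeded',
    },
    CancelPayment: {
      Type: 'Task',
      Resource: 'arn:aws:lambda:REGION:ACCOUNT_ID:function:cancelPayment',
      ResultPath: '$.CancelPaymentResult',
      // MaxAttempts counts retries made after the first failed attempt.
      Retry: [{ ErrorEquals: ['States.ALL'], MaxAttempts: 3 }],
      Next: 'BookingFailed',
    },
    BookingSucceeded: { Type: 'Succeed' },
    BookingFailed: {
      Type: 'Fail',
      Error: 'BookingFailed',
      Cause: 'Payment could not be taken',
    },
  },
};

// The JSON form is what a state machine definition is stored as.
const definitionJson = JSON.stringify(definition, null, 2);
```

In a full saga each reserve, confirm and payment task would carry its own Catch rule, and the cancel tasks would chain to one another in reverse order.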

DynamoDB Table

We have 3 separate entities inside the one DynamoDB table. This was inspired by Alex DeBrie and his brilliant book. If you want to learn more about advanced single table DynamoDB patterns, it is worth a purchase.

The sort key on our table is overloaded with an entity prefix (HOTEL#, FLIGHT# or PAYMENT#) to allow us to effectively filter results.

Each record holds more attributes than just the keys. The data inserted for each record is as follows:

// Hotel Data Model
var params = {
  TableName: process.env.TABLE_NAME,
  Item: {
    'pk': {S: event.trip_id},
    'sk': {S: 'HOTEL#' + hotelBookingID},
    'trip_id': {S: event.trip_id},
    'type': {S: 'Hotel'},
    'id': {S: hotelBookingID},
    'hotel': {S: event.hotel},
    'check_in': {S: event.check_in},
    'check_out': {S: event.check_out},
    'transaction_status': {S: 'pending'}
  }
};

// Flights Data Model
var params = {
  TableName: process.env.TABLE_NAME,
  Item: {
    'pk': {S: event.trip_id},
    'sk': {S: 'FLIGHT#' + flightBookingID},
    'type': {S: 'Flight'},
    'trip_id': {S: event.trip_id},
    'id': {S: flightBookingID},
    'depart': {S: event.depart},
    'depart_at': {S: event.depart_at},
    'arrive': {S: event.arrive},
    'arrive_at': {S: event.arrive_at},
    'transaction_status': {S: 'pending'}
  }
};

// Payments Data Model
var params = {
  TableName: process.env.TABLE_NAME,
  Item: {
    'pk': {S: event.trip_id},
    'sk': {S: 'PAYMENT#' + paymentID},
    'type': {S: 'Payment'},
    'trip_id': {S: event.trip_id},
    'id': {S: paymentID},
    'amount': {S: "450.00"},
    'currency': {S: "USD"},
    'transaction_status': {S: "confirmed"}
  }
};
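
Because every item for a trip shares the same pk and the sk starts with the entity type, a single Query can fetch either the whole trip or just one kind of record. The helper below is a sketch of how those query parameters could be built. buildTripQuery is a hypothetical name and the pattern itself may not include such a query. The parameter shape is the one accepted by the low-level DynamoDB Query operation.

```javascript
// Build Query parameters for one trip, optionally narrowed to one entity
// type by matching the prefix of the overloaded sort key.
function buildTripQuery(tableName, tripId, entityPrefix) {
  const params = {
    TableName: tableName,
    KeyConditionExpression: 'pk = :pk',
    ExpressionAttributeValues: { ':pk': { S: tripId } },
  };
  if (entityPrefix) {
    params.KeyConditionExpression += ' AND begins_with(sk, :prefix)';
    params.ExpressionAttributeValues[':prefix'] = { S: entityPrefix };
  }
  return params;
}

// Everything stored for a trip, then only its hotel records.
const wholeTrip = buildTripQuery('trips-table', 'trip-42');
const hotelsOnly = buildTripQuery('trips-table', 'trip-42', 'HOTEL#');
```

The table name and trip ID here are made-up values. The resulting objects would be passed to the DynamoDB client's query call.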

How Do I Test This After Deployment?

After deployment you should have an API Gateway where any URL you hit triggers the Step Function to start.

You can manipulate the flow of the Step Function with a couple of URL parameters:

Successful Execution - https://{api gateway url}
Reserve Hotel Fail - https://{api gateway url}?runType=failHotelReservation
Confirm Hotel Fail - https://{api gateway url}?runType=failHotelConfirmation
Reserve Flight Fail - https://{api gateway url}?runType=failFlightsReservation
Confirm Flight Fail - https://{api gateway url}?runType=failFlightsConfirmation
Take Payment Fail - https://{api gateway url}?runType=failPayment

To insert multiple trips into DynamoDB, pass your own trip ID (by default the same ID is used on every execution):
https://{api gateway url}?tripID={whatever you want}
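
If you want to script these calls, a small helper can combine the two parameters. This is only a convenience sketch: buildTestUrl is a hypothetical name and the base URL in the example is a placeholder for your own API Gateway URL.

```javascript
// Build a test URL for the deployed API from the two supported parameters.
function buildTestUrl(baseUrl, { runType, tripID } = {}) {
  const url = new URL(baseUrl);
  if (runType) url.searchParams.set('runType', runType);
  if (tripID) url.searchParams.set('tripID', tripID);
  return url.toString();
}

// Placeholder base URL: replace with the one printed by your deployment.
const failingPaymentUrl = buildTestUrl('https://example.com/prod/', {
  runType: 'failPayment',
  tripID: 'trip-42',
});
```

Leaving both options out returns the base URL unchanged, which corresponds to the successful execution.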

It is important to note that the Cancel Lambdas all have a random failure built in and retry logic up to a max of 3. So if you see failures in the cancel Lambdas when you look at the execution of your Step Function in the AWS console, this is intentional. The reason is to teach you that the cancel logic should attempt to self-recover in the event of an error. Given that they only retry 3 times, it is still possible for the cancel process to fail on every attempt and the Step Function to terminate early.
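
As a rough worked example of how often that happens, assume each attempt fails independently with the 0.4 probability used in the random fail trigger. chanceCancelGivesUp is a hypothetical helper. Whether a cancel task gets three or four attempts in total depends on how the retry count is configured, since the MaxAttempts field in Step Functions counts retries after the first attempt, so both cases are shown.

```javascript
// Probability that every attempt of one cancel task fails, assuming
// independent attempts that each fail with the same probability.
function chanceCancelGivesUp(failureRate, totalAttempts) {
  return Math.pow(failureRate, totalAttempts);
}

const threeAttempts = chanceCancelGivesUp(0.4, 3); // 3 attempts in total
const fourAttempts = chanceCancelGivesUp(0.4, 4); // first try plus 3 retries
```

That works out to roughly 6.4% for three attempts and about 2.6% for four, per cancel task, so an early termination is uncommon but far from impossible across many test runs.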

To actually view what happened you will need to log into the AWS console and navigate to the step functions section where you can see every execution of your saga step function. You can also look inside the DynamoDB table at the records inserted. If you are fast enough with refresh you can watch them go from pending to confirmed status.

Discussion


Hi Matt, congrats on the article. It was clear and the companion source code and video were helpful.

While looking at AWS Step Functions, I was left with the impression that the service has some expectations that can limit its adoption, or force changes in how you would model your process.

For me, Sagas or the process Manager patterns are used when you have long running processes, which are distributed by nature, and want to have the transactional context. So the SEC or Process Manager is a state machine that reacts to the results from the previous state to move to the next one. This means triggering the execution of the next step.

If we look at the example you, Caitie and others used, there is a clear relationship. The Step Function provides this state machine and the Lambdas are the steps being executed.

All good so far, but what if the action being taken by this step is long-running on its own? Imagine that in an e-commerce solution you would have the steps of payment, shipment, and sending the email, in that order. While payment and sending emails can likely be executed in seconds, the shipment is a long-running step that could have a compensating aspect if, for example, the item is found to be out of stock or can't be sent (found to be damaged upon inspection).

In a more 'traditional' saga implementation, I would ask for the shipment to happen and simply go to sleep until I receive a message (event) with the ShipmentSent or ShipmentCancelled to resume and go to the next step.

With step functions it feels I would have to essentially break the process in two. One that would go payment, then trigger the shipment and stop. Another that would be triggered with the shipment event and have the compensating step for the payment or the successful path to send the email. And since they are different processes the state would have to be somehow recreated as I can't reuse from the previous process.

I wonder what would be your assessment on the subject because having this breakdown feels artificial but seems like the only solution if one still wants to use AWS step function.

regards

 

Pausing the step function flow until a longer running process or human approval process happens is totally possible. AWS just doesn't seem to shout about it enough. Here is an example - docs.aws.amazon.com/step-functions...

 

Thank you Matt. I will take a look in detail to see if it solves the issue.