Emily Fortuna for Temporal

Posted on May 24, 2023 • Edited on Jun 29, 2023

Saga Pattern Made Easy

#java #go #microservices #distributedsystems

This is part 2 of a series. To read about compensating actions, check out Compensating Actions, Part of a Complete Breakfast with Sagas.

Trip planning with sagas but without the baggage

So the last time your family went to the local park everyone was talking about the Saga Design Pattern, and now you want to know what it is, if you should make it part of your own distributed system design, and how to implement it. As we all know, software design is all about fashionable¹ trends².

The case for sagas

If you’re wondering if the saga pattern is right for your scenario, ask yourself: does your logic involve multiple steps, some of which span machines, services, shards, or databases, for which partial execution is undesirable? Turns out, this is exactly where sagas are useful. Maybe you are checking inventory, charging a user’s credit card, and then fulfilling the order. Maybe you are managing a supply chain. The saga pattern is helpful because it basically functions as a state machine storing program progress, preventing multiple credit card charges, reverting if necessary, and knowing exactly how to safely resume in a consistent state in the event of power loss.

A common real-life example used to explain the way the saga pattern compensates for failures is trip planning. Suppose you are itching to soak up the rain in Duwamish territory, Seattle. You’ll need to purchase an airplane ticket, reserve a hotel, and get a ticket for a guided backpacking experience on Mount Rainier. All three of these tasks are coupled: if you’re unable to purchase a plane ticket, there’s no reason to get the rest. If you get a plane ticket but have nowhere to stay, you’re going to want to cancel that plane reservation (or retry the hotel reservation or find somewhere else to stay). Lastly, if you can’t book that backpacking trip, there’s really no other reason to come to Seattle, so you might as well cancel the whole thing. (Kidding!)

[Figure: a simplistic model of compensating in the face of trip planning failures.]

There are many “do it all, or don’t bother” software applications in the real world: if you successfully charge the user for an item but your fulfillment service reports that the item is out of stock, you’re going to have upset users if you don’t refund the charge. If you have the opposite problem and accidentally deliver items “for free,” you’ll be out of business. If the machine coordinating a machine learning data processing pipeline crashes but the follower machines carry on processing the data with nowhere to report their results, you may have a very expensive compute resources bill on your hands³. In all of these cases, some sort of “progress tracking” plus compensation code to deal with these “do-it-all-or-don’t-do-any-of-it” tasks is exactly what the saga pattern provides. In saga parlance, these sorts of “all or nothing” tasks are called long-running transactions. This doesn’t necessarily mean such actions run for a “long” time, just that they require more steps in logical time⁴ than something running locally and interacting with a single database.

How do you build a saga?

A saga is composed of two parts:

1. Defined behavior for “going backwards” if you need to “undo” something (i.e., compensations)
2. Behavior for striving towards forward progress (i.e., saving state to know where to recover from in the face of failure)

The avid reader of this blog will remember I recently wrote a post about compensating actions. As you can see from above, compensations are but one half of the saga design pattern. The other half, alluded to above, is essentially state management for the whole system. The compensating actions pattern helps you know how to recover if an individual step (or in Temporal terminology, an Activity) fails. But what if the whole system goes down? Where do you start back up? Since not every step might have a compensation attached, you’d be forced to make your best guess based on stored compensations. The saga pattern keeps track of where you currently are so that you can keep driving towards forward progress.
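To make the two halves concrete, here is a minimal sketch in plain Go with no Temporal involved. The `step` type, the `runSaga` function, and the in-memory `done` map are hypothetical names invented for illustration; a real system would persist that progress record durably. Unlike the workflow code later in this post, which registers each compensation before running its step, this sketch only compensates steps that are known to have completed.

```go
package main

import (
	"errors"
	"fmt"
)

// step pairs a forward action with the compensation that undoes it.
// All names here are hypothetical, for illustration only.
type step struct {
	name       string
	action     func() error
	compensate func()
}

// runSaga runs steps in order, skipping any already recorded in done
// (the "forward progress" half). On failure it runs the compensations of
// the completed steps in reverse order (the "going backwards" half).
func runSaga(steps []step, done map[string]bool) error {
	var completed []step
	for _, s := range steps {
		if done[s.name] {
			completed = append(completed, s)
			continue // finished before a crash; do not repeat it
		}
		if err := s.action(); err != nil {
			for i := len(completed) - 1; i >= 0; i-- {
				completed[i].compensate()
				delete(done, completed[i].name)
			}
			return fmt.Errorf("step %s failed: %w", s.name, err)
		}
		done[s.name] = true // a real system would persist this durably
		completed = append(completed, s)
	}
	return nil
}

// demo books a trip where the excursion fails and returns the order of events.
func demo() []string {
	var events []string
	undo := func(e string) func() { return func() { events = append(events, e) } }
	ok := func(e string) func() error {
		return func() error { events = append(events, e); return nil }
	}
	soldOut := func() error { return errors.New("sold out") }
	steps := []step{
		{"hotel", ok("bookHotel"), undo("cancelHotel")},
		{"flight", ok("bookFlight"), undo("cancelFlight")},
		{"excursion", soldOut, undo("cancelExcursion")},
	}
	_ = runSaga(steps, map[string]bool{})
	return events
}

func main() {
	fmt.Println(demo()) // [bookHotel bookFlight cancelFlight cancelHotel]
}
```

If the process crashed after the hotel was booked, calling `runSaga` again with a `done` map that already contains `"hotel"` would skip straight to the flight, which is the bookkeeping the rest of this post hands over to Temporal.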

So how do I implement sagas in my own code?

I’m so glad you asked.

leans forward

whispers in ear

This is a little bit of a trick question because by running your code with Temporal, you automatically get your state saved and retries on failure at any level. That means the saga pattern with Temporal is as simple as coding up the compensation you wish to take when a step (Activity) fails. The end.

The why behind this magic is that Temporal, by design, automatically keeps track of the progress of your program and can pick up where it left off in the face of catastrophic failure. Additionally, Temporal will retry Activities on failure, without you needing to add any code beyond specifying a Retry Policy, e.g.:

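import io.temporal.common.RetryOptions;
import java.time.Duration;

// Wait 1s before the first retry, doubling each time up to 100s, for at most 500 attempts.
RetryOptions retryOptions = RetryOptions.newBuilder()
       .setInitialInterval(Duration.ofSeconds(1))
       .setMaximumInterval(Duration.ofSeconds(100))
       .setBackoffCoefficient(2)
       .setMaximumAttempts(500).build();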

To learn more about how this automagic works, stay tuned for my upcoming post on choreography and orchestration, the two common ways of implementing sagas.

So to express the high-level logic of my program with both the vacation booking steps plus compensations I wish to take on failure, it would look like the following in pseudocode:

try:
   registerCompensationInCaseOfFailure(cancelHotel)
   bookHotel
   registerCompensationInCaseOfFailure(cancelFlight)
   bookFlight
   registerCompensationInCaseOfFailure(cancelExcursion)
   bookExcursion
catch:
   run all compensation activities

In Java, the Saga class keeps track of compensations for you:

@Override
public void bookVacation(BookingInfo info) {
   Saga saga = new Saga(new Saga.Options.Builder().build());
   try {
       saga.addCompensation(activities::cancelHotel, info.getClientId());
       activities.bookHotel(info);

       saga.addCompensation(activities::cancelFlight, info.getClientId());
       activities.bookFlight(info);

       saga.addCompensation(activities::cancelExcursion, 
                            info.getClientId());
       activities.bookExcursion(info);
   } catch (TemporalFailure e) {
       saga.compensate();
       throw e;
   }
}

In other language SDKs you can easily write the addCompensation and compensate functions yourself. Here's a version in Go:

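// Compensations records compensation activities and their arguments
// in the order they were registered.
type Compensations struct {
    compensations []any
    arguments     [][]any
}

// AddCompensation registers an activity to run if the workflow needs to be undone.
func (s *Compensations) AddCompensation(activity any, parameters ...any) {
    s.compensations = append(s.compensations, activity)
    s.arguments = append(s.arguments, parameters)
}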

func (s Compensations) Compensate(ctx workflow.Context, inParallel bool) {
    if !inParallel {
        // Compensate in Last-In-First-Out order, to undo in the reverse order that activities were applied.
        for i := len(s.compensations) - 1; i >= 0; i-- {
            errCompensation := workflow.ExecuteActivity(ctx, s.compensations[i], s.arguments[i]...).Get(ctx, nil)
            if errCompensation != nil {
                workflow.GetLogger(ctx).Error("Executing compensation failed", "Error", errCompensation)
            }
        }
    } else {
        selector := workflow.NewSelector(ctx)
        for i := 0; i < len(s.compensations); i++ {
            execution := workflow.ExecuteActivity(ctx, s.compensations[i], s.arguments[i]...)
            selector.AddFuture(execution, func(f workflow.Future) {
                if errCompensation := f.Get(ctx, nil); errCompensation != nil {
                    workflow.GetLogger(ctx).Error("Executing compensation failed", "Error", errCompensation)
                }
            })
        }
        for range s.compensations {
            selector.Select(ctx)
        }
    }
}

The high-level Go code of steps and compensations will look very similar to the Java version:

func TripPlanningWorkflow(ctx workflow.Context, info BookingInfo) (err error) {
   options := workflow.ActivityOptions{
       StartToCloseTimeout: time.Second * 5,
       RetryPolicy:         &temporal.RetryPolicy{MaximumAttempts: 2},
   }

   ctx = workflow.WithActivityOptions(ctx, options)

   var compensations Compensations

   defer func() {
       if err != nil {
           // activity failed, and workflow context is canceled
           disconnectedCtx, _ := workflow.NewDisconnectedContext(ctx)
           compensations.Compensate(disconnectedCtx, true)
       }
   }()

   compensations.AddCompensation(CancelHotel)
   err = workflow.ExecuteActivity(ctx, BookHotel, info).Get(ctx, nil)
   if err != nil {
       return err
   }

   compensations.AddCompensation(CancelFlight)
   err = workflow.ExecuteActivity(ctx, BookFlight, info).Get(ctx, nil)
   if err != nil {
       return err
   }

   compensations.AddCompensation(CancelExcursion)
   err = workflow.ExecuteActivity(ctx, BookExcursion, info).Get(ctx, nil)
   if err != nil {
       return err
   }

   return err
}

The high-level sequence of code above is called a Temporal Workflow. And, as mentioned before, by running with Temporal, we don’t have to worry about implementing any of the bookkeeping to track our progress via event sourcing, or about adding retry and restart logic, because that all comes for free. So when writing code that runs with Temporal, you only need to worry about writing compensations.

Idempotency

Well, okay, there is a second thing to “worry about.” As you may recall, sagas consist of two parts, the first part being those compensations we coded up previously. The second part, “striving towards forward progress,” involves potentially retrying an Activity in the face of failure. Let’s dig into one of those steps, shall we? Temporal does all the heavy lifting of retrying and keeping track of your overall progress. However, because code can be retried, you, the programmer, need to make sure each Temporal Activity is idempotent. This means the observed result of bookFlight is the same whether it is called one time or many times. To make this a little more concrete, a function that sets some field foo=3 is idempotent because afterwards foo will be 3 no matter how many times you call it. The function foo += 3 is not idempotent because the value of foo depends on the number of times the function is called. Non-idempotency can sometimes be more subtle: if you have a database that allows duplicate records, a function that calls INSERT INTO foo (bar) VALUES (3) will blithely create as many records in your table as the number of times you call it, and is therefore not idempotent. Naive implementations of functions that send emails or transfer money are also not idempotent by default.
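The foo example can be spelled out in a few lines of Go. This is only an illustrative sketch: `record`, `setFoo`, `addFoo`, and `applyTimes` are made-up names, and calling a function several times in a loop stands in for an Activity being retried.

```go
package main

import "fmt"

type record struct{ foo int }

// setFoo is idempotent: foo ends up as 3 however many times it runs.
func setFoo(r *record) { r.foo = 3 }

// addFoo is not idempotent: every extra call changes the result.
func addFoo(r *record) { r.foo += 3 }

// applyTimes calls f on a fresh record n times and returns the resulting foo,
// simulating an operation that was attempted n times.
func applyTimes(f func(*record), n int) int {
	r := &record{}
	for i := 0; i < n; i++ {
		f(r)
	}
	return r.foo
}

func main() {
	fmt.Println(applyTimes(setFoo, 1), applyTimes(setFoo, 3)) // 3 3
	fmt.Println(applyTimes(addFoo, 1), applyTimes(addFoo, 3)) // 3 9
}
```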

If you’re backing away slowly right now because your Real World Application does a lot more complex things than set foo=3, take heart. There is a solution. You can use a distinct identifier, called an idempotency key (or sometimes a referenceId or something similar), to uniquely identify a particular transaction and ensure the hotel booking transaction occurs effectively once. How this idempotency key is defined depends on your application’s needs. In the trip planning application, clientId, a field in BookingInfo, is used to uniquely identify transactions.

type BookingInfo struct {
   Name     string
   ClientId string
   Address  string
   CcInfo   CreditCardInfo
   Start    date.Date
   End      date.Date
}

You also probably saw the clientId used to register the compensation in the above Java workflow code:

saga.addCompensation(activities::cancelHotel, info.getClientId());

However, using clientId as our key prevents a particular person from booking more than one vacation at once. This is probably what we want. Some business applications, though, may choose to build an idempotency key by combining the clientId and the workflowId, which allows a client to have more than one vacation booking in progress at once. If you wanted a truly unique idempotency key, you could pass in a UUID to the workflow. The choice is up to you based on your application’s needs.
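As a rough sketch of the combined-key option, a key builder could look like the following. The `idempotencyKey` helper and the example IDs are hypothetical, and the sketch assumes neither ID contains the `/` separator.

```go
package main

import "fmt"

// idempotencyKey is a hypothetical helper that scopes a key to one client and
// one workflow. It assumes neither ID contains the "/" separator.
func idempotencyKey(clientID, workflowID string) string {
	return clientID + "/" + workflowID
}

func main() {
	first := idempotencyKey("client-42", "trip-seattle")
	retry := idempotencyKey("client-42", "trip-seattle")
	other := idempotencyKey("client-42", "trip-denver")

	// A retry within the same workflow produces the same key...
	fmt.Println(first == retry) // true
	// ...while a second trip for the same client gets a different one.
	fmt.Println(first == other) // false
}
```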

Many third-party APIs that handle money already accept idempotency keys for this very purpose. If you need to implement something like this yourself, use atomic writes to keep a record of the idempotency keys you’ve seen so far, and don’t perform an operation if its idempotency key is in the “already seen” set.
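If you do end up rolling your own, the "already seen" check might be sketched like this. The `bookingStore` type is a hypothetical in-memory stand-in for illustration only; a mutex plays the role of the atomic write, whereas a production service would more likely rely on a database transaction or a unique constraint.

```go
package main

import (
	"fmt"
	"sync"
)

// bookingStore is a hypothetical in-memory stand-in for a database that can
// atomically check for an idempotency key and record it.
type bookingStore struct {
	mu   sync.Mutex
	seen map[string]string // idempotency key -> confirmation from the first call
	next int               // number of bookings actually performed
}

func newBookingStore() *bookingStore {
	return &bookingStore{seen: make(map[string]string)}
}

// BookHotel performs the booking at most once per key. Repeated calls with
// the same key return the original confirmation instead of booking again.
func (s *bookingStore) BookHotel(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if conf, ok := s.seen[key]; ok {
		return conf // already seen: skip the side effect
	}
	s.next++
	conf := fmt.Sprintf("confirmation-%d", s.next)
	s.seen[key] = conf
	return conf
}

func main() {
	store := newBookingStore()
	first := store.BookHotel("client-42")
	retry := store.BookHotel("client-42") // e.g. the Activity was retried
	fmt.Println(first == retry, store.next) // true 1
}
```

Returning the stored confirmation, rather than an error, lets a retried call look exactly like the first successful one to its caller.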

Benefits vs Complexity

The saga pattern does add complexity to your code, so it’s important to not implement it in your code just because you have microservices. However, if you need to complete a task (like booking a trip with an airfare and hotel) that involves multiple services and partial execution is not actually a success, then a saga will be your friend. Additionally, if you find your saga getting particularly unwieldy, it may be time to reconsider how your microservices are divided up, and roll up the ol’ sleeves to refactor. Overall, Temporal makes implementing the saga pattern in your code comparatively trivial since you only need to write the compensations needed for each step. Stay tuned for my next post, where I dig into sagas and subscription scenarios, where Temporal particularly shines in reducing complexity when working with sagas.

The full repository that uses the code mentioned in this article can be found on GitHub:

If you want to see other tutorials of sagas using Temporal, please check out the following resources:

Additionally one of my colleagues, Dominik Tornow, gave an intro to sagas on YouTube.

Learn more about Temporal in our courses, tutorials, docs, and videos.

Notes

¹ Obviously, don’t redesign your system just because it’s the new hotness. Unless it’s a new JavaScript framework. Then npm install that new package with due haste. 😜
² Don’t worry, sagas aren’t a trend; they’ve been around in databases since the 80s. You can take comfort knowing your project has a classic elegance to its design.
³ Not that the author has absolutely any experience with this scenario whatsoever. coughs in the price of a new car 😬
⁴ Logical time is a notion in distributed computing that describes the timing of events happening on different machines, since those machines may not share a physically synchronized global clock. Logical time is simply a causal ordering of the events that occurred on these machines. In the case of long-running transactions, it basically boils down to having many “steps” that take place on different machines.

Top comments (8)

Michael Tharrington • May 24 '23

Thanks so much for sharing your post here, Emily!

Really appreciated your humor throughout and the resource links + notes at the end. Awesome writing on an interesting topic! 🙌

Emily Fortuna • Jun 15 '23

Thank you, @michaeltharrington!

Sergiy Yevtushenko • May 28 '23

Saga pattern is specific to microservices rather than distributed systems in general. Microservices is one of those primitive distributed systems which do not maintain consensus. This results in inability to perform any tasks which require coordination between several nodes.

Implementation of Saga pattern ought to be done at business logic level and each type of transaction must be implemented separately. Each implementation is fragile and error prone. Mixing business logic and Saga steps makes code hard to read, understand, support and maintain. Finally, Saga pattern does not provide any kind of isolation between transactions. This may result in subtle, hard to nail down and fix bugs. Those bugs may cause data loss or corruption.

Emily Fortuna • Jun 15 '23

Aren't steps of your Saga business logic in some form? How do you propose to separate them?

Sergiy Yevtushenko • Jun 15 '23

Usually they describe states of some business entities, for example orders. Such states usually modelled as dedicated fields in those business entities. Such an approach allows easy handling of transitions using traditional generic transaction mechanisms. But this option assumes shared data, which is a big "no" for microservices.

Emily Fortuna • Jun 27 '23

Ah, got it. You're arguing this is an issue with the saga pattern itself. It's definitely not always the first choice. 2 phase commit is "cleaner", but there are cases where 2PC is not possible (throughput is more important, and 2PC has the possibility to hang)

Sergiy Yevtushenko • Jun 28 '23 • Edited

You've missed whole point: distributed transaction in microservices is the problem, not a particular way to perform them. Microservices by design, intentionally reduce coupling between nodes (services) down to the level, where coordination necessary to perform distributed transaction is impossible. Moreover, in properly designed microservices-based system they should be unnecessary, as microservice governs all necessary data and can use regular (and not distributed) transaction mechanisms.
I think that initial idea of microservices was understood incorrectly and this results in all sorts of issues. Handling connectivity issues and transactions at business logic level are among ugliest ones as they screw up design and result in fragile and error prone implementation. But if we return to the original idea, everything gets natural and does not cause issues mentioned above:

no distributed transactions and no fragile logic which needs to be created on case-per-case base (saga) or handled across unreliable nodes (2PC) at business logic level
connectivity issues are natural part the business logic in this case. It's sort of browser - web site situation: when we're typing URL in browser or clicking link at page, we're expecting that this operation may fail for various reasons, including connectivity. Retrying and/or postponing request also part of business logic in this case.

If we still need distributed transactions because domain is too big to fit into one service, we just need to use design approaches which support them. There is nothing wrong with that (quite the opposite), but technically those approaches are not microservices and should not be treated as such. For example, requirement to govern all data for EDA components (which usually also called "services") is not applicable to them. They are by design tightly coupled via reliable data storage - streaming/queuing or pub-sub infrastructure.

In other words, each type of design has its own area of application and problem arises when they are used incorrectly. And main cause of improper choice of design - incorrect selection of domain boundaries. We're so focused on technical side that forget that domain is defined by business. And vast majority of businesses have only one domain. Only large and very large organizations usually are big enough to have more than one domain. Such companies are rare enough and therefore microservices should be rare too.

I've made a more detailed article dedicated to the issue.

Adophilus • Aug 1 '23 • Edited

I'm too dumb to understand this 😭. I'll try reading it again
