Microsoft Azure

Serverless circuit breakers with Durable Entities

Jeff Hollan ・ 9 min read

I remember a call with a customer from my first few months on the Azure Functions team. The customer's function triggered from a queue message and ran a stored procedure in a SQL database. The solution had been working fantastically for months at high scale until one fateful day the SQL database crashed. What resulted was a serverless version of kicking someone while they're down. The function executions started to fail, and as they failed, the messages would be re-queued and retried (which is great! Retrying is generally a best practice, especially for transient failures). But more queue messages kept landing in the queue, on top of a growing number of retries. Within minutes they had gone from hundreds of failed executions to thousands. What's worse, once the SQL database failure was resolved, Azure Functions was ready to hammer that poor SQL database back into a failed state with the mountain of queue messages and retries it had been holding. Eventually, things got back to a steady state, but the team was left with thousands of messages in a dead-letter queue and some battle scars. It's a powerful reminder that sometimes serverless scale can make a bad situation worse.

What the customer wanted was simple enough to understand.

"Can we make it so our function stops triggering if lots of failures start happening?"

It’s a simple problem that has been asked by many a developer. I want to show how you can use the new Durable Entities feature of Azure Functions to create a stateful circuit breaker to achieve exactly that.

Traditional circuit breakers and serverless scale

The pattern isn't a new one. Generally, it's known as the circuit breaker pattern. You can think of it as an electric circuit with a gate. Imagine electricity flowing through a circuit to a destination, with a gate bridging the connection. If you ever need to stop the flow of electricity, the gate "opens," creating a gap in the circuit and stopping the current. When electricity should resume, the gate "closes," reconnecting the circuit.

Many libraries exist that implement this pattern in traditional apps. Polly for .NET is my favorite, but there are plenty to choose from. You define a threshold number of exceptions within a time range - say, more than 30 exceptions in 1 minute - and crossing it "opens" the gate to stop processing. While this works for many applications, serverless presents new challenges where this model doesn't fit.
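As a point of reference, here's a minimal sketch of what this looks like with Polly's advanced circuit breaker. The thresholds are illustrative, and `ProcessMessageAsync` and `message` are placeholder names for your own processing code:

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

// Open the circuit when at least 50% of calls fail, provided at least 30
// calls occurred in the 1-minute sampling window; stay open for 30 seconds.
AsyncCircuitBreakerPolicy breaker = Policy
    .Handle<Exception>()
    .AdvancedCircuitBreakerAsync(
        failureThreshold: 0.5,
        samplingDuration: TimeSpan.FromMinutes(1),
        minimumThroughput: 30,
        durationOfBreak: TimeSpan.FromSeconds(30));

// Wrap each message-processing call in the policy; once the circuit is
// open, calls throw BrokenCircuitException instead of hitting the database.
await breaker.ExecuteAsync(() => ProcessMessageAsync(message));
```

The catch, as the next section explains, is that this policy's exception count lives in the memory of a single process.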

Single server vs distributed apps

Assume I'm processing all of my queue messages on a single server. Here, libraries like Polly have worked great. Polly keeps an in-memory count of exceptions as they occur, and it can easily keep track of the threshold because every exception will only occur on that single instance.

Distributed applications like serverless functions are much different. Within seconds under load, an Azure Function can scale to dozens of instances. Imagine now that you have 50 active function instances triggering on events. You could have a circuit breaker library running in your app - but you'd have 50 instances of that circuit breaker library. Now imagine that you get 50 exceptions all at once, but each exception spreads evenly across your instances. Each library, like Polly, will think everything is healthy. "Hey I'm good, I only saw one exception, and my threshold is 30, so let's keep the circuit closed." But the global story is much different. Globally the threshold was crossed, but because each instance was processing in isolation, there was nothing that knew the state and health of the entire circuit. Not to mention, even if one instance "halted" processing, the other 49 may keep chugging along.

The scaling and abstraction of all of these instances make traditional circuit breaker libraries unpredictable. Sometimes 30 exceptions could open the circuit breaker if they all happen on a single instance, while other times it could take many more exceptions before the gate successfully opens.

State management and distributed circuits

The missing piece here for a distributed circuit breaker is some external state that can monitor the health and status of the entire circuit. There's no single answer on how that state could manifest. The state could be Azure Alerts and metrics that aggregate exceptions and take some action. The state could be an Azure Logic App like an example I created a few years back before we had Durable Entities. My current weapon of choice though is a Durable Entity.

Let's step away from this circuit breaker problem for a beat and introduce what Durable Entities are.

Durable Entities and stateful functions

Serverless functions are no longer the stateless ephemeral snippets of code you've likely heard of. With new capabilities like Azure Durable Functions you can now write Azure Functions that maintain state for an indefinite amount of time. This is extremely useful when you need to orchestrate or coordinate work as it moves through a system.

Recently we've released a new flavor of Durable Functions called Durable Entities. You can think of a Durable Entity as a function that can have infinite instances, each with its own unique state and ID. The basic example is something like a counter. Let's say you've now become the lead architect for Fitbit and need to build a solution that receives step counts for thousands of users and keeps each count stored in state. Each Fitbit device will only ever send one signal: "Add Step." You could write a stateless function that interacts with a database and does something like "look up this user, get the current step count, add one, and save it back to the database." But that's a bit cumbersome to write, and things get messy when you consider how to prevent two signals from executing at the same time for the same user - two unique step signals could end up incrementing the database by only one.

Durable Entities lets me describe that entity - an instance of a counter - while maintaining state and single-threaded operations on each instance. The code ends up looking like this:

using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Newtonsoft.Json;

public class Counter
{
    // Persisted entity state: this user's current step count
    [JsonProperty("steps")]
    public int CurrentValue { get; set; }

    // Operations that can be invoked on the entity
    public void AddStep() => this.CurrentValue++;
    public void Reset() => this.CurrentValue = 0;

    [FunctionName(nameof(Counter))]
    public static Task Run([EntityTrigger] IDurableEntityContext ctx)
        => ctx.DispatchAsync<Counter>();
}

That's it - that's an entire Azure Function app. Once I publish this, I can use the Durable SDK or HTTP API to do things like "AddStep for Jeff" or "AddStep for Chris." I could have infinite instances of these counters, each one storing the state for that user. And these Durable Entities are just as serverless as Azure Functions. If I send 1,000 steps for Jeff, my function will scale and run and process those 1,000 events (guaranteeing that after all 1,000 are processed my count is "1000"). I pay only for those 1,000 steps. If I never call that function again, the state still lives (in Azure Storage by default), but I never pay for the function compute until I use it again.
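For illustration, signaling an entity from another function with the Durable SDK looks roughly like this. The HTTP trigger, route, and `userId` key are assumptions for the example, not part of the sample above:

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class AddStepApi
{
    [FunctionName("AddStep")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post", Route = "steps/{userId}")] HttpRequest req,
        string userId,
        [DurableClient] IDurableEntityClient client)
    {
        // One-way (fire-and-forget) signal: "call AddStep on the Counter
        // entity keyed by this user." The runtime queues and serializes
        // operations per entity, which is what gives the single-threaded
        // guarantee described above.
        var entityId = new EntityId(nameof(Counter), userId);
        await client.SignalEntityAsync(entityId, "AddStep");
        return new AcceptedResult();
    }
}
```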

Durable Entities make managing state extremely intuitive. We've seen tremendous interest in scenarios like IoT, where each IoT device could persist and expose state operations as a durable entity. My personal favorite use of entities though is to solve our circuit breaker dilemma.

Durable Entities and Circuit State

Let's pull these threads back together. If you recall from the beginning, we have a problem: we need to be able to manage the state and rate of exceptions external to each function. We also have a pretty slick new tool of Durable Entities. Now I want to show how you can combine these for a serverless circuit breaker.

First, let's break down the flow of how things will work. There are two main components: the function app that triggers and processes messages from something like an Azure Event Hub or queue, and a durable entity that monitors and manages the state of the circuit.

Azure Function

  • Trigger and try to process the message
  • If there is an exception, send a signal to the durable entity to let it know it hit an exception
Azure Function Code
[FunctionName("MyFunction")]
public async Task Run([QueueTrigger("myqueue")] Message message, ILogger log)
{
    try
    {
        // try to process the message
    }
    catch (Exception ex)
    {
        // Hit an exception - send a signal to the Durable Entity
        // (_client is an HttpClient; entityUrl is the entity's HTTP endpoint)
        await _client.PostAsJsonAsync(entityUrl, new FailureRequest
        {
            FailureTime = DateTime.UtcNow
        });

        // Rethrow (preserving the stack trace) so the queue retries kick in
        throw;
    }
}
}

Durable Entity

  • Keep track of how many exceptions have been reported across all scaled out instances of a function
  • If several exceptions within a certain period are reported, break the circuit
  • When breaking the circuit, use the Azure API to stop the Azure Function
Durable Function Code
[JsonObject(MemberSerialization.OptIn)]
public class Circuit
{
    [JsonProperty]
    [JsonConverter(typeof(StringEnumConverter))]
    public CircuitState state = CircuitState.Closed;

    // Current rolling window of failures reported for this circuit
    [JsonProperty]
    public IDictionary<string, FailureRequest> FailureWindow = new Dictionary<string, FailureRequest>();

    public void CloseCircuit() => state = CircuitState.Closed;
    public void OpenCircuit() => state = CircuitState.Open;

    public async Task AddFailure(FailureRequest failure)
    {
        // Check to make sure the circuit isn't already open
        if (state == CircuitState.Open)
        {
            _log.LogInformation($"Tried to add additional failure to {Entity.Current.EntityKey} that is already open. Close the circuit to resume processing.");
            return;
        }

        // Add this failure to the stateful aggregate
        FailureWindow.Add(failure.RequestId, failure);

        // Calculate the time window we should evaluate exceptions for
        var thresholdCutoff = failure.FailureTime.Subtract(windowSize);

        // Filter the window to only the exceptions within the cutoff timespan
        FailureWindow = FailureWindow.Where(p => p.Value.FailureTime >= thresholdCutoff).ToDictionary(p => p.Key, p => p.Value);

        if (FailureWindow.Count >= failureThreshold)
        {
            _log.LogCritical($"Break this circuit for entity {Entity.Current.EntityKey}!");

            // Kick off an orchestration to disable the Azure Function app
            await _durableClient.StartNewAsync(nameof(OpenCircuitOrchestrator.OpenCircuit), failure.ResourceId);

            // Mark the circuit as open
            state = CircuitState.Open;
        }
        else
        {
            _log.LogInformation($"The circuit {Entity.Current.EntityKey} currently has {FailureWindow.Count} exceptions in the window of {windowSize}");
        }
    }

    // _log, _durableClient, windowSize, and failureThreshold are supplied
    // via constructor injection and configuration in the full sample
    [FunctionName(nameof(Circuit))]
    public static Task Run(
        [EntityTrigger] IDurableEntityContext ctx) => ctx.DispatchAsync<Circuit>();
}

There are other ways you can chain these together - like adding some in-memory retries to the function or even having the function explicitly check the state of the circuit before processing the message, but this flow is my favorite and optimizes for high throughput and low cost. I can deploy the durable entity and function app, and once the durable entity detects the number of failures is too high it will automatically disable the Azure Function app so it stops processing. It is worth noting you can use the same durable entity to monitor and manage the circuit for many different function apps in your subscription at the same time.
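The OpenCircuit orchestration referenced by the entity isn't shown above; it uses the Azure management plane to disable the app. A minimal sketch of an activity doing that by stopping the site through the ARM REST API could look like the following. The activity name, the use of `DefaultAzureCredential`, and the stop-the-whole-site approach are assumptions for this sketch, not necessarily what the full sample on GitHub does:

```csharp
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
using Azure.Core;
using Azure.Identity;
using Microsoft.Azure.WebJobs;

public static class OpenCircuitActivity
{
    private static readonly HttpClient _http = new HttpClient();

    // resourceId looks like:
    // /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Web/sites/{app}
    [FunctionName("StopFunctionApp")]
    public static async Task Run([ActivityTrigger] string resourceId)
    {
        // Acquire an Azure AD token for the management plane
        // (e.g., via the app's managed identity)
        var token = (await new DefaultAzureCredential().GetTokenAsync(
            new TokenRequestContext(new[] { "https://management.azure.com/.default" }))).Token;

        // POST to the Web Apps "stop" operation for this site
        var request = new HttpRequestMessage(
            HttpMethod.Post,
            $"https://management.azure.com{resourceId}/stop?api-version=2022-03-01");
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", token);

        var response = await _http.SendAsync(request);
        response.EnsureSuccessStatusCode();
    }
}
```

Running this as an activity inside an orchestration means the disable call gets Durable Functions' at-least-once execution guarantees for free.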

The full code for my sample is on GitHub.

Sample circuit breaker scenario

To help make this pattern clear, let's walk through an example. Imagine I have the Azure Function above and the Durable Entity deployed in my subscription. We'll use the same scenario as we started this post with - the function triggers on the queue, interacts with a SQL server, and completes processing.

My function could be running for months without issue. All during this time, my durable entity is sitting idle (and free) in my subscription. Assume the function is triggering on queue messages across 100 active instances.

Suddenly, the SQL server starts having some issues. A few of the messages start throwing exceptions. An instance hits an exception and lets the durable entity know. The durable entity keeps track that 1 exception has happened in the last 30 seconds. Moments later, 20 other instances hit an exception. Each sends a signal to the durable entity, which now knows that 21 exceptions have happened in the span of a few seconds. Finally, the failures continue to mount, and the durable entity quickly detects that the threshold has been crossed - it has state for over 30 exceptions in a 30-second window.

The logic kicks in to break and open the circuit. It makes a call to the Azure APIs and disables the function app so it stops triggering. No messages are lost - they stay safely in the Service Bus queue. But rather than creating cascading failures, I've gracefully broken the circuit until the health of the system can be confirmed, the circuit is closed, and processing resumes.

With durable entities and Azure Functions we've very efficiently solved the problem we started with:

"Can we make it so our function stops triggering if lots of failures start happening?"

More than just functions

You can use Durable Entities to manage the state of any distributed app. Polly recently announced support for Durable Entity powered circuit breakers that can manage the state of any application. I'd encourage you to check them out, and give durable entities a spin!


Discussion


Hi Jeff,
This sounds great for many of our common problems; however, I can see this almost works for our largest problem (but not quite). Think of the SQL database in your example above, and imagine it's an Azure SQL Database with a set amount of DTUs assigned. You then have a queue as in your example. We would want to somehow throttle the message queue to consume as many DTUs as possible without just causing a flood of throttling errors (HTTP 429s).
Can you see a nice way to use this pattern to throttle rather than circuit break?
It really feels like there's an elegant way to use this to achieve this.
This is effectively the situation where the circuit closes again and potentially all the newly queued work while the circuit was open then overwhelms the database immediately.

Bryden


Hi Bryden

"We would want to somehow throttle the message queue to consume as many DTUs as possible"

Isn't this just limiting the number of items in the batch read from the queue, based on load tests from a perf environment? You could even make the number configurable and set it externally from scaling logic on your Azure SQL DB.
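For context on the batch limit being discussed: with a Storage-queue trigger, that knob lives in the function app's host.json. A sketch of the relevant settings (the values here are illustrative, not recommendations):

```json
{
  "version": "2.0",
  "extensions": {
    "queues": {
      "batchSize": 16,
      "newBatchThreshold": 8,
      "maxDequeueCount": 5
    }
  }
}
```

`batchSize` caps how many messages each instance fetches at once, and `newBatchThreshold` controls how eagerly the next batch is retrieved; together they bound per-instance concurrency, though not the number of instances.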


Thanks Bryden - I think what you've described fits more into rate limiting and throttling than circuit breaking. It's a feature we've wanted to implement at a more granular level - hopefully in upcoming months we'll at least have instance-count throttling in all plans (right now it's just a premium plan feature).


Jeff,
While that will be useful, instance throttling is very much a blunt force instrument. We are currently looking at a resource based consumption throttler. So monitoring the use of each resource that gets used and has some sort of rate limit associated. Then we are building something that will handle that accordingly. I'll sit down and thrash out in detail whether we can get something working that would handle this nicely.
In particular the reason we are looking at this is that we are effectively partitioning our data across multiple rate limited pieces of storage, so in theory we'd like to avoid the situation where we are limiting based on our lowest throughput piece of storage.
Rather we'd like to circuit break those calls quickly and return a 429 or similar to the caller and continue to consume all of our available throughput on everything else.
At its worst, the durable entities sound like a potentially better store for this than the Redis cache we are using now.
I guess what I'm saying is that I suspect for most consumers, instance level throttling potentially won't cut it (also because a single instance is quite capable of completely overwhelming a downstream resource all on its own). So investigating a more granular level of throttling would be well worthwhile. For now, we are quite happy to continue investigating, but if you guys had some clever thoughts that might improve our direction that would be great.

Yes, makes sense. Would be interested to learn more about what you are thinking. Beyond the "blunt force" instance limiting, we have been evaluating execution limiting, but what you're describing sounds even more granular than that. Almost something like "I have 400 locks for SQL, 2000 locks for Azure Storage - hey Functions, do your thing, but before you can run this line of code you need to make sure you have a lock first." Is that accurate?


How would you handle rate/throttling limits from a downstream API inside your Azure Function? You can retry, but what if the retrying takes longer than the Azure Function default timeout? Some downstream APIs provide a Retry-After time - what if it exceeds the 5-minute default timeout?


Great article Jeff.

To break the circuit, rather than stopping the entire function app, you could also just disable the specific function with the queue binding, or update an environment variable on the app service that the code uses as a feature flag/toggle, since it's cheap to read the current state of the circuit (open/closed).
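The per-function disable mentioned here works via an app setting on the function app; the function name below is a placeholder matching the earlier example:

```
AzureWebJobs.MyFunction.Disabled = true
```

Setting it back to `false` (or removing it) re-enables just that one trigger while the rest of the app keeps running.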

Hi Jeff,

How can I use Cosmos or Redis as my storage in Durable Functions?
Thanks!

Djalma Freitas


Hi Jeff!

Thanks for a great post! Maybe a stupid question from me: as I can see, the circuit breaker stops the function, but how can I start the function again automatically?