ChunTing Wu

Handling Stale Sets and Thundering Herds of Cache

Recently, I've been studying how Facebook handles caching at scale, and one of the subsections describes how they handle stale sets and thundering herds. They use a lease mechanism to address both problems at once, which I find very interesting.

In fact, I have also written before about how to solve stale sets and thundering herds, though with different methods: I proposed a separate solution for each problem, rather than a single one-size-fits-all answer.

Let's take a look at how Facebook solved two problems at once.

Problem Description

First, let me briefly explain the two problems.

In general, a look-aside (cache-aside) read goes to the cache first; on a miss, it reads from the database instead and writes the result back to the cache. When the database is updated, the cached entry is invalidated rather than refreshed.
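To make the baseline concrete before looking at its failure modes, here is a minimal sketch of the cache-aside pattern; plain dicts stand in for the cache and the database, and the key names are made up for illustration.

```python
# Cache-aside sketch: dicts stand in for the cache and the database.
# (Assumption: single-threaded, no TTLs -- just the read/update shape.)
cache = {}
db = {"user:1": "alice"}

def read(key):
    # 1. Try the cache first.
    if key in cache:
        return cache[key]
    # 2. On a miss, fall back to the database...
    value = db.get(key)
    # 3. ...and write the result back to the cache.
    if value is not None:
        cache[key] = value
    return value

def update(key, value):
    # Writes go to the database; the cached copy is invalidated,
    # not refreshed, so the next read repopulates it.
    db[key] = value
    cache.pop(key, None)
```

The gap between step 2 and step 3 is exactly where the two problems below arise: a delayed write-back can store stale data, and many concurrent misses all fall through to the database.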

Even with this process there are still problems, the most typical being the two mentioned above. A stale set occurs as follows.

(Sequence diagram: how a stale set occurs)

Although B has cleared the cache, A is delayed for some reason, so A writes old data back into the cache, leaving the cache stale and inconsistent with the database.

On the other hand, when a cache miss occurs in a highly concurrent system, all of those clients query the database at once; in other words, the database can be overwhelmed in a short time by the sudden load. This problem is also known as the Dogpile effect.

Facebook's solution

How did Facebook solve these two problems? They use a lease mechanism, described in section 3.2.1 of the paper. The explanation there is brief, roughly the following two paragraphs.

A memcached instance gives a lease to a client to set data back into the cache when that client experiences a cache miss. Verification can fail if memcached has invalidated the lease token due to receiving a delete request for that item.

And:

We configure these servers to return a token only once every 10 seconds per key.

When writing data back to the cache, the client must carry the token it obtained during the cache miss. The cache validates the token before updating, and if validation fails the write is ignored. The token is cleared whenever the database is updated.

Let's explain this with the sequence diagram.

(Sequence diagram: the lease mechanism in action)

The difference from the problem diagram is that the first interaction also acquires a token, and the cache write must carry that token. Because the token was cleared when the database was updated, the cache write is rejected, which solves the first problem.

The second problem is not hard to handle either; we only need a rule for token issuance. A token is handed out at most once every 10 seconds per key, and a client that does not get the token has no authority to update the cache, and therefore no reason to query the database. This solves the Dogpile effect.

Simple implementation

Implementing this logic is not difficult; let's write a simple example in Python with Redis.

import random


class Cache:
    def __init__(self, redis_client, prefix='lease'):
        # Assumes a redis-py client created with decode_responses=True,
        # so GET returns str rather than bytes.
        self.redis = redis_client
        self.prefix = prefix

    def get(self, k):
        v = self.redis.get(k)
        token = str(random.getrandbits(64))  # Facebook's paper uses a 64-bit token.
        # nx=True: only issue a token if none exists; ex=10: 10s refresh window.
        ret = self.redis.set(f'{self.prefix}#{k}', token, nx=True, ex=10)
        return (v, token if ret else None)

    def set(self, k, v, token):
        saved_token = self.redis.get(f'{self.prefix}#{k}')
        if saved_token is not None and token == saved_token:
            return bool(self.redis.set(k, v))
        return False

    def clean(self, k):
        # On a database update, clear both the value and the token.
        self.redis.delete(k, f'{self.prefix}#{k}')

To make it easier to understand, we only show the core implementation here and do not use Lua scripts or pipelines to make the operations atomic.

In get, two values are returned: the cached result and the token. If a token already exists and has not expired, a thundering herd may be in progress, so no token is returned; a client that does not receive a token should not query the database.

In set, besides the key and value to write, we also carry the token. When token validation fails, we skip the cache update and simply return, which avoids stale sets.
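To see the whole flow end to end without a Redis server, here is a hypothetical in-memory variant of the same lease logic plus a client-side read helper; the class name `LeaseCache` and the `read` helper are my own illustrations, not part of the paper.

```python
import random
import time

class LeaseCache:
    """In-memory stand-in for the Redis-backed Cache above.
    Assumption: single process; lease_ttl mirrors the 10s token window."""
    def __init__(self, lease_ttl=10):
        self.data = {}      # key -> cached value
        self.leases = {}    # key -> (token, expiry time)
        self.lease_ttl = lease_ttl

    def get(self, k):
        v = self.data.get(k)
        lease = self.leases.get(k)
        now = time.monotonic()
        if lease is not None and lease[1] > now:
            return (v, None)          # lease already held: no token for us
        token = random.getrandbits(64)
        self.leases[k] = (token, now + self.lease_ttl)
        return (v, token)

    def set(self, k, v, token):
        lease = self.leases.get(k)
        if lease is None or lease[0] != token:
            return False              # token invalidated or never issued
        self.data[k] = v
        return True

    def clean(self, k):
        # Database update: drop both the value and the lease token.
        self.data.pop(k, None)
        self.leases.pop(k, None)

def read(cache, db, k):
    """Client-side read path using the lease."""
    v, token = cache.get(k)
    if v is not None:
        return v                      # cache hit
    if token is None:
        return None                   # another client holds the lease: back off
    v = db.get(k)                     # only the lease holder queries the DB
    cache.set(k, v, token)
    return v
```

A client that gets `None` back because it lacked the token would typically retry after a short wait, by which point the lease holder has usually refilled the cache.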

Conclusion

Stale sets and thundering herds are common problems when using caches, and there are many solutions to them. But when I actually implemented Facebook's solution, I was surprised by how simple it can be; we had been overthinking it before.

I believe the simpler the solution, the better, and I will use Facebook's approach for the problems at hand in the future. In fact, the paper contains many more techniques and optimizations for caching at scale, making it well worth studying.
