DEV Community 👩‍💻👨‍💻

ChunTing Wu
ChunTing Wu

Posted on

Solving Dogpile Effect Completely in Code

Last time, we talked about how to solve Dogpile Effect.

https://betterprogramming.pub/solving-dogpile-effect-9d869174d302

Three approaches are mentioned as follows.

  1. Warm Up Cache
  2. Extend Cache Time
  3. Exclusive Lock

However, we also mentioned that each of the three approaches has its own applicable scenario and its corresponding potential risks. So is there a way to extract the advantages of each and construct a more complete solution?

This article will provide an example and explain my ideas.

Solution Concept

Extending the cache time effectively improves the availability of the cache. When the cache is not valid and is requested at the same time, only one request can get through the cache and into the backend system. The rest of the requests will get the original result as the cache time is extended.

However, when concurrent requests occur at the same time ( which is generally uncommon), there is still the possibility of multiple requests entering the backend system, hence the exclusive lock approach.

Nevertheless, the cost of using exclusive locks all the time is too high, and we should try to minimize the use of exclusive locks if possible. Then, use the exclusive lock only when the cache does not exist and there is a need to access the backend system, else just use the extend cache time.

The whole process is as follows.

Image description

First of all, determine whether the cache exists, if the cache exists we still have to determine whether the cache is expired. If everything is fine, we can just take the original value of the cache, but if the cache is expired, we must enter the process of updating the cache.

In order to avoid the impact of high concurrent requests, all update cache processes should try to acquire a lock.

On the other hand, if the cache does not exist from the beginning, then the process of updating the cache will be the same. Only the process is different as mentioned above, because there is no original value, so those who did not acquire the lock must wait for the lock before they can get the result.

Solution Overview

Before we get into the details of the implementation, let's look at the actual practice.

def read_aside_cached(ttl, lock_period, race_period):
  def decorator(func):
    def wrap(*args, **kw):
      key = f"{func.__name__}_{args}_{kw}"
      return cache_factory(key, ttl, lock_period, race_period).handle(func, *args, **kw)

    return wrap
  return decorator

@read_aside_cached(60 * 5, 30, 60)
def foo(a, b=1, c=2):
  return db.query(a, b, c)
Enter fullscreen mode Exit fullscreen mode

This is an example in Python where we use a decorator to encapsulate an actual database operation.

This decorator requires several parameters.

  1. ttl, this is easy to understand, it is the expiry time of this cache.
  2. lock_period, because we need to acquire the lock, so this parameter determines how long we have to lock.
  3. race_period, this parameter is used to determine how long we want to extend the cache.

In the above example, foo has a cache expiry time of 5 minutes and retains a 1 minute buffer. The lock time is 30 seconds, which is related to the expected time of the database operation.

Solution Details

Next, let's break down the actual details of the flowchart.

def cache_factory(key, ttl, lock_period, race_period):
  value, expired_at = Store.get(key)

  if expired_at is not None:
    handler = ExistedCacheHandler(key, ttl, lock_period, race_period)
  else:
    handler = NonExistedCacheHandler(key, ttl, lock_period, race_period)

  handler.set_meta(value, expired_at)
  return handler
Enter fullscreen mode Exit fullscreen mode

At the beginning of the flowchart, we need to try to get a cache first and use the results to see if we need to extend the cache time.

The top and bottom paths of the flowchart are encapsulated by each class. Let's look at the implementation of ExistedCacheHandler first.

class ExistedCacheHandler(BaseCacheHandler):
  def handle(self, func, *args, **kw):
    if self.now > self.expired_at and Store.try_lock(self.key, self.lock_period):
      result = func(*args, **kw)
      Store.set(self.key, result, self.ttl + self.race_period)
      Store.unlock(self.key)
      return result

    return self.orig_val
Enter fullscreen mode Exit fullscreen mode

If a cache has expired and successfully acquires a lock, it is responsible for updating the cache.

In the previous article, we introduced the Rails approach, where Rails writes the original value back to the cache again and extends the valid time slightly. But here, we directly let the cache time be (ttl + race_period), so we don't need to extend the cache time manually.

On the contrary, if the cache has not expired or has not been locked, then the original result in the cache is used.

On the other hand, the logic of cache not exist is more complicated.

class NonExistedCacheHandler(BaseCacheHandler):
  def handle(self, func, *args, **kw):
    while self.expired_at is None:
      if Store.try_lock(self.key, self.lock_period):
        result = func(*args, **kw)
        Store.set(self.key, result, self.ttl + self.race_period)
        Store.unlock(self.key)
        return result
      else:
        while not Store.try_lock(self.key, self.lock_period):
          time.sleep(0.01)
          self.orig_val, self.expired_at = Store.get(self.key)

        Store.unlock(self.key)
    else:
      return self.orig_val
Enter fullscreen mode Exit fullscreen mode

When a cache is found to be non-existed, we still have to acquire the lock in order to update the cache. But if the lock is not acquired successfully, we must wait, either for the lock or for the cache to be updated.

Why should we wait for either of these two conditions?

The reason is the person who acquired the lock may not have released the lock for "some reason". Our ultimate goal is to get the cache result, so even if we don't get the lock, we still get the result. Of course, if the lock is successfully acquired, the responsibility of updating the cache will be taken up.

Finally, let's look at two common components.

class Store:
  @staticmethod
  def get(k):
    value = redis.get(k)
    expired_at = redis.pttl(k) / 1000 + time.time() if value is not None else None
    return value, expired_at

  @staticmethod
  def set(k, v, ttl):
    return redis.set(k, v, "EX", ttl)

  @staticmethod
  def try_lock(k, lock_period):
    r = redis.set(k, 1, "NX", "EX", lock_period)
    return r == "OK"

  @staticmethod
  def unlock(k):
    redis.del(k)

class BaseCacheHandler:
  def __init__(self, key, ttl, lock_period, race_period):
    self.key = key
    self.ttl = ttl
    self.lock_period = lock_period
    self.race_period = race_period

  def set_meta(self, value, expired_at):
    self.orig_val = value
    self.expired_at = expired_at
Enter fullscreen mode Exit fullscreen mode

The BaseCacheHandler defines the constructors and a helper function.

Store is the core of the whole implementation, and I use Redis as a demonstration.

  • get(): In addition to getting the cache value, we also need to get the expiry time of the cache.
  • set(): Write the value and also set the expiry time.
  • try_lock(): Use Redis' atomic update to lock with NX.
  • unlock(): simply removes the key.

By assembling all these pieces, the cache decorator is complete, not only with the ability to extend the cache time but also with exclusive lock support.

Conclusion

This is a workable example, and we have arranged it in a more intuitive way to make it easier to understand. However, there are some things that could be refined.

For example, many places currently use a single command to operate Redis directly, and it would be better to write it in Redis pipeline. Moreover, it would be a good idea to write some simple logic as a script in Lua.

I have to say such an implementation is actually very complex, but does a read-aside cache really need to do that?

It depends on the load of the application and what we expect from the application.

If the backend system is strong and can handle a sudden spike, then a regular extended cache time can work. But if the backend is weak, it is necessary to consider a more solid approach.

Enhancing the caching mechanism is one option, but enhancing the backend system is also an option. There are several common ways to enhance the availability of the backend system.

  1. circuit breaker pattern
  2. service degradation
  3. multi-layer caching

This article provides an option to enhance the cache, without the need to deploy new components and just modify the logic, in my opinion, it is still worth.

Top comments (0)

Become a Moderator Do you want us to help make DEV a better place?

Fill out this survey and help us by becoming a tag moderator here at DEV.