Before the introduction, let me describe what resilience is.
Distributed systems will fail. A resilient software system does not try to avoid failure; it expects failure and responds gracefully.
We all know data in a cache will sooner or later be lost. This can happen for many reasons, even when the write operation succeeded. Let's take Redis as an example.
- The Redis server crashed: of course the data was lost, because Redis keeps all data in memory.
- A Redis replica went through a network partition and then recovered.
- Even without any hardware or software failure, Redis ran out of memory and started evicting existing entries.
- If you are using AWS, the upgrade process will also drop all entries.
- And so on.
As we can see, a cache is not reliable, and it is hard to ensure data consistency with one. However, some tasks that leverage a cache to improve performance do require reliability.
We talked about cardinality counting in my previous article, and it is a great example of the importance of data persistence. If the collections in Redis can no longer be trusted, how can we design a large system? The solution is to make your cache more resilient.
Following my earlier example of health status, we want to make sure our sensor network works properly; therefore, we record the activity of API calls in Redis. To simplify the explanation, I will use a bitmap with daily granularity in this section.
The data in the Redis bitmap must answer one question: how many days did the sensor work this month?
We can record this with the following statements:

```
SETBIT sensorA_jan 1 1
SETBIT sensorA_jan 2 1
SETBIT sensorA_jan 3 1
```
These statements show that sensor A worked on 1/1, 1/2, and 1/3, and we can get the answer from `BITCOUNT sensorA_jan`. What if Redis crashed on 1/2? The bitcount would become 2, missing the record of 1/1. In that case, the bitcount is distorted.
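To see the distortion concretely, here is a minimal pure-Python sketch of the crash scenario above. No real Redis is involved; the bitmap is simulated with a dict, and `setbit`/`bitcount` mirror the semantics of the SETBIT/BITCOUNT commands:

```python
# Simulated Redis bitmap: a dict mapping bit offset -> bit value.
def setbit(bitmap, offset, value):
    bitmap[offset] = value

def bitcount(bitmap):
    return sum(bitmap.values())

bitmap = {}
setbit(bitmap, 1, 1)   # sensor A worked on 1/1

bitmap = {}            # Redis crashes on 1/2: all in-memory data is gone

setbit(bitmap, 2, 1)   # recording continues on 1/2 and 1/3
setbit(bitmap, 3, 1)

print(bitcount(bitmap))  # 2 -- the record of 1/1 is lost, so the count is distorted
```

The caller has no way to tell that 2 is wrong: the bitmap after the crash looks just as valid as a complete one.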
To tackle this issue, we can add an integrity bit to the collection, say `SETBIT sensorA_jan 0 1`, set whenever the bitmap is known to be complete. When we want to retrieve the bitcount, we check the integrity bit first with `GETBIT sensorA_jan 0`. If the result is 1, the bitcount is trustworthy (note that the integrity bit itself contributes 1 to the count, so subtract it when reporting); otherwise, we have to rebuild the data from durable storage such as a database.
Since we have to rebuild the result from the database anyway, why not do that in the first place? The reason is quite simple: we want to keep the system performant. We ensure the reporting throughput by using the cache, while expecting it to fail someday. Nevertheless, we can still get the correct result through some heavy aggregation operations on the database.
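To illustrate what that heavy aggregation might look like, here is a sketch using an in-memory SQLite database. The `api_calls` table and its schema are assumptions for illustration; the idea is simply that every successful API call is also logged durably, so the count can always be recomputed, just more slowly than a BITCOUNT:

```python
import sqlite3

# Hypothetical durable store: each successful API call is logged with the
# sensor id and the day of the month it arrived on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE api_calls (sensor_id TEXT, day INTEGER)")
conn.executemany(
    "INSERT INTO api_calls VALUES (?, ?)",
    [("sensorA", 1), ("sensorA", 1), ("sensorA", 2), ("sensorA", 3)],
)

# The heavy aggregation: count the distinct days the sensor was active.
(days,) = conn.execute(
    "SELECT COUNT(DISTINCT day) FROM api_calls WHERE sensor_id = ?",
    ("sensorA",),
).fetchone()
print(days)  # 3
```

A query like this scans and deduplicates rows, which is far more expensive than reading a counter from Redis, and that cost is exactly why the cache sits in front of it.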
This is what resilience looks like in practice, and this article has provided a straightforward example of how to accomplish cardinality counting resiliently.