Last month at DEV we made the switch from Memcache to Redis. This post covers the why and the how, plus a few gotchas to watch out for so you can get the most out of whatever cache solution you use.
Why We Switched
Memcache is a bit of a black box when it comes to caching. It is very hard to tell what is in it and how it is being used. When I first joined DEV I was greeted with a 75GB Memcache instance which we were paying a lot of money for. A 75GB cache seemed a bit big given our usage so I immediately started trying to figure out how I could view its contents. After a couple of hours of Googling, I figured out that unless you know the cached key you can't really see what is in Memcache. Not having visibility into your datastore is an SRE nightmare and it makes you feel pretty helpless and incapable of doing your job.
In addition to poor visibility, Memcache doesn't really play nice with Rails. To use it, you have to bring in the Dalli gem, which I recently learned overrides a bunch of core Rails cache methods to make Memcache work. It's never great when you need an additional gem just to get your cache solution working; it adds more complexity to the mix.
Finally, Memcache supports a very limited number of data structures. This is not a huge deal at first when all you want are simple key-value pairs like strings and numbers. However, as you grow, you may want more options for storing data and that's where Redis shines. It offers many different data structures that you can use to store data more efficiently.
For all the above reasons, we chose to replace Memcache with Redis for our application cache. In addition to everything I mentioned above, Redis also has nice features such as asynchronous deletion and Lua scripts to handle complicated logic. Plus, it allows us to use a faster background worker service such as Sidekiq, which we are in the process of migrating to from DelayedJob.
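For a concrete sense of the difference, here is a small hypothetical sketch using the redis gem (the key names are made up for illustration): Memcache values are essentially opaque strings, while Redis can hold richer structures like sorted sets, and commands such as UNLINK delete keys without blocking.

```ruby
require "redis"

redis = Redis.new(url: ENV.fetch("REDIS_URL", "redis://localhost:6379"))

# A sorted set keeps a ranked counter for us; no serialize/deserialize dance needed.
redis.zincrby("article-reactions", 1, "article:123")
redis.zrevrange("article-reactions", 0, 9) # top 10 entries, straight from Redis

# Asynchronous deletion: UNLINK reclaims a big key in the background instead of
# blocking the server the way DEL can.
redis.unlink("some-huge-cached-fragment")
```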
Once we decided that we wanted to make the switch, the next question was how to go about doing it.
Strategy For Moving Keys
To move to Redis, we decided to go slowly, rolling keys over one or a few at a time. The point of this was to ensure that a large cold cache didn't slow things down. It also let us look at everything we were caching and evaluate whether we really needed to cache it.
The way we did this was by creating a new Rails cache client using Redis and calling it RedisRailsCache. This new client behaved just like Rails.cache, but instead of pointing to Memcache, it pointed to Redis.
```ruby
RedisRailsCache = ActiveSupport::Cache::RedisCacheStore.new(url: redis_url, expires_in: DEFAULT_EXPIRATION)
```
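With that client in place, rolling over an individual key was mostly a matter of swapping the call site. A hypothetical before/after (the "tag-list" key and the block here are made up for illustration):

```ruby
# Before: served from Memcache through the default Rails cache
Rails.cache.fetch("tag-list", expires_in: 1.hour) { Tag.pluck(:name) }

# After: the same lookup, pointed at Redis through the new client
RedisRailsCache.fetch("tag-list", expires_in: 1.hour) { Tag.pluck(:name) }
```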
Rolling over individual keys went smoothly and we really didn't have any surprises or slowdowns along the way. Then came the dilemma of rolling over the Rails fragment caches. Initially, the plan was to do it the same way we did the Rails.cache keys: by replacing the fragment calls one by one with a client of some sort that pointed to Redis.
One way we attempted to do this was by setting up a dual Rails cache as our Rails cache_store. I was pretty darn proud of the DualRailsStore that I had created. However, when we pushed it out to production, we learned that the Memcache Dalli gem was NOT storing things the way the Rails cache store expected, which caused a lot of things to break. We quickly reverted the PR and went back to the drawing board.
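For illustration, the general shape of a dual store is something like the sketch below: read from Redis first, fall back to Memcache, and write through to both so the Redis cache warms up while Memcache keeps serving existing keys. This is not DEV's actual DualRailsStore (which also had to satisfy the full Rails cache-store interface); the class and method names are made up.

```ruby
# Rough sketch only: composes two existing ActiveSupport cache stores via their
# public API rather than implementing a full ActiveSupport::Cache::Store subclass.
class DualCache
  def initialize(redis_store, memcache_store)
    @redis = redis_store
    @memcache = memcache_store
  end

  def fetch(key, options = {}, &block)
    @redis.fetch(key, options) do
      # Cold in Redis: try Memcache before paying the cost of regenerating the value.
      @memcache.read(key) || block.call
    end
  end

  def write(key, value, options = {})
    @redis.write(key, value, options)
    @memcache.write(key, value, options)
  end

  def delete(key, options = {})
    @redis.delete(key, options)
    @memcache.delete(key, options)
  end
end
```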
All or Nothing
After looking at all of our options and evaluating their risk I decided to throw out a crazy idea in Slack:
Molly: "Because of the dalli_store gem, the way we cache things in Memcache is completely different from how Rails cache's things which is why trying to do them in parallel is such a PITA. SO I have another crazy idea: We rely a lot on Fastly for caching, so I'm wondering how bad do we think it would be to just flip to Redis? Could we schedule an hour or 2 of "maintenance"(basically expected degraded performance time) and just do it?"
This was my first time really throwing out a risky idea like this at DEV and I thought for sure people were going to slam it down and respond with "No way, you are crazy!" But instead, this is the response I got:
Ben: "That’s a possibility. I definitely have ideas about how to make it smoother."
My reaction IRL 👇 "Oh thank the lord they don't think I'm crazy!"
After that, I figured we could schedule it for a weekend or night when things slowed down, but to my surprise, Ben asked if I wanted to do it that afternoon. Of course, I replied "YEAH!" I would much rather rip the bandaid off than wait in anticipation.
Flipping the Switch
In order to flip to Redis, we made a few changes to the app to help lighten the heavy additional load we were expecting:
- Added additional dynos (servers) to help with the load
- Commented out the cache busting for the home page and tag pages, which could be served a bit more stale
- Cut off traffic to the “additional content boxes” and “social previews” controllers via Fastly, since those were heavily cached and we could live without them for a little bit
- Temporarily removed the “additional articles” sections below posts and in the sidebar
Once all of that was done, we added an ENV variable that we could use to flip the cache store over to Redis. When everything was in place, we set the variable, then waited and watched. Ben and I were both expecting Redis traffic to spike and the load to slam our servers. We braced for the worst. But the worst never came.
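The flip itself boils down to a conditional in the cache store configuration, roughly along these lines (a hedged sketch: the variable names and TTL are assumptions, not DEV's exact config):

```ruby
# config/environments/production.rb -- illustrative only
if ENV["USE_REDIS_CACHE"].present?
  config.cache_store = :redis_cache_store, { url: ENV["REDIS_URL"], expires_in: 24.hours }
else
  config.cache_store = :dalli_store, ENV["MEMCACHE_SERVERS"]
end
```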
At one point we thought the ENV variable might not be working, so I jumped into Redis to see if fragment keys were coming in. Sure enough, they were! Redis traffic jumped a little, but then the number of keys slowly and steadily climbed as people moved around the site. There were no giant spikes or dramatic slowdowns. Honestly, none of our alarms even went off. Ben and I watched and waited for almost 2 hours before we could convince ourselves that things were probably going to be fine, and sure enough, they were.
I had our Redis storage cranked up to 25 GB because, after having a 75 GB Memcache, I had no idea what to expect. It turns out there was a lot of cruft in Memcache: in Redis, the size of our cached data currently sits around 3 GB.
Caching Gotchas
During this whole process, I did A LOT of looking at and evaluating caches across DEV's platform, and I made a lot of changes even before making the switch to Redis. Here are a couple of common problems I saw that you should watch out for when using any sort of cache store like Redis or Memcache.
Creating an Unnecessary External Request
Redis and Memcache are both super fast which is why they are great to work with. However, you are still making an external request from your server when you are talking to them and that external request takes time.
A few times I saw chunks of HTML that were being cached, but those chunks weren't making any external requests when they rendered. In other words, rather than building the HTML in memory on the server, we were making an external request to get it from Redis. Redis may be fast, but it can't beat the speed of local memory. If a chunk of code is not making any external requests, don't add one by unnecessarily caching it. You will only slow your app down in the process. Here is one example of a place we removed caching completely.
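Here is a hypothetical illustration of the pattern (the helper and key names are made up): if nothing inside the block is slow, the Redis round trip is pure overhead.

```ruby
# Low-value caching: building this greeting touches no database or external API,
# so fetching it from Redis only adds a network hop.
def sidebar_greeting(user)
  Rails.cache.fetch("sidebar-greeting-#{user.id}", expires_in: 1.hour) do
    "Welcome back, #{user.name}!" # pure in-memory string work; nothing slow to save
  end
end

# Better: just build it in process. Local memory beats an external cache lookup.
def sidebar_greeting(user)
  "Welcome back, #{user.name}!"
end
```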
Not Expiring Keys
Especially when you are starting out with an application, it can be really easy to shove stuff into an external cache without an expiration. This can easily get out of hand and lead to things like a 75 GB cache full of junk you don't actually need.
When you set up your cache store client in your code make sure you set a default expiration. Every time a key goes in, it needs to have an expiration set. You can try to rely on devs to ensure they are set, but in my experience, we are all pretty unreliable. The easiest thing to do is have the code set it by default whenever an expiration is not explicitly set. That way nothing slips through the cracks.
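As a quick sketch of what that buys you (the key names and blocks here are hypothetical): because the store was constructed with a default expires_in, plain writes still expire, and individual calls can override the default when they need to.

```ruby
# Picks up DEFAULT_EXPIRATION from the store; no one has to remember to set it.
RedisRailsCache.fetch("tag-list") { Tag.pluck(:name) }

# A hotter key can still override the default explicitly.
RedisRailsCache.fetch("home-feed-html", expires_in: 5.minutes) { render_home_feed }
```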
Low Cache ROI
Adding a cache adds complexity and overhead. You now have to make sure that the key is expired when necessary; otherwise you end up showing users outdated information. You also have another variable to take into consideration for testing and development: Do you cache things in those environments? Do you stub the cache? And so on. Because of that added complexity, you want to make sure you are actually getting benefits from your cache when you use it.
The purpose of a cache is to make things fast by taking data that normally takes a long time to retrieve, remembering it, and serving it without the overhead of fetching it from its original source. However, sometimes the cache isn't worth it.
One scenario is caching an already very fast request. For example, if you are caching a very fast User.find() call, you need to consider whether the small increase in speed from the cache is worth having to bust that key anytime the user changes.
Another scenario where a cache might not be ideal is if you are not requesting the data a lot. Let's say you cache a page view for a user so if the user reloads the page everything is lightning fast. That's awesome if the user is going to sit there and reload the page over and over. But if they look at the page and move on, then the cache is not doing you any good.
Finding these scenarios can be tricky, but one way to do it is by looking at your hit rate. Most cache databases will tell you your hit rate as a percentage, such as 75%. That means 75% of the time a key is requested, the cache has it. The higher the hit rate, the more use your cache is getting and the more it is speeding things up.
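Redis, for example, exposes the raw numbers behind this as keyspace_hits and keyspace_misses in INFO. A small sketch of computing the rate with the redis gem (connection details assumed):

```ruby
require "redis"

redis = Redis.new(url: ENV.fetch("REDIS_URL", "redis://localhost:6379"))
stats = redis.info("stats")

hits = stats["keyspace_hits"].to_f
misses = stats["keyspace_misses"].to_f
total = hits + misses
puts total.zero? ? "No lookups yet" : "Hit rate: #{(hits / total * 100).round(1)}%"
```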
Too Many Unique Keys
This kind of relates to the ROI scenario I talked about above. One thing to watch out for is too many unique keys. Let me explain what I mean. Say you have this chunk of code for caching:
Rails.cache.fetch("user-follow-count-#{id}-#{updated_at.rfc3339}", expires_in: 1.hour) do
followers.count
end
That code caches the number of followers a user has, and we have added the updated_at timestamp to help ensure that it doesn't go stale if a user is updated. However, there are a lot of ways a user can be updated without its follower count changing. For example, we could update the name or email or many other things, and each time we would be creating a new cache key here while the follower count remains unchanged.
A better way to do this would be to name our cache key with something more specific, like the last_followed_at timestamp:
Rails.cache.fetch("user-follow-count-#{id}-#{last_followed_at.rfc3339}", expires_in: 1.hour) do
followers.count
end
This way the cache key will only be reset if the user is followed, which is appropriate since that means the follower count will have changed. Be on the lookout for situations like these where you might be using an expiration method like updated_at that is too aggressive and lowers the effectiveness of your cache. Ideally, the cache key should only change when the value of the cache changes.
Happy Caching!
Hopefully you found this post educational, and our experience switching cache stores at DEV helps you decide whether a switch is right for you. If you want more details or would like to see all the PRs that it took to make this transition, check out this GitHub issue. Let me know if you have any questions or if any part of the post is not clear!
Top comments (18)
Thanks for the detailed migration story and also the don't list, very useful!
A few comments about something that triggered my curiosity:
As a developer, it has been a long time since I used any caching directly. Back in 2013 we were using Oracle Coherence (closed proprietary software, but I did not have any choice) with Java (although I do not think that is relevant for this discussion), and the keys of the cache entries would not change when the values changed.
I.e., reusing your example, the cache key would be just:
"user-follow-count-#{id}"
And then every time one more follower had to be added for a user, the application code would increment the value of the corresponding cache entry (and also in the database in a transactional fashion), but the key of the cache entry would remain the same.
In your example, I see that whenever the value of the cache entry changes, the key also changes since the timestamp is part of the key.
You said that the updated_at timestamp is present in the key "to help ensure that it doesn't get stale if a user is updated", but at first look it would seem neater if the timestamp were part of the value (object) instead of being part of the key, and it could serve the same purpose. Also, having the timestamp included in the key means that the user id alone is not enough to construct the desired key, which I think adds complexity to the client code.
How does the application code retrieve a specific cache entry value?
Does it use a wildcard for the timestamp part of the key?
To sum up, what are the advantages of having mutable keys when compared with the immutable cache keys which I have described above?
Thanks in advance!
The advantage to having the keys change is that you never need to worry about updating or deleting them, which can add a lot of code complexity. Instead, if I have a user with an id and a last_followed_at timestamp that I use to store follower_count, then any time that last_followed_at timestamp changes (i.e., a follower is added or removed) my cache request will create a new key to store the new count. The old key will simply expire. Now, every time I request that key until it changes again, I use the id and last_followed_at timestamp, and the cache will return the correct count.
If I do not use the last_followed_at timestamp, then every time the follower count changes I have to add additional code to delete the old cache key. By using the timestamp, this code is not needed.
Ok, now I fully understand it.
So the entries in your cache are actually immutable (both the key and the value never change) and whenever you need to store a new value (i.e., increase the number of followers for a specific user), you just create a new cache entry with the new key (the timestamp being the part which is different from the previous entry key) and also with the new value.
BTW, I previously missed the fact that last_followed_at is an instance field of the User object.
And as you explained, this way the application code does not bother to delete the outdated cache entries because they will be purged by Redis automatically at the expiration time.
The only drawback I see with this approach is that you are keeping outdated entries in the cache for longer than strictly needed (until expiration time) but you also explained that data space is not a constraint in your case so far, hence, it is a fair compromise in order to remove complexity from the application code.
Everything makes sense now, thank you!
WOOT! Glad I was able to explain it better!
To be clear, we do this for a lot of really simple keys, but in the future, for any keys that are very large, we would likely remove them as soon as they become invalid rather than letting them hang around. Or, as you said, start removing them more aggressively if cache size becomes a problem.
I am still confused by the reasoning you have provided, Molly. I may be missing something?
Redis inserts/updates a key using the SET command. It automatically overwrites the value of a key if the same key is provided again. So you do not have to worry about writing new code to update: wherever you are currently saving to Redis with a new key when the follower count changes, you could just save the follower count under the same old key?
With your current approach you first need to query for the "last_followed_at" value, and only then can you query the user-follow-count? (The "last_followed_at" value may be part of the User object which has already been retrieved, but you are still looking it up, yes?) But if you have the same key which is always updated, it is never stale.
And as you mention in your follow-up comment, you'll implement deletion for larger keys as they become invalid, but then aren't you introducing that same code complexity you were trying to avoid? (Though as per my understanding stated above, I don't believe there is code complexity to be added.)
I guess Victor's initial question remains unclear to me, "what are the advantages of having mutable keys when compared with the immutable cache keys"?
The real advantage here is that Rails has this handy fetch method which will, by default, look for a key: if it is there, it returns it; if it is not, it sets it. This means we can do ALL the work we need to with this key in a single fetch block rather than having to set up a set AND a del command.
Ever written anything on "caching for beginners"?
I have written multiple posts on caching but none that are tailored specifically for beginners. What kinds of things do you think would be valuable to cover? I might add it to my list of things to write about 😊
How did you learn about caching?
I learned about caching by reading about it and then using it on the job.
Makes sense. It's one of those things that's tough for beginners I think because it doesn't make a ton of sense on small personal projects. Sort of like testing.
You seem to know a lot about it, so if you're able to channel your inner-beginner and reflect on things you wish you knew or were explained better way back when you first learned it and turn that into an article... I think that would make an excellent post!
A little while ago I started a Level Up Your Ruby Skillz series and I just started a post that will make caching the next topic covered 😎
Nice guide, really happy to see how it worked! Also, a nice tip - if you use Redis for ActiveJob as well, it's good to keep namespaces separate for cache and jobs. Otherwise, while cleaning cache, you can drop enqueued jobs - oops! 🤷♂️
Nice, I really love the war story on it and what to look out for when migrating.
Oh well, thank you! We are currently looking into caching db requests and testing Redis atm. Have you tested Gunjs yet? Wondering what the differences, use cases, and pros/cons are between it and Redis. It seems faster and better developed from what I can see in the GitHub repo. Maybe you'll write a next post about that? :) Would love to read that as well!
We did not test that and will be sticking with Redis for the foreseeable future. But if you end up testing Gunjs I would love to know how it compares!
Very nice!! I am just curious, Molly: did you have a rough idea of how much the flip could save before you made the suggestion?
Based on what I saw being cached in the app, 75 GB seemed absurd. My guess was that we would get down to between 5 and 10 GB, but we ended up even lower than that, which made me pretty happy. That guess was based on what I was seeing being cached and the fact that at my previous company we were caching way more data and our cache size was between 10 and 15 GB.