If you follow me on Twitter then you might have caught on that one of my recent projects has been migrating all of our background processing jobs at Kenna from Resque to Sidekiq. I have been asked more than a few times why we chose to do that, so I decided to write a post about it!
A year ago at Kenna, we had our background processing jobs split between Sidekiq and Resque. We had our long-running jobs in Resque, which could handle rolling restarts every time we deployed, so we didn't have to worry about those jobs getting killed. We had our fast, high-volume jobs in Sidekiq since it was better and faster at processing small, quick-running jobs.
Then Sidekiq Enterprise 1.7.0 happened. Sidekiq Enterprise 1.7 was released in January 2018 and came with support for Rolling Restarts which caught our attention. That's when we started wondering if we could use Sidekiq to process all of our jobs instead of just some of them.
When it comes to queueing and running jobs, Sidekiq is faster than Resque because it is multi-threaded.
Resque is a process-based background job framework, which means it boots up a copy of your application code for every one of its worker processes. This uses up a lot of memory resources and can be very slow every time you have to boot up a new process. Sidekiq, however, is thread-based. This means Sidekiq will boot up your application code once and then use multiple threads to process multiple jobs.
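To make the thread-based model concrete, here is a minimal sketch in plain Ruby. It is not Sidekiq's actual implementation; the method name `process_with_threads` and the toy jobs are made up for illustration. The point is that the code is loaded once and several threads pull jobs off one shared queue.

```ruby
# Thread-based processing sketch: one process, several threads, one shared queue.
# Plain Ruby only; names here are hypothetical and this is not how Sidekiq is written.

def process_with_threads(jobs, concurrency: 5)
  queue = Queue.new
  jobs.each { |job| queue << job }
  queue.close # once drained, pop returns nil instead of blocking

  results = Queue.new
  workers = Array.new(concurrency) do
    Thread.new do
      # Every thread reuses the code and memory this process already loaded.
      while (job = queue.pop)
        results << job.call
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }
end

# Twenty toy jobs handled by five threads inside a single process.
jobs = (1..20).map { |n| -> { n * 2 } }
doubled = process_with_threads(jobs, concurrency: 5)
puts doubled.sort.inspect
```

A process-based framework would instead boot a full copy of the application for each of those five workers, which is where the extra memory and boot time come from.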
I really wish I could share with you exactly how much faster our jobs run now, but I have yet to really dig into those stats because in general everything "feels" faster (I know that's horrible to say as an SRE, but sometimes knowing the numbers isn't worth the extra time when you just know in your gut it's better). What I can tell you is that our fleet of 75 Sidekiq worker processes can handle all of their original jobs PLUS all of the jobs that were being handled by 100 Resque worker processes. Needless to say, moving everything to Sidekiq has saved us a bit of money.
Sidekiq and its plugins are much newer and better maintained than Resque. They are also tuned for larger job volume and perform better for our use case. We have had to patch Resque plugins multiple times because the Redis commands they were using were grossly inefficient at scale.
To put it into perspective for you, when I looked into upgrading to Resque 2.0 I found that 4 of the plugins we were using were currently unmaintained and only supported Resque 1.2. If we had stuck with Resque, we would have had to fork and update all of those plugins ourselves.
Sidekiq Enterprise alone has many features that we have had to include multiple Resque plugins to get. Simple things like configurable retry logic are huge for us and Sidekiq supports that right out of the box. Another Sidekiq Enterprise feature is unique jobs. In order to get that in Resque, we had to install a pretty beefy plugin.
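For anyone unfamiliar with the idea behind unique jobs, here is a rough sketch in plain Ruby. It assumes an in-memory lock keyed on the job class plus its arguments; the class `UniqueEnqueuer` and the job name `ScanImportJob` are hypothetical, and a real implementation would keep the lock in Redis with an expiry rather than in a `Set`.

```ruby
require "digest"
require "json"
require "set"

# Hypothetical in-memory stand-in for a unique-job lock (assumption: no Redis, no expiry).
class UniqueEnqueuer
  attr_reader :queue

  def initialize
    @locks = Set.new
    @queue = []
  end

  # Returns true if the job was enqueued, false if an identical job is already pending.
  def enqueue(job_class, *args)
    key = digest(job_class, args)
    return false unless @locks.add?(key) # add? returns nil when the key already exists

    @queue << { class: job_class, args: args, key: key }
    true
  end

  # Release the lock once the job has run so the same job can be enqueued again.
  def complete(job)
    @locks.delete(job[:key])
  end

  private

  def digest(job_class, args)
    Digest::SHA256.hexdigest(JSON.generate([job_class, args]))
  end
end

enqueuer = UniqueEnqueuer.new
first     = enqueuer.enqueue("ScanImportJob", 42)
duplicate = enqueuer.enqueue("ScanImportJob", 42) # same class and args, so it is skipped
other     = enqueuer.enqueue("ScanImportJob", 43)
puts [first, duplicate, other].inspect
```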
Having the ability to give queues priority over one another without having to dedicate workers to specific queues is also big for us. We often had many Resque workers doing nothing all day because they could only pull from a single queue that was heavily used overnight.
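One way to picture weighted queues is a weighted shuffle: every worker can pull from every queue, but higher-weight queues get checked first more often. The sketch below is plain Ruby under that assumption; the queue names, the weights, and the `polling_order` helper are all invented for illustration.

```ruby
# Hypothetical queue names and weights; a higher weight means "checked first more often".
QUEUE_WEIGHTS = { "critical" => 6, "default" => 3, "overnight_reports" => 1 }.freeze

# Returns every queue exactly once, in a weighted-random order.
def polling_order(weights, random: Random.new)
  # Repeat each queue name once per unit of weight, shuffle, then drop the repeats.
  expanded = weights.flat_map { |name, weight| [name] * weight }
  expanded.shuffle(random: random).uniq
end

order = polling_order(QUEUE_WEIGHTS)
puts order.inspect

# Over many polls the heavy queue comes first most often, but no queue is starved
# and no worker sits idle waiting on a single quiet queue.
first_counts = Hash.new(0)
10_000.times { first_counts[polling_order(QUEUE_WEIGHTS).first] += 1 }
puts first_counts.inspect
```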
Another big win that Sidekiq Enterprise got us was periodic jobs. To schedule jobs in Resque, you had to install another gem AND you had to start up a separate process to run the scheduler. Sidekiq's scheduler runs on a worker so your application does not need to keep track of another process.
Batches are another feature we got from Sidekiq that allow us to handle some of our more complicated job workflows. To handle those before in Resque, I kid you not, we had loops running that would inefficiently check if a job was still running and wait until it finished. We also had jobs that would re-enqueue themselves over and over again waiting for another job to complete. Removing all of those sleep statements and loops was really exciting when we were making the switch.
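The difference between polling and a batch callback can be sketched in a few lines of plain Ruby. This is only an illustration of the idea, not the Sidekiq batch API: `TinyBatch` is a made-up class that counts down as jobs finish and fires a callback when the last one completes, so nothing has to sleep and re-check.

```ruby
# Hypothetical in-memory batch: runs jobs in threads and fires a callback
# exactly once, when the last job finishes. No sleep loops, no re-enqueueing.
class TinyBatch
  def initialize(jobs, &on_complete)
    @jobs = jobs
    @on_complete = on_complete
    @pending = jobs.size
    @mutex = Mutex.new
  end

  def run
    threads = @jobs.map do |job|
      Thread.new do
        job.call
        job_finished
      end
    end
    # The join only exists so this demo script waits before printing.
    threads.each(&:join)
  end

  private

  def job_finished
    # Only the thread that brings the counter to zero triggers the callback.
    last = @mutex.synchronize { (@pending -= 1).zero? }
    @on_complete.call if last
  end
end

processed = Queue.new
callback_runs = 0
jobs = (1..5).map { |n| -> { processed << n } }

TinyBatch.new(jobs) { callback_runs += 1 }.run
puts "jobs processed: #{processed.size}, callback runs: #{callback_runs}"
```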
Sidekiq plugins also have features that we couldn't find in Resque or in any of its plugins. Throttling was one of those features. For example, the ability to throttle across all jobs based on arguments like client ID is not something we found a solution for in Resque because of the way Resque handles its queues.
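As a rough picture of what throttling on an argument means, here is a plain Ruby sketch of a concurrency limit keyed on client ID. It is an assumption-heavy toy: `ClientThrottle` and the client IDs are hypothetical, the counters live in memory instead of Redis, and a real plugin would requeue the job rather than just return `false`.

```ruby
# Hypothetical per-client concurrency limit (assumption: in-memory, single process).
class ClientThrottle
  def initialize(limit)
    @limit = limit
    @running = Hash.new(0)
    @mutex = Mutex.new
  end

  # Runs the block and returns true if the client is under its limit.
  # Returns false without running it otherwise, so the caller can retry later.
  def try_run(client_id)
    return false unless acquire(client_id)

    begin
      yield
      true
    ensure
      release(client_id)
    end
  end

  private

  def acquire(client_id)
    @mutex.synchronize do
      next false if @running[client_id] >= @limit

      @running[client_id] += 1
      true
    end
  end

  def release(client_id)
    @mutex.synchronize { @running[client_id] -= 1 }
  end
end

throttle = ClientThrottle.new(1)
same_client = nil
other_client = nil

outer = throttle.try_run("client-a") do
  # While client-a is busy, a second client-a job is held back...
  same_client = throttle.try_run("client-a") { :never_runs }
  # ...but a job for a different client goes straight through.
  other_client = throttle.try_run("client-b") { :runs }
end
afterwards = throttle.try_run("client-a") { :runs }

puts [outer, same_client, other_client, afterwards].inspect
```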
To get a good dashboard for Resque we had to include plugins like the resque-cleaner gem (yet another plugin that is not maintained) if we wanted any visibility into why jobs were failing. Right out of the box Sidekiq has a great UI that makes it easy to navigate through jobs and address issues when they arise right from the browser. Here is a peek at what the UI looks like:
Because the Sidekiq API is built out enough to power the UI, talking to Sidekiq from a Rails console is incredibly easy and way more robust and performant than Resque. Things like deleting every job of a specific class from a queue are easy. With Sidekiq Pro you also have the ability to pause queues, which can come in handy if something gets out of hand.
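Here is a small sketch of the "delete a specific class from a queue" case. The helper `delete_jobs_of_class` is hypothetical and only assumes it is handed something enumerable whose items respond to `klass` and `delete`. The `FakeJob` struct and the job names are stand-ins so the example runs without Redis.

```ruby
# Deletes every job of the given class and returns how many were removed.
# Assumption: `queue` is enumerable and each job responds to #klass and #delete.
def delete_jobs_of_class(queue, class_name)
  deleted = 0
  queue.each do |job|
    next unless job.klass == class_name

    job.delete
    deleted += 1
  end
  deleted
end

# Stand-in job record so the sketch runs without Sidekiq or Redis (hypothetical).
FakeJob = Struct.new(:klass, :deleted) do
  def delete
    self.deleted = true
  end
end

queue = [FakeJob.new("ReportJob"), FakeJob.new("RunawayJob"), FakeJob.new("RunawayJob")]
removed = delete_jobs_of_class(queue, "RunawayJob")
remaining = queue.reject(&:deleted).map(&:klass)
puts "removed #{removed}, remaining: #{remaining.inspect}"
```

In a Rails console with Sidekiq loaded, the same helper should work when called as `delete_jobs_of_class(Sidekiq::Queue.new("default"), "RunawayJob")`, since `Sidekiq::Queue` is enumerable and its job records expose `klass` and `delete`. The queue name there is just an example.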
Our operations team was also fully bought into making the switch from Resque to Sidekiq. Resque has no built-in memory limits, which meant that when jobs got out of hand they would take down whole servers with them. Eventually, our operations team used systemd resource limits to cap the memory of each process. Sidekiq Enterprise, on the other hand, gives us the ability to limit our memory usage for each process to ensure that none of them take down our servers.
Because Sidekiq is actively worked on and maintained, the support and response you get when you have an issue or a question is incredibly fast. Due to Resque plugins being poorly maintained or altogether inactive, your chance of finding help with any of them when you have a problem is not great.
Why have two tools when you can accomplish everything you want with one? Given that Sidekiq can now meet all of our demands, and does the job better for all of the above reasons, it seems like a no-brainer to consolidate. It also means that we now have only one background processing tool that devs need to be familiar with rather than two. This might seem small, but I can't tell you the number of times I have had to explain the difference between Sidekiq and Resque and help people figure out where to put a job. Now it is simple: we have one framework, and if you want to process something in the background you make a job in that framework.
This has also vastly improved the lives of our operations team because they no longer need to keep track of multiple different types of workers. We have one less thing we have to deploy and keep track of in all our different environments.
To actually make the switch took a few months because we did it slowly and methodically. We moved about 50 jobs from Resque to Sidekiq in small groups or one by one. Heavily used jobs were given some time to "bake" to ensure we had our queues balanced properly between all of our jobs. We had a few jobs that immediately OOM'ed when we first ran them on Sidekiq. This forced us to overhaul how those jobs used memory and, in the end, we were able to improve their performance dramatically.
Making the decision to switch from Resque to Sidekiq was one of the first big projects I spearheaded as an SRE. At the start of the project, I honestly was not 100% sure if this was going to be the right decision for us. We had gotten so used to Resque that to change it was a big undertaking. However, as I issued PR after PR to move jobs to Sidekiq, I started to feel really good about the decision. 90% of the PRs I issued were net code deletions. The code complexity was decreasing right before my eyes.
The switch, of course, did not come without its challenges. There were times when queues were balanced incorrectly and we had jobs back up. But even when those incidents occurred, the on-call devs who helped me handle and fix them were constantly complimenting how simple the Sidekiq API was to work with from a Rails console.
Looking back on the decision and the whole process, I am VERY happy to say we have no regrets. Just the other day we had to reboot some of our private cloud environments and operations was raving about how nice it was to only have to worry about one type of worker. For anyone running background jobs on Resque, hopefully this post has given you something to chew on if you are thinking about changing up your framework.
As always, let me know if anyone has any questions!