Switching From Resque to Sidekiq

Molly Struve (she/her) ・7 min read

If you follow me on Twitter then you might have caught on that one of my recent projects has been migrating all of our background processing jobs at Kenna from Resque to Sidekiq. I have been asked more than a few times why we chose to do that, so I decided to write a post about it!

1 Year Ago

A year ago at Kenna, our background processing jobs were split between Sidekiq and Resque. Our long-running jobs lived in Resque, which could handle rolling restarts every time we deployed, so we didn't have to worry about long-running jobs getting killed. Our fast, high-volume jobs lived in Sidekiq, since it was better and faster at processing small, quick-running jobs.

Then Sidekiq Enterprise 1.7.0 happened. It was released in January 2018 and came with support for rolling restarts, which caught our attention. That's when we started wondering if we could use Sidekiq to process all of our jobs instead of just some of them.

Why We Switched

Improved Processing Speed

Sidekiq queues and runs jobs faster than Resque because it is multi-threaded.

Resque is a process-based background job framework, which means it boots up a copy of your application code for every one of its worker processes. This uses a lot of memory, and booting up each new process can be very slow. Sidekiq, however, is thread-based. This means Sidekiq will boot up your application code once per process and then use multiple threads to process multiple jobs at the same time.
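
To make the difference concrete, here is a minimal plain-Ruby sketch of the thread-based model. It is not Sidekiq's code: `APP` is a hypothetical stand-in for a booted application, and `process_with_threads` is an illustrative name. The point is only that the application is loaded once and every worker thread shares that one copy.

```ruby
# Hypothetical stand-in for the application: booted once per process.
APP = { name: "my_app" }.freeze

# Drain `jobs` with `concurrency` threads that all share the single APP object.
def process_with_threads(jobs, concurrency: 5)
  queue = Queue.new
  jobs.each { |job| queue << job }
  queue.close # once drained, pop returns nil and the workers stop

  results = Queue.new
  workers = Array.new(concurrency) do
    Thread.new do
      while (job = queue.pop)
        # Each "job" here is just a number to square.
        results << { job: job, result: job * job, app_id: APP.object_id }
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }
end
```

A process-based framework would instead pay the boot cost, and the memory for the loaded application, once per worker process.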

I really wish I could share with you exactly how much faster our jobs run now, but I have yet to really dig into those stats because in general everything "feels" faster (I know that's horrible to say as an SRE, but sometimes knowing the numbers isn't worth the extra time when you just know in your gut it's better). What I can tell you is that our fleet of 75 Sidekiq worker processes can handle all of their original jobs PLUS all of the jobs that were being handled by 100 Resque worker processes. Needless to say, moving everything to Sidekiq has saved us a bit of money.

Maintainer and Plugin Support

Sidekiq and its plugins are much newer and better maintained than Resque. They are also tuned for larger job volume and perform better for our use case. We have had to patch Resque plugins multiple times because the Redis commands they were using were grossly inefficient at scale.

To put it into perspective for you, when I looked into upgrading to Resque 2.0 I found that 4 of the plugins we were using were currently unmaintained and only supported Resque 1.2. If we had stuck with Resque, we would have had to fork and update all of those plugins ourselves.

Better Features

Sidekiq Enterprise alone has many features that we have had to include multiple Resque plugins to get. Simple things like configurable retry logic are huge for us and Sidekiq supports that right out of the box. Another Sidekiq Enterprise feature is unique jobs. In order to get that in Resque, we had to install a pretty beefy plugin.
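
As a rough illustration of what configurable retry logic means, here is a plain-Ruby sketch. It is not how Sidekiq implements retries; in Sidekiq itself this is configured with `sidekiq_options retry:` and `sidekiq_retry_in`. The helper name `run_with_retries`, the `sleeper` parameter, and the jitter-free backoff formula are assumptions made for this example.

```ruby
# Assumed backoff: exponential in the retry count, modelled on the formula
# described in Sidekiq's documentation, with the random jitter left out.
DEFAULT_BACKOFF = ->(count) { (count**4) + 15 }

# Run the block, retrying on any StandardError up to `max_retries` times.
# `backoff` maps the retry count to a delay in seconds; `sleeper` performs
# the wait (injectable so the example can run without actually sleeping).
def run_with_retries(max_retries:, backoff: DEFAULT_BACKOFF, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    yield attempts
  rescue StandardError
    raise if attempts >= max_retries # out of retries: surface the error

    sleeper.call(backoff.call(attempts))
    attempts += 1
    retry
  end
end
```

Both the number of retries and the delay between them are parameters, which is the part we previously needed extra Resque plugins for.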

Having the ability to give queues preference/importance over one another without having to segment workers for specific queues is also big for us. We often had many Resque workers doing nothing all day because they could only pull from a single queue that was heavily used overnight.
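
Here is a small plain-Ruby sketch of how weighted queue preference can work. It illustrates the idea and is not Sidekiq's source: the queue names, the weights, and the `polling_order` helper are all made up for this example. Each queue name is repeated according to its weight, the list is shuffled, and duplicates are dropped, so heavier queues tend to be checked first while every queue is still checked by every worker.

```ruby
# Hypothetical queues and weights for illustration.
WEIGHTS = { "critical" => 3, "default" => 2, "low" => 1 }.freeze

# Return the order in which a worker would check the queues on one fetch.
def polling_order(weights, rng: Random.new)
  # Repeat each name `weight` times so heavier queues are more likely
  # to land near the front after shuffling.
  expanded = weights.flat_map { |name, weight| [name] * weight }
  expanded.shuffle(random: rng).uniq
end
```

Because no worker is pinned to a single queue, nothing sits idle just because its one queue happens to be empty.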

Another big win that Sidekiq Enterprise got us was periodic jobs. To schedule jobs in Resque, you had to install another gem AND you had to start up a separate process to run the scheduler. Sidekiq's scheduler runs on a worker so your application does not need to keep track of another process.

Batches are another feature we got from Sidekiq that allow us to handle some of our more complicated job workflows. To handle those before in Resque, I kid you not, we had loops running that would inefficiently check if a job was still running and wait until it finished. We also had jobs that would re-enqueue themselves over and over again waiting for another job to complete. Removing all of those sleep statements and loops was really exciting when we were making the switch.
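
The idea behind a batch can be sketched in plain Ruby as a counter plus a callback, instead of a loop that keeps asking "is it done yet?". This is not the Sidekiq Batch API: `TinyBatch` is a hypothetical class, and threads stand in for background jobs.

```ruby
# Hypothetical, in-process stand-in for a batch of background jobs.
class TinyBatch
  # `jobs` is an array of callables; the block runs once after all finish.
  def initialize(jobs, &on_complete)
    @jobs = jobs
    @on_complete = on_complete
    @pending = jobs.size # known up front, so the callback cannot fire early
    @lock = Mutex.new
  end

  # Start every job on its own thread and return the threads.
  def start
    @jobs.map do |job|
      Thread.new do
        begin
          job.call
        ensure
          job_finished
        end
      end
    end
  end

  private

  # Decrement the counter; whichever job finishes last fires the callback.
  def job_finished
    last = @lock.synchronize { (@pending -= 1).zero? }
    @on_complete.call if last
  end
end
```

Nothing polls and nothing sleeps: the follow-up work is triggered by the last job to finish.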

Sidekiq plugins also have features that we couldn't find in Resque or in any of its plugins. Queue balancing was one of those features we were not able to find in any Resque plugins. For example, the ability to throttle across all jobs based on arguments like client ID is not something we found a solution for in Resque because of the way Resque handles its queues.

Dashboard and Command Line Interface

To get a good dashboard for Resque we had to include plugins like the resque-cleaner gem (yet another plugin not maintained) if we wanted any visibility into why jobs were failing. Right out of the box Sidekiq has a great UI that makes it easy to navigate through jobs and address issues when they arise right from the browser.

Because the Sidekiq API is so built out to handle the UI, talking to Sidekiq from a Rails console is incredibly easy and way more robust and performant than Resque. Things like deleting a specific class from a queue are easy. With Sidekiq Pro you also have the ability to pause queues which can come in handy if something gets out of hand.
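
For example, removing every job of one class from a queue can look roughly like the sketch below. The helper name `delete_jobs_of_class` is hypothetical. It assumes only that the queue is enumerable and that each entry responds to `klass` and `delete`, which is how entries of `Sidekiq::Queue` behave in Sidekiq's documented API.

```ruby
# Delete every job whose class name matches `class_name`.
# `queue` is expected to behave like Sidekiq::Queue: it yields entries
# that respond to #klass (the job class name as a String) and #delete.
# Returns the number of jobs deleted.
def delete_jobs_of_class(queue, class_name)
  deleted = 0
  queue.each do |job|
    next unless job.klass == class_name

    job.delete
    deleted += 1
  end
  deleted
end
```

From a Rails console you would call it as `delete_jobs_of_class(Sidekiq::Queue.new("default"), "SomeWorker")`, where the queue name and class name are placeholders.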

Operations

Our operations team was also fully bought into making the switch from Resque to Sidekiq. Resque has no built-in memory limits which meant that when jobs got out of hand they would take down whole servers with them. Eventually, our operations team implemented resource limits using systemd to put memory limits on processes to handle this issue. Sidekiq Enterprise, on the other hand, gives us the ability to limit our memory usage for each process to ensure that none of them take down our servers.

Customer Support

Because Sidekiq is actively being worked on and maintained, the support and response you get when you have an issue or a question is incredibly fast. With Resque plugins being poorly maintained or altogether inactive, your chance of finding help with any of them when you have a problem is not great.

Simplicity

Why have two tools when you can accomplish everything you want with one? Given that Sidekiq can now meet all of our demands, and does the job better for all of the above reasons, consolidating seems like a no-brainer. It also means that devs only need to be familiar with one background processing tool rather than two. This might seem small, but I can't tell you the number of times I have had to explain the difference between Sidekiq and Resque and help people figure out where to put a job. Now it is simple: we have one framework, and if you want to process something in the background, you make a job in that framework.

This has also vastly improved the lives of our operations team because they no longer need to keep track of multiple different types of workers. We have one less thing we have to deploy and keep track of in all our different environments.

Making the Switch

To actually make the switch took a few months because we did it slowly and methodically. We moved about 50 jobs from Resque to Sidekiq in small groups or one by one. Heavily used jobs were given some time to "bake" to ensure we had our queues balanced properly between all of our jobs. We had a few jobs that immediately OOM'ed when we first ran them on Sidekiq. This forced us to overhaul how those jobs used memory and, in the end, we were able to improve their performance dramatically.

Making the decision to switch from Resque to Sidekiq was one of the first big projects I spearheaded as an SRE. At the start of the project, I honestly was not 100% sure if this was going to be the right decision for us. We had gotten so used to Resque that to change it was a big undertaking. However, as I issued PR after PR to move jobs to Sidekiq, I started to feel really good about the decision. 90% of the PRs I issued were net code deletions. The code complexity was decreasing right before my eyes.

The switch, of course, did not come without its challenges. There were times when queues were balanced incorrectly and we had jobs back up. But even when those incidents occurred, the on-call devs who helped me handle and fix them were constantly complimenting how simple the Sidekiq API was to work with from a Rails console.

Looking back on the decision and the whole process, I am VERY happy to say we have no regrets. Just the other day we had to reboot some of our private cloud environments and operations was raving about how nice it was to only have to worry about one type of worker. For anyone running background jobs on Resque, hopefully this post has given you something to chew on if you are thinking about changing up your framework.

As always, let me know if anyone has any questions!

Discussion

Hi Molly, great article! Once I migrated an app from DelayedJob to Resque and then a year after that from Resque to Sidekiq (the migration should have been DJ to Sidekiq but some gems weren't thread safe initially and I didn't want to risk it).

Sidekiq, even the free version, is much better than Resque. Rails is too memory heavy to use OS processes for concurrency in a cost effective manner anyway :D

The main reason though is that the app at some point was creating hundreds of thousands (with peaks past a million) of short lived jobs per day and Resque just couldn't keep up and the company was spending too much anyway :D

In addition, jobs were created in batches of tens of thousands or hundreds of thousands, in a multi-tenant app. Basically: "Hi, I'm a client, and if I press this button I'm telling you, app, to queue 25 thousand jobs and process them now"

We were using Heroku so, in addition to being limited by Resque's speed, we had a cap on how many workers we could add: there's a finite number of PostgreSQL connections you can open, and even adding resque-pool wasn't enough. Redis's memory usage was also skyrocketing. It was all too much.

Long story short, we had to find a better solution. Sidekiq was that.

The app went from 100k jobs in Resque with 8 servers in pooling taking roughly 1 hour 40 minutes to 100k jobs in Sidekiq with pooling with 5 servers (with half the memory each), all in roughly 10 minutes.
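
Taking the figures in this comment at face value (and reading "1 hour 40 minutes" as 100 minutes), the per-server throughput works out as follows. The helper name is hypothetical and the numbers come straight from the comment.

```ruby
# Jobs processed per server per minute.
def jobs_per_server_minute(jobs:, servers:, minutes:)
  jobs.fdiv(servers * minutes)
end

resque  = jobs_per_server_minute(jobs: 100_000, servers: 8, minutes: 100) # 125.0
sidekiq = jobs_per_server_minute(jobs: 100_000, servers: 5, minutes: 10)  # 2000.0

per_server_gain = sidekiq / resque # 16.0
wall_clock_gain = 100.fdiv(10)     # 10.0
```

That is roughly 10 times faster end to end and 16 times more work per server, before accounting for each Sidekiq server having half the memory.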

I was annoyed that we didn't switch sooner but extremely happy :D We saved so much money and had much more visibility on what was going on through the better dashboard.

Sometimes tools are not equivalent, at all ;)

 

That's a big difference. Did you experience any failed/dropped jobs with Resque? I've read a few articles that suggest that if a Resque worker dies randomly, it will drop whatever jobs were on that worker without a trace.

 

We had issues with dirty exits and had a worker called the DirtyExitCleanupWorker which actually cleaned up those lost jobs. It was miserable.

Going to investigate further but that sounds about right. I had to switch something from an async Resque job to a synchronous long-running process running while SSH'd into an instance.

And this article from 2015 seems to also confirm the dirty exits issue: alfredo.motta.name/understanding-t...

TL;DR: Reduce the number of workers. Make smaller jobs.
...
I went from a 50% failure rate, to zero

It seems like the common failure rate for Resque is between 3% and 10% with some outliers of 0% or 1% or in this crazy case, 50%. I guess how acceptable this is depends on how much engineering talent you can spend on patching these issues.

 

Sidekiq is a really great piece of software with lots of features. I was using it in one of my projects, too. But since I only used it to process a few I/O tasks asynchronously (< 100 per day), I moved on to Brandon Hilkert's Sucker Punch. He also contributed to Sidekiq previously.

 

Sidekiq all the way. The main reason for us to move to Sidekiq was threading.