DEV Community

Discussion on: Switching From Resque to Sidekiq

rhymes

Hi Molly, great article! I once migrated an app from DelayedJob to Resque, and then a year later from Resque to Sidekiq (the migration should have gone straight from DJ to Sidekiq, but some gems weren't thread safe at first and I didn't want to risk it).

Sidekiq, even the free version, is much better than Resque. Rails is too memory heavy to use OS processes for concurrency in a cost effective manner anyway :D

The main reason though is that the app at some point was creating hundreds of thousands (with peaks past a million) of short lived jobs per day and Resque just couldn't keep up and the company was spending too much anyway :D

In addition, jobs were created in batches of tens of thousands or hundreds of thousands, in a multi-tenant app. Basically: "Hi, I'm a client, and if I press this button I'm telling you, app, to queue 25k jobs and process them now."

We were using Heroku, so in addition to being limited by Resque's speed we had a cap on how many workers we could add: there's a finite number of PostgreSQL connections you can open, and even adding resque-pool wasn't enough. Redis's memory usage was also skyrocketing. It was all too much.
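The connection ceiling described above comes down to simple arithmetic. Here is a rough sketch of it; the 120-connection limit, the 20 connections reserved for web processes, and the per-process connection counts are all made-up numbers for illustration, not figures from this app or from any Heroku plan:

```ruby
# Rough budget: how many worker processes fit under a database
# connection limit. Every number used below is hypothetical.
def max_worker_processes(connection_limit:, reserved_for_web:, connections_per_process:)
  available = connection_limit - reserved_for_web
  return 0 if available <= 0

  available / connections_per_process # integer division rounds down
end

# One connection per worker process.
single = max_worker_processes(
  connection_limit: 120, reserved_for_web: 20, connections_per_process: 1
)

# A process that holds a pool of 10 connections.
pooled = max_worker_processes(
  connection_limit: 120, reserved_for_web: 20, connections_per_process: 10
)

puts "single-connection processes: #{single}" # 100
puts "10-connection processes:     #{pooled}" # 10
```

Whatever the real limit is, once that budget is spent you can't add more workers, no matter how many servers you are willing to pay for.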

Long story short, we had to find a better solution. Sidekiq was that.

The app went from processing 100k jobs in roughly 1 hour 40 minutes on Resque (with pooling, on 8 servers) to processing 100k jobs in roughly 10 minutes on Sidekiq (with pooling, on 5 servers with half the memory each).
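To put those numbers side by side, here is a quick back-of-the-envelope calculation. It uses only the figures quoted above (100k jobs, 100 vs 10 minutes, 8 vs 5 servers); the `throughput` helper is just a name made up for this sketch, and the result ignores the fact that the Sidekiq servers also had half the memory:

```ruby
# Jobs per minute overall and per server, from the figures in the comment.
def throughput(jobs:, minutes:, servers:)
  per_minute = jobs / minutes.to_f
  { per_minute: per_minute, per_server: per_minute / servers }
end

resque  = throughput(jobs: 100_000, minutes: 100, servers: 8) # 1 h 40 min
sidekiq = throughput(jobs: 100_000, minutes: 10,  servers: 5)

overall_speedup    = sidekiq[:per_minute] / resque[:per_minute]
per_server_speedup = sidekiq[:per_server] / resque[:per_server]

puts "Resque:  #{resque[:per_minute]} jobs/min, #{resque[:per_server]} per server"
puts "Sidekiq: #{sidekiq[:per_minute]} jobs/min, #{sidekiq[:per_server]} per server"
puts "Speedup: #{overall_speedup}x overall, #{per_server_speedup}x per server"
```

That works out to roughly 1,000 jobs per minute before and 10,000 after, or 10x overall and 16x per server.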

I was annoyed that we didn't switch sooner, but extremely happy :D We saved so much money and had much more visibility into what was going on through the better dashboard.

Sometimes tools are not equivalent, at all ;)

Rudolf Olah

That's a big difference. Did you experience any failed/dropped jobs with Resque? I've read a few articles that suggest that if a Resque worker dies randomly, it will drop whatever jobs were on that worker without a trace.

Molly Struve (she/her)

We had issues with dirty exits and had a worker called the DirtyExitCleanupWorker which actually cleaned up those lost jobs. It was miserable.
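For anyone wondering what cleaning up after a dirty exit involves, here is a toy model of the general idea. It is not the DirtyExitCleanupWorker mentioned above and it does not use Resque's API: plain Ruby arrays stand in for Redis, and `reclaim_orphaned_jobs`, the job names, and the worker ids are all invented for illustration.

```ruby
# Finds in-flight jobs whose worker is no longer alive, removes them from
# the in-flight list, and returns them so they can be queued again.
def reclaim_orphaned_jobs(in_flight, live_worker_ids)
  orphaned, still_running = in_flight.partition do |entry|
    !live_worker_ids.include?(entry[:worker_id])
  end
  in_flight.replace(still_running)
  orphaned.map { |entry| entry[:job] }
end

queue = []
in_flight = [
  { job: "send_email:1", worker_id: "w1" },
  { job: "send_email:2", worker_id: "w2" }, # w2 was killed mid-job
  { job: "send_email:3", worker_id: "w3" }
]
live_workers = ["w1", "w3"]

queue.concat(reclaim_orphaned_jobs(in_flight, live_workers))

puts "requeued:        #{queue.inspect}"
puts "still in flight: #{in_flight.map { |e| e[:job] }.inspect}"
```

The hard part in real life is knowing reliably which workers are dead and making sure a requeued job is safe to run twice, which this sketch skips entirely.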

Rudolf Olah

Going to investigate further, but that sounds about right. I had to switch something from an async Resque job to a synchronous long-running process that I ran while SSH'd into an instance.

And this article from 2015 seems to also confirm the dirty exits issue: alfredo.motta.name/understanding-t...

TL;DR: Reduce the number of workers. Make smaller jobs.
...
I went from a 50% failure rate, to zero

It seems like the common failure rate for Resque is between 3% and 10% with some outliers of 0% or 1% or in this crazy case, 50%. I guess how acceptable this is depends on how much engineering talent you can spend on patching these issues.