Discussion on: Switching From Resque to Sidekiq


Hi Molly, great article! Once I migrated an app from DelayedJob to Resque and then a year after that from Resque to Sidekiq (the migration should have been DJ to Sidekiq but some gems weren't thread safe initially and I didn't want to risk it).

Sidekiq, even the free version, is much better than Resque. Rails is too memory heavy to use OS processes for concurrency in a cost effective manner anyway :D

The main reason though is that the app at some point was creating hundreds of thousands (with peaks past a million) of short lived jobs per day and Resque just couldn't keep up and the company was spending too much anyway :D

In addition, jobs were created in batches of tens of thousands or hundreds of thousands, in a multi-tenant app. Basically: "Hi, I'm a client, and if I press this button I'm telling you, app, to queue 25k jobs and process them now"

We were using Heroku so, in addition to being limited by Resque's speed, we had a cap on how many workers we could add: there's a finite number of PostgreSQL connections you can open, and even adding resque-pool wasn't enough. Redis's memory usage was also skyrocketing. It was all too much.

Long story short, we had to find a better solution. Sidekiq was that.

The app went from processing 100k jobs in Resque, with pooling on 8 servers, in roughly 1 hour 40 minutes, to processing 100k jobs in Sidekiq, with pooling on 5 servers (each with half the memory), in roughly 10 minutes.
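
To put those numbers side by side, here is a rough back-of-the-envelope sketch in Ruby. It only uses the approximate figures quoted above (100k jobs, 8 servers and about 100 minutes versus 5 servers and about 10 minutes); the `throughput` helper is a made-up name for illustration, and nothing here is a benchmark.

```ruby
# Rough arithmetic on the approximate figures from the comment above.
# `throughput` is a hypothetical helper, not part of Resque or Sidekiq.
JOBS = 100_000

def throughput(jobs:, minutes:, servers:)
  per_minute = jobs.to_f / minutes
  { per_minute: per_minute, per_server_per_minute: per_minute / servers }
end

resque  = throughput(jobs: JOBS, minutes: 100, servers: 8) # ~1 hour 40 minutes
sidekiq = throughput(jobs: JOBS, minutes: 10,  servers: 5) # ~10 minutes

# Ratio of total jobs per minute, and of jobs per minute per server.
overall_speedup    = sidekiq[:per_minute] / resque[:per_minute]
per_server_speedup = sidekiq[:per_server_per_minute] / resque[:per_server_per_minute]

puts "Resque:  #{resque[:per_minute].round} jobs/min, #{resque[:per_server_per_minute].round} per server"
puts "Sidekiq: #{sidekiq[:per_minute].round} jobs/min, #{sidekiq[:per_server_per_minute].round} per server"
puts "Speedup: #{overall_speedup}x overall, #{per_server_speedup}x per server"
```

So that is roughly 10x overall and 16x per server. Since each Sidekiq server also had half the memory, the gain per unit of memory would be about double the per-server figure, assuming memory scaled that simply.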

I was annoyed that we didn't switch sooner but extremely happy :D We saved so much money and had much more visibility on what was going on through the better dashboard.

Sometimes tools are not equivalent, at all ;)

Rudolf Olah • Sep 30 '19

That's a big difference. Did you experience any failed/dropped jobs with Resque? I've read a few articles that suggest that if a Resque worker dies randomly, it will drop whatever jobs were on that worker without a trace.

Molly Struve (she/her) • Oct 1 '19

We had issues with dirty exits and had a worker called the DirtyExitCleanupWorker which actually cleaned up those lost jobs. It was miserable.

Rudolf Olah • Oct 1 '19

Going to investigate further but that sounds about right. I had to switch something from an async Resque job to a synchronous long-running process running while SSH'd into an instance.

And this article from 2015 seems to also confirm the dirty exits issue: alfredo.motta.name/understanding-t...

TL;DR: Reduce the number of workers. Make smaller jobs.
...
I went from a 50% failure rate, to zero

It seems like the common failure rate for Resque is between 3% and 10% with some outliers of 0% or 1% or in this crazy case, 50%. I guess how acceptable this is depends on how much engineering talent you can spend on patching these issues.