Hi Molly, great article! I once migrated an app from DelayedJob to Resque, and then a year after that from Resque to Sidekiq (the migration should have gone straight from DJ to Sidekiq, but some gems weren't thread-safe initially and I didn't want to risk it).
Sidekiq, even the free version, is much better than Resque. Rails is too memory-heavy to use OS processes for concurrency in a cost-effective manner anyway :D
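For anyone wondering what the migration looks like in code, here's a minimal sketch. The class and method names are illustrative, not from my actual app; a Resque job uses a class-level `perform` plus a `@queue` ivar, while Sidekiq uses an instance-level `perform`:

```ruby
# Resque-style job: class-level perform, queue named via an ivar.
class ResqueStyleJob
  @queue = :default

  def self.perform(user_id)
    "processed #{user_id}"
  end
end

# The Sidekiq equivalent includes Sidekiq::Worker and defines an
# instance-level perform; with the sidekiq gem loaded it would be:
#
#   class SidekiqStyleJob
#     include Sidekiq::Worker
#
#     def perform(user_id)
#       "processed #{user_id}"
#     end
#   end

ResqueStyleJob.perform(7)  # => "processed 7"
```

Most of the work in practice was auditing the job bodies for thread safety, not rewriting the class boilerplate.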
The main reason, though, was that at some point the app was creating hundreds of thousands of short-lived jobs per day (with peaks past a million). Resque just couldn't keep up, and the company was spending too much anyway :D
In addition, jobs were created in batches of tens or hundreds of thousands, in a multi-tenant app. Basically: "Hi, I'm a client, and by pressing this button I'm telling your app to queue 25k jobs and process them now."
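One thing that helped on the Sidekiq side was bulk enqueueing instead of 25k individual pushes. A rough sketch, with hypothetical job names and IDs (the chunking is pure Ruby; the actual push, noted in a comment, uses Sidekiq's `Sidekiq::Client.push_bulk`, which enqueues many jobs in one Redis round-trip):

```ruby
# Hypothetical tenant items to process when the client presses the button.
ITEM_IDS = (1..25_000).to_a

# Chunk into payloads of 1k jobs each so no single push is huge.
batches = ITEM_IDS.each_slice(1_000).map do |slice|
  { "class" => "ProcessItemJob", "args" => slice.map { |id| [id] } }
end

# With the sidekiq gem loaded, each batch would then be enqueued via:
#   Sidekiq::Client.push_bulk(batch)
puts batches.size  # 25 payloads of 1,000 jobs each
```

Chunk size is a judgment call: bigger chunks mean fewer round-trips but larger Redis payloads.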
We were using Heroku, so in addition to being limited by Resque's speed, we had a cap on how many workers we could add: there's a finite number of PostgreSQL connections you can open, and even adding resque-pool wasn't enough. Redis's memory usage was also skyrocketing. It was all too much.
Long story short, we had to find a better solution. Sidekiq was that.
The app went from processing 100k jobs in Resque with 8 servers and pooling in roughly 1 hour 40 minutes, to 100k jobs in Sidekiq with pooling on 5 servers (with half the memory each) in roughly 10 minutes.
I was annoyed that we didn't switch sooner, but extremely happy :D We saved so much money and had much more visibility into what was going on thanks to the better dashboard.
That's a big difference. Did you experience any failed/dropped jobs with Resque? I've read a few articles that suggest that if a Resque worker dies randomly, it will drop whatever jobs were on that worker without a trace.
Going to investigate further but that sounds about right. I had to switch something from an async Resque job to a synchronous long-running process running while SSH'd into an instance.
TL;DR: Reduce the number of workers. Make smaller jobs.
...
I went from a 50% failure rate to zero.
It seems like the common failure rate for Resque is between 3% and 10%, with some outliers at 0% or 1% or, in this crazy case, 50%. I guess how acceptable this is depends on how much engineering talent you can spend on patching these issues.
Sometimes tools are not equivalent, at all ;)
We had issues with dirty exits, and had a worker called the DirtyExitCleanupWorker which actually cleaned up those lost jobs. It was miserable.
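For context, here's a hedged sketch of what a cleanup worker like that might do. The `DirtyExitCleanupWorker` name comes from the comment above; the key layout mirrors how Resque records in-flight jobs under per-worker keys (`resque:worker:<host>:<pid>:<queues>`), but treat the details as assumptions. A plain Hash stands in for Redis so the sketch is self-contained:

```ruby
require "json"

# Fake "Redis": one in-flight job, recorded by a worker with pid 1234.
redis = {
  "resque:worker:host:1234:default" => {
    "queue"   => "default",
    "payload" => { "class" => "ReportJob", "args" => [42] }
  }.to_json
}
live_pids = []  # pid 1234 is not running anymore: a dirty exit

requeued = []
redis.each do |key, raw|
  pid = key.split(":")[3].to_i
  next if live_pids.include?(pid)  # worker is still alive, leave it be

  job = JSON.parse(raw)
  requeued << job["payload"]
  # With a real Redis connection, the job would go back onto its queue:
  #   redis.rpush("resque:queue:#{job['queue']}", job['payload'].to_json)
end

requeued.first["class"]  # => "ReportJob"
```

The miserable part is everything this glosses over: hostnames with colons in worker IDs, races with workers that are alive but on another machine, and jobs that aren't safe to run twice.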
And this article from 2015 seems to also confirm the dirty exits issue: alfredo.motta.name/understanding-t...