At Appaloosa, we use Sidekiq a lot. In fact, we use it for all the background jobs we need. We also use a database (yes, this sounds crazy, I know.) with Active Record as ORM. Sometimes, Sidekiq jobs need to do stuff in our database, this looks like a normal situation, until…
The recurring error…
Until, this error pops many times:
could not obtain a connection from the pool within 5.000 seconds (waited 5.002 seconds); all pooled connections were in use (ActiveRecord::ConnectionTimeoutError)
And it pops many times, every day, at the same hour. Something like late at night, so yes, let’s try to understand and fix it.
But how can this happen?
Ok so, after checking all our monitoring tools we figured out that the error pops inside a Sidekiq job. And then, with more than 20 tabs opened on the browser (obviously, many were the same, opened from different sources, at different time), we found the theory and theoretically, the solution.
Theory
First of all, we have Active Record on one side with its proper configuration file. In this config/database.yml
file, you can find something like:
Remember the error? could not obtain a connection from the pool
. It must have something to do with that pool
configuration!
Ok let’s check it. The official documentation tells us that:
Active Record database connections are managed by ActiveRecord::ConnectionAdapters::ConnectionPool which ensures that a connection pool synchronizes the amount of thread access to a limited number of database connections.
and
If you try to use more connections than are available, Active Record will block you and wait for a connection from the pool. If it cannot get a connection, a timeout error similar to that given below will be thrown.
Here it is! This is what is happening:
Ok, then, what about the other side? How is Sidekiq managing its threads to prevent this error? The official wiki is pretty straightforward:
You can tune the amount of concurrency in your sidekiq process. By default, one sidekiq process creates 10 threads.
So, it creates as many threads as we mentioned on our config/sidekiq.yml
There is also a little note on the wiki:
Note that ActiveRecord has a connection pool which needs to be properly configured in config/database.yml to work well with heavy concurrency. Set the pool setting to something close or equal to the number of threads
THAT’S IT! If we have Sidekiq concurrency value equal to Active Record pool value we can’t have more threads than the accepted by Active Record, and no more errors. Voilà! Let’s check those values on our project, something should not be right.
and
??
So… is everything ok?
Ok, now things are really strange. On our configurations files we already have 3 for concurrency and 3 for pool fields, then, according to documentations, everything should work and we should not have any issues… But we have one! Then, I don’t know how but somewhere, at a time, 3 looks not to be equal to 3.
Let’s double check everything, somewhere one of those value has to be overridden.
Of course, the answer was NOPE.
Let’s investigate 🕵🏻♀️
As the error occurred each day at the same hour, we looked at the piece of code running at that moment and checked the logs.
We analyzed 40 lines that occured in less than 10 seconds. The result is a beautiful file named sidekiq logs sherlock
with many background colors and others for the text. And we found something interesting…
Can we reproduce it?
Of course, we can try. Let’s create a tiny Rails app with a database and Sidekiq. Let’s have two configuration files with the same configuration, as we have seen before:
Now we can create a worker:
This is a simple worker, doing nothing but calling a service which is going to do some stuff. For this service, we tried to reproduce the code we have, but this code is calling some other gem. To avoid complexity, we try to simplify it but keeping the main behavior.
Try to run your worker and look at the result…
ruby-2.5.3/gems/activerecord-5.2.1.1/lib/active_record/connection_adapters/abstract/connection_pool.rb:199:in `block in wait_poll’: could not obtain a connection from the pool within 5.000 seconds (waited 5.010 seconds); all pooled connections were in use (ActiveRecord::ConnectionTimeoutError)
We are not sure that is exactly mimicking our behavior as the code uses transitive dependencies that are creating threads. But it looks like this and we ran into the same issue.
And it makes sense. Remember, we said that for each process Active Record assigns x
connections depending on the pool
value. Each connection is made by one thread. If our Sidekiq concurrency
value is equal to the pool
value, we should not have any problem, unless, something is creating a new thread and calling the database. That’s the case on our example above with the MyService
class!
The why!
In fact, it’s possible to do async job here but we need to be sure that we are not dealing with database, or at least, it’s not creating too much connections. In our case, we are inside a deep loop, which leads to too many database calls. 💥
If you don’t want to have some headaches, be careful on what you put inside your workers 🙂
Photo by Ayo Ogunseinde on Unsplash
Thank you for reading! 🙏
You should subscribe to our blog to be notified whenever a new article is available! 👋
This article was written by Mélanie Araujo Martins of Appaloosa’s dev team.
Top comments (3)
Notice that the real problem is that ActiveRecord doesn't automatically release checked out connections back into the connection pool. You can see that by running the following script:
Contrast that with Sequel which checks out connections just fine:
Nice One
Thanks ❤