Recently, our management needed a way to export invoices in bulk. After the manager selects the first and last invoice for the batch in a web form, an asynchronous process should start that generates PDF files for the invoices, packs them into a zip file and sends the manager an email with a link to download the export. Now, generating the PDFs is slow, very slow. For larger batches involving hundreds or thousands of invoices, this process can easily take 10 or 15 minutes or even more.
So how do we trigger such a long-running process from a Rails request? The first option that comes to mind is a background job run by one of the queuing back ends such as Sidekiq, Resque or DelayedJob, possibly through ActiveJob. While this would surely work, all these solutions usually have a limited number of workers available on the server, and we didn’t want to block other important background tasks for that long.
What we wanted instead was to run a new, separate process from the Rails request. Something like running a Rake task but triggered by a web request. In fact, we even had the bulk export already implemented as a Rake task, so what we actually wanted was to make this task accessible from our admin web interface.
The standard way on Unix-like systems to spawn a new process is to fork it. In a Rails controller, forking a rake task could look like this:
```ruby
class BulkInvoiceExportsController < ApplicationController
  def create
    child = fork do
      exec("bin/rails export_invoices FROM=20220001 TO=20220100 \\
            >> /tmp/bulk_invoices_export.log 2>&1")
    end
    Process.detach(child)
  end
end
```
Let’s note a few things about the code inspired by this StackOverflow answer:
- The `Process.fork` method splits the current process (only its current thread) into two copies, and the new child process runs the code in the block.
- The child process is then replaced with a newly loaded process using `exec`.
- The final child process inherits all important settings from the parent process, such as environment variables, open file descriptors or the current working directory. This is why we can simply run `bin/rails` without having to set up the correct ruby first (even when using a ruby version manager such as `chruby`) and without specifying an absolute path to the Rails binary.
- Because the command string uses shell redirection, the child Rails process is not executed directly but through a standard shell (usually `/bin/sh`). Redirection allows us to debug and monitor what is going on in the rake task.
- By default, the operating system expects that the parent process is interested in the child process termination status. We are not: we want to run the rake task and forget about it, since the task handles everything else, such as sending the final email, by itself. That’s why we call `Process.detach`: it makes Ruby collect the child’s exit status in a background thread, so we don’t have to care about the child process and no zombie processes accumulate.
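The points above can be tried outside of Rails. The following is only a minimal sketch for Unix-like systems: the `run_detached` helper, the `EXPORT_DEMO` variable and the log file name are made up for this illustration, and a harmless `echo` stands in for the rake task.

```ruby
require "shellwords"
require "tmpdir"

# Starts `command` in a separate process and returns immediately.
# `run_detached` is a hypothetical helper, not part of our application.
def run_detached(command, log_path)
  child = fork do
    # Only the child runs this block. exec replaces the forked Ruby process;
    # because the string contains redirection, Ruby hands it to /bin/sh.
    exec("#{command} >> #{Shellwords.escape(log_path)} 2>&1")
  end

  # Ruby reaps the child in a background thread, so no zombie is left behind.
  # The returned thread can be ignored, or joined if we do want to wait.
  Process.detach(child)
end

ENV["EXPORT_DEMO"] = "set in the parent" # inherited by the child
log_path = File.join(Dir.tmpdir, "fork_demo_#{Process.pid}.log")

waiter = run_detached('echo "EXPORT_DEMO is $EXPORT_DEMO, cwd is $PWD"', log_path)
waiter.join # only for the demo; the controller would not wait
puts File.read(log_path)
```

The log should show the variable set in the parent and the parent's working directory, which is exactly why `bin/rails` can be started with a relative path.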
If we wanted to make our code more portable (usable on Windows, for example), we would have to use `Process.spawn` instead of `fork`, as suggested in the ruby documentation. The `spawn` method also allows us to fine-tune the child process environment, file descriptors, limits or working directory.
An almost equivalent way of scheduling the rake task using `spawn` could be written this way:
```ruby
class BulkInvoiceExportsController < ApplicationController
  def create
    child = spawn("bin/rails export_invoices FROM=20220001 TO=20220100",
                  %i[out err] => %w[/tmp/bulk_invoices_export.log a])
    Process.detach(child)
  end
end
```
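In a real controller, the `FROM` and `TO` values would come from the web form rather than being hard-coded. Here is a hedged sketch of how `spawn` could take them: the values go into the child's environment through the optional hash argument and the command is given as separate words, so no shell ever parses user input. The `spawn_export` helper, its keyword arguments and the numeric check are assumptions made for this illustration, and a small Ruby one-liner stands in for the rake task so the snippet runs without a Rails app.

```ruby
require "rbconfig"
require "tmpdir"

# Hypothetical helper: starts the export for a user-supplied invoice range.
def spawn_export(from, to, log_path:, app_root: Dir.pwd,
                 command: %w[bin/rails export_invoices])
  # Accept plain invoice numbers only.
  unless [from, to].all? { |value| value.to_s.match?(/\A\d+\z/) }
    raise ArgumentError, "invoice numbers must be numeric"
  end

  env = { "FROM" => from.to_s, "TO" => to.to_s } # added to the child's environment
  pid = Process.spawn(env, *command,             # several words: no shell involved
                      chdir: app_root,           # working directory of the child
                      %i[out err] => [log_path, "a"])
  Process.detach(pid)
end

log_path = File.join(Dir.tmpdir, "spawn_demo_#{Process.pid}.log")
# Stand-in for the rake task: prints the range it received.
stand_in = [RbConfig.ruby, "-e", 'puts "exporting #{ENV["FROM"]}..#{ENV["TO"]}"']

spawn_export("20220001", "20220100", log_path: log_path, command: stand_in).join
puts File.read(log_path)
```

With the default `command`, the same call would start `bin/rails export_invoices` from `app_root`, and the task would read the range from `ENV` just as it does when run from the command line.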
Please keep in mind that triggering such a long-running process from the controller is not safe. In the previous examples, each request to the `create` action of the controller spawns one external Rails process, which may consume a substantial portion of the CPU and memory resources and opens more connections to your database servers. Such a setup is very vulnerable to DoS attacks.
The technique is probably OK only in very controlled environments, such as an internal admin area accessible to a limited number of people who know what they are doing, and only when the function is used sparingly. If we wanted to make this rake task publicly accessible (as in a "data takeout" function, for example), we would definitely resort to a real queuing system such as those mentioned above, or perhaps a queuing daemon on the system level (e.g. `atd`, which can hold the tasks based on the server load).
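Even in an internal admin area, an impatient double click could start the same export twice. One possible safeguard, which is not part of our setup and is shown here only as a sketch for Unix-like systems, is an advisory lock file: the parent takes the lock with `flock` and the child keeps the locked descriptor open across `exec`, so the lock disappears by itself when the export ends. The `start_export_once` helper and the lock file name are hypothetical, and `sleep` stands in for the rake task.

```ruby
require "tmpdir"

# Hypothetical guard: starts `command` unless a previous run still holds the lock.
# Returns the detach thread, or nil when an export is already running.
def start_export_once(command, lock_path)
  lock = File.open(lock_path, File::RDWR | File::CREAT, 0o644)

  # Non-blocking exclusive lock; flock returns false if someone else holds it.
  unless lock.flock(File::LOCK_EX | File::LOCK_NB)
    lock.close
    return nil
  end

  child = fork do
    # Ruby marks files close-on-exec by default. Clearing the flag lets the
    # locked descriptor survive exec, so the lock lives as long as the child.
    lock.close_on_exec = false
    exec(*command)
  end

  lock.close # closes only the parent's copy; the child still holds the lock
  Process.detach(child)
end

lock_path = File.join(Dir.tmpdir, "export_demo_#{Process.pid}.lock")

first  = start_export_once(%w[sleep 1], lock_path) # starts the stand-in task
second = start_export_once(%w[sleep 1], lock_path) # refused while it is running
first.join
third  = start_export_once(%w[sleep 1], lock_path) # allowed again afterwards

puts "second attempt refused: #{second.nil?}"
puts "third attempt started: #{!third.nil?}"
```

This limits concurrent exports to one, but it does not replace a proper queue: refused requests are simply dropped, so the controller would have to tell the user to try again later.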
Anyway, for our use case, directly forking the rake task from the controller was the most pragmatic way to go and we are happy about the result.