DEV Community

Augusts Bautra


How I stopped RSpec from spiking to 2x runtime

TL;DR

  1. Make as many specs as possible transactional (this can even be done for Cucumber features!), especially files that use shared examples, since those usually contain many examples.
  2. In examples that cannot be transactional and actually write to the DB, try switching DatabaseCleaner from :truncation to :deletion. In our case, running PostgreSQL, truncation would often randomly stall for 2 minutes. Deletion sidesteps this.
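
A minimal sketch of what such a setup could look like, assuming the database_cleaner-active_record gem. The `:non_transactional` tag is a made-up name for illustration, not something from our project; use whatever metadata your suite already has for specs that cannot run inside a transaction.

```ruby
# spec/support/database_cleaner.rb (hypothetical location)
require "database_cleaner/active_record"

RSpec.configure do |config|
  # With rspec-rails you would typically also set
  # config.use_transactional_fixtures = false
  # so Rails' own transactions do not overlap with DatabaseCleaner.

  config.before(:suite) do
    # Start the whole run from a clean database.
    DatabaseCleaner.clean_with(:deletion)
  end

  config.around(:each) do |example|
    # :non_transactional is an assumed tag name. Untagged examples are rolled
    # back in a transaction; tagged ones are cleaned with DELETE, not TRUNCATE.
    DatabaseCleaner.strategy =
      example.metadata[:non_transactional] ? :deletion : :transaction

    DatabaseCleaner.cleaning { example.run }
  end
end
```

An example group would then opt in with `describe "checkout", :non_transactional do ... end`, while everything else stays transactional by default.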

The Story

I noticed a huge variance in how long our parallel RSpec runners took on CI: they sometimes spiked from the 12min average to 20min and more, and routinely exceeded the average by several minutes.

This seemed extremely suspicious because we use Knapsack, which should ensure near-equal finishing times for all runners.
Luckily, Knapsack stores run data, so I was able to identify the repeat offenders and their common thread: the spiking specs were writing to the database and then being cleaned up by DatabaseCleaner. The project had a complex DB setup, so I reached for the lowest-hanging fruit: I tried the :deletion cleanup strategy instead of :truncation, and it worked.
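
To illustrate the kind of digging this involved, here is a rough sketch, not the script I actually ran. It assumes each run's timing data is available as a hash of spec file path to seconds. `spiking_specs`, the 60-second threshold, and the sample data are all made up for the example.

```ruby
# Median of a list of numbers (used as the "typical" runtime of a spec file).
def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Hypothetical helper: given several timing reports ({ spec_file => seconds }),
# return [file, spike] pairs where the slowest run exceeds the median run
# by at least min_spike_seconds, worst offenders first.
def spiking_specs(reports, min_spike_seconds: 60)
  timings = Hash.new { |hash, file| hash[file] = [] }
  reports.each do |report|
    report.each { |file, seconds| timings[file] << seconds.to_f }
  end

  timings
    .map { |file, runs| [file, (runs.max - median(runs)).round(1)] }
    .select { |_file, spike| spike >= min_spike_seconds }
    .sort_by { |_file, spike| -spike }
end

# Made-up sample data: three CI runs, one spec file spiking in the second run.
reports = [
  { "spec/models/order_spec.rb" => 30.0, "spec/features/checkout_spec.rb" => 45.0 },
  { "spec/models/order_spec.rb" => 31.0, "spec/features/checkout_spec.rb" => 165.0 },
  { "spec/models/order_spec.rb" => 29.0, "spec/features/checkout_spec.rb" => 47.0 }
]

spiking_specs(reports).each do |file, spike|
  puts "#{file}: slowest run was #{spike}s above its median"
end
```

Comparing against the median rather than the mean keeps a single stalled run from dragging the baseline up with it.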

[Image: chart of per-runner durations across CI runs]

In the image you can see CI runs. Each vertical cluster of dots is the spread of how long each parallel runner took. Ideally we'd like to see very little spread, and have the dots as close to taking 0s as possible.

The magenta line shows when I merged a rework of some long-running specs, turning ones that write to the DB into transactional ones. Because the spread is still there, the effect is hard to see, but it took at least 60s off the CI run average.

The Green line, however, is why you are here. This is where I merged the change from truncation to deletion. No more random spikes to 20+min runtimes.

Monitoring Is Key

Needless to say, without data on how long runs were taking, I wouldn't have had the opportunity to notice that something was amiss (besides sometimes having to wait on CI for much longer than usual). Access to quality data and trends can help prevent problems before they even arise.
