Recently, DEV was struggling with CI build times. It got to the point where Travis builds were taking up to 30 minutes to complete. On top of that, we had a few flaky specs in our build, and the result was some very frustrated developers.
When I began trying to figure out how to solve this issue, the first strategy that came to mind was parallelizing the build process. This means splitting up our test suite into chunks and running each of those chunks at the same time. The challenge with this is figuring out how to split the chunks up so that each chunk runs in the same amount of time.
For example, if you have a 30 min build and you want to split it into 3 parallel builds, ideally each build runs for 10 min. To do this, you have to figure out how long each of your tests takes to run. While I was grappling with this problem, one of my coworkers suggested I check out KnapsackPro.
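To see why timing data matters, here is a small, hypothetical Ruby sketch. The spec file names and per-file timings are made up; they add up to 30 minutes (1800 s) to mirror the example above. The sketch deals files out to 3 nodes by count alone and then measures the result. A parallel build only finishes when its slowest node finishes, so the wall-clock time is the largest node total rather than the average.

```ruby
# Hypothetical per-file timings in seconds (made-up numbers that sum to 30 min).
TIMINGS = {
  "spec/models/user_spec.rb"      => 300,
  "spec/models/article_spec.rb"   => 240,
  "spec/requests/api_spec.rb"     => 600,
  "spec/system/editor_spec.rb"    => 420,
  "spec/helpers/markdown_spec.rb" => 60,
  "spec/lib/parser_spec.rb"       => 180,
}.freeze

# Ideal wall-clock time per node: the total runtime divided evenly.
def ideal_node_time(timings, nodes)
  timings.values.sum.fdiv(nodes)
end

# Naive split: deal files round-robin by count, ignoring how long each one takes.
def split_by_count(files, nodes)
  groups = Array.new(nodes) { [] }
  files.each_with_index { |file, i| groups[i % nodes] << file }
  groups
end

# A parallel build finishes when its slowest node finishes.
def wall_clock(groups, timings)
  groups.map { |group| group.sum { |file| timings.fetch(file) } }.max
end

groups = split_by_count(TIMINGS.keys, 3)
puts ideal_node_time(TIMINGS, 3) # prints 600.0 (10 min)
puts wall_clock(groups, TIMINGS) # prints 780 (13 min)
```

Even with the same number of files per node, one node ends up running 3 minutes longer than the ideal 10.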
NOTE: KnapsackPro has two modes, Queue Mode and Regular Mode. DEV currently uses Regular Mode, which is what I cover in the rest of this blog post.
KnapsackPro splits your tests evenly between parallel CI builds so that the whole suite finishes as quickly as possible. It does this by recording the time each test takes to run. It then uses that timing data to split your tests into however many groups you choose, with each group taking roughly the same amount of time. This sounded like a great solution, so I started digging into the docs to figure out how to get it all set up.
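The post does not describe the exact algorithm KnapsackPro uses to split by timing. A common heuristic for this kind of problem is "longest processing time first": sort the files slowest-first, then always hand the next file to the node with the least total time so far. The Ruby below is a hypothetical sketch of that heuristic, not KnapsackPro's implementation. The file names and timings are made up.

```ruby
# Hypothetical per-file timings in seconds (made-up numbers that sum to 30 min).
TIMINGS = {
  "spec/models/user_spec.rb"      => 300,
  "spec/models/article_spec.rb"   => 240,
  "spec/requests/api_spec.rb"     => 600,
  "spec/system/editor_spec.rb"    => 420,
  "spec/helpers/markdown_spec.rb" => 60,
  "spec/lib/parser_spec.rb"       => 180,
}.freeze

# Greedy "longest processing time first" split: slowest files are placed first,
# each one going to whichever node currently has the smallest total runtime.
def split_by_timing(timings, nodes)
  groups = Array.new(nodes) { { files: [], total: 0 } }
  timings.sort_by { |_file, seconds| -seconds }.each do |file, seconds|
    lightest = groups.min_by { |group| group[:total] }
    lightest[:files] << file
    lightest[:total] += seconds
  end
  groups
end

split_by_timing(TIMINGS, 3).each_with_index do |group, index|
  puts "node #{index}: #{group[:total]}s #{group[:files].inspect}"
end
```

With these numbers every node lands on exactly 600 s (10 min). Real timings rarely divide this cleanly, but the greedy approach usually gets close to the ideal.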
Before I go any further, I have to say that the KnapsackPro docs are some of the best I have worked with. They offer a thorough step-by-step setup plan for whatever language or testing framework you are using. In addition, their FAQ docs cover just about every possible buggy scenario you might run into. All of this made setting up KnapsackPro a straightforward process.
Here are the steps to set up KnapsackPro with a Rails project. I pulled these straight out of the KnapsackPro gem installation guide.
Add these lines to your application's Gemfile and then run bundle install:
group :test, :development do
  gem 'knapsack_pro'
end
Once you have the gem installed, it's time to set up your configuration based on the testing framework you use. To help you figure out how to configure the gem, KnapsackPro gives you a handy installation guide.
Simply select your testing frameworks and gems, and it will tell you what to add to your configuration. In our case, we use RSpec, WebMock, and Travis CI, which gave us these additional steps to perform.
Add the following at the beginning of your RSpec helper file:
require 'knapsack_pro'
KnapsackPro::Adapters::RSpecAdapter.bind
We use VCR and WebMock, so we needed to add the Knapsack Pro API subdomain to the ignored and allowed hosts in those configurations.
require 'vcr'
VCR.configure do |config|
  config.hook_into :webmock # or :fakeweb
  config.ignore_hosts('localhost', '127.0.0.1', '0.0.0.0', 'api.knapsackpro.com')
end

# add below when you hook into webmock
require 'webmock/rspec'
WebMock.disable_net_connect!(allow_localhost: true, allow: ['api.knapsackpro.com'])
To make sure everything loads properly, set require: false for the webmock gem in your Gemfile when VCR is hooked into it.
group :test do
  gem 'vcr'
  gem 'webmock', require: false
end
The docs also state:
If you happen to see your tests failing due to WebMock not allowing requests to Knapsack Pro API it means you probably reconfigure WebMock in some of your tests. For instance, you may use WebMock.reset! or it's called automatically in the after(:each) block, if you require 'webmock/rspec'. These setups will remove api.knapsackpro.com from allowed domains. Please try below to fix this issue:
RSpec.configure do |config|
  config.after(:suite) do
    WebMock.disable_net_connect!(
      allow_localhost: true,
      allow: [
        'api.knapsackpro.com',
      ],
    )
  end
end
Using the Travis matrix feature, we were able to parallelize our builds on Travis with the following updates to our Travis configuration:
script:
  - "bundle exec rake knapsack_pro:rspec"

env:
  global:
    # tokens should be set in travis settings in web interface to avoid exposing tokens in build logs
    - KNAPSACK_PRO_TEST_SUITE_TOKEN_RSPEC=rspec-token
    - KNAPSACK_PRO_CI_NODE_TOTAL=3
  jobs:
    - KNAPSACK_PRO_CI_NODE_INDEX=0
    - KNAPSACK_PRO_CI_NODE_INDEX=1
    - KNAPSACK_PRO_CI_NODE_INDEX=2
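In Regular Mode, the split is decided up front, and each Travis job runs only the group that matches its KNAPSACK_PRO_CI_NODE_INDEX. The knapsack_pro gem and API handle this for you. The Ruby below is only a hypothetical sketch of the idea: node_settings and files_for_node are made-up helper names, and the groups are assumed to have been computed already.

```ruby
# Hypothetical helper: read the node settings exported by the Travis matrix.
# Defaults to a single node when the variables are missing.
def node_settings(env = ENV)
  total = Integer(env.fetch("KNAPSACK_PRO_CI_NODE_TOTAL", "1"))
  index = Integer(env.fetch("KNAPSACK_PRO_CI_NODE_INDEX", "0"))
  unless (0...total).cover?(index)
    raise ArgumentError, "node index #{index} is out of range for #{total} nodes"
  end
  [total, index]
end

# Every job sees the same precomputed split and keeps only its own group.
def files_for_node(groups, env = ENV)
  total, index = node_settings(env)
  unless groups.size == total
    raise ArgumentError, "expected #{total} groups, got #{groups.size}"
  end
  groups[index]
end

groups = [
  ["spec/models/user_spec.rb"],
  ["spec/requests/api_spec.rb"],
  ["spec/system/editor_spec.rb"],
]
env = { "KNAPSACK_PRO_CI_NODE_TOTAL" => "3", "KNAPSACK_PRO_CI_NODE_INDEX" => "1" }
p files_for_node(groups, env) # prints ["spec/requests/api_spec.rb"]
```

Because the split is fixed before the jobs start, a job that fails (for example, on a flaky spec) can be retried on its own without rerunning the other groups.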
Below is the DEV Pull Request that made all of these changes in our repository.
What type of PR is this? (check all applicable)
- [x] Optimization
This PR introduces a service called Knapsack to help us parallelize our spec suite as evenly as possible. The first time I ran Knapsack in our build, it ran every test separately and recorded how long each one took. Using this information, Knapsack then splits the tests into 3 equally timed groups to run in parallel every time we run our test suite. This is how Regular Mode works.
The changes in this PR are introducing the gem and setting it up. All of them were made with the help of the Knapsack installation guide which walks you through all the changes you should make to get it working properly.
NOTE - There is currently a bug with the parallelization in Travis that causes the --local flag for our bundler command to be ignored. This means that on your first build, since there is no Travis cache, the jobs will likely take 13 min. I am in contact with Travis support to get this resolved.
Why aren't we using Queue Mode? Ideally, we want to use Queue Mode. In Queue Mode, Knapsack sends us groups of 3-5 specs at a time and, when they finish, sends another group. It keeps doing this until all specs have been run. This is the fastest approach, but we ran into some errors with the jobs hanging. My goal is to get Regular Mode out first and then debug the hanging issue so we can use Queue Mode.
How much does Knapsack cost? It's free because we are open source, and I must say the founder has been extremely helpful in getting us going and holding my hand through the integration process.
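The Queue Mode behavior described in the PR can be illustrated with a hypothetical, single-process Ruby sketch. Here threads stand in for CI nodes. A Thread::Queue stands in for the queue, which in the real setup is managed by the Knapsack Pro API. Workers keep pulling small batches until the queue is empty, so a worker that happens to get fast specs simply pulls more batches. The file names and the batch size of 3 are made up.

```ruby
# Hypothetical spec files; in real Queue Mode the batches come from the Knapsack Pro API.
SPEC_FILES = (1..11).map { |i| "spec/example_#{i}_spec.rb" }.freeze

def run_queue(files, workers:, batch_size:)
  queue = Queue.new
  files.each_slice(batch_size) { |batch| queue << batch }
  queue.close # no more batches will be added

  results = Queue.new
  threads = Array.new(workers) do |worker_id|
    Thread.new do
      # Queue#pop returns nil once the queue is closed and empty.
      while (batch = queue.pop)
        # Stand-in for actually running the specs in this batch.
        batch.each { |file| results << [worker_id, file] }
      end
    end
  end
  threads.each(&:join)

  Array.new(results.size) { results.pop }
end

processed = run_queue(SPEC_FILES, workers: 3, batch_size: 3)
processed.group_by(&:first).sort.each do |worker_id, pairs|
  puts "worker #{worker_id} ran #{pairs.size} files"
end
```

Which worker runs which batch varies between runs. That is the point of Queue Mode: the work is balanced dynamically as the jobs run instead of being fixed up front.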
If you click through to the pull request, you will notice we also made some additional changes to ensure our code coverage checks and other CI steps ran efficiently and correctly with our new parallel builds.
With all of those changes in place, you will then want to push your branch up and let the KnapsackPro API do its thing. Keep in mind, the first run will NOT be optimal because the knapsack_pro gem will record the execution time of every one of your tests.
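The knapsack_pro gem does this recording for you on the first run and reports the results so they show up in your dashboard. As a rough, hypothetical illustration of what "recording execution time" involves, the sketch below times each spec file with Ruby's monotonic clock and serializes the results as JSON. TimingRecorder is a made-up class, and the sleep calls stand in for running real spec files.

```ruby
require "json"

# Hypothetical recorder: measures how long each spec file takes to run.
class TimingRecorder
  attr_reader :timings

  def initialize
    @timings = {}
  end

  # Runs the block (a stand-in for executing one spec file) and records its
  # duration in seconds, even if the block raises.
  def record(path)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    @timings[path] = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
end

recorder = TimingRecorder.new
recorder.record("spec/models/user_spec.rb") { sleep 0.05 }
recorder.record("spec/lib/parser_spec.rb") { sleep 0.01 }
puts JSON.pretty_generate(recorder.timings)
```

A monotonic clock is used instead of Time.now so that system clock adjustments during a long build cannot produce negative or inflated durations.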
To make sure everything was recorded successfully, you can check the build metrics on your KnapsackPro API dashboard.
Here you can find everything from node build times to a breakdown of how long each test took to run. Once KnapsackPro has that data, then it can strategically split your tests up as evenly as possible for all future builds. Your second test suite run on your CI provider will be parallelized with the optimal test suite split if the first run was recorded correctly.
One hiccup we ran into when implementing KnapsackPro was that it would not work for forks, because forks do not have access to our KnapsackPro tokens in Travis. We ended up seeing this error:
Missing environment variable KNAPSACK_PRO_TEST_SUITE_TOKEN. You should set environment variable like KNAPSACK_PRO_TEST_SUITE_TOKEN_RSPEC (note there is suffix _RSPEC at the end). knapsack_pro gem will set KNAPSACK_PRO_TEST_SUITE_TOKEN based on KNAPSACK_PRO_TEST_SUITE_TOKEN_RSPEC value. If you use other test runner than RSpec then use proper suffix.
Once again, the great KnapsackPro docs came to the rescue with a section in the FAQ that explained how to get KnapsackPro to work with forked branches.
The TL;DR of the solution is that we had to create an executable file, bin/knapsack_pro_rspec, in our main project repository to handle the missing tokens:
#!/bin/bash
if [ "$KNAPSACK_PRO_TEST_SUITE_TOKEN_RSPEC" = "" ]; then
  KNAPSACK_PRO_ENDPOINT=https://api-disabled-for-fork.knapsackpro.com \
    KNAPSACK_PRO_TEST_SUITE_TOKEN_RSPEC=disabled-for-fork \
    bundle exec rake knapsack_pro:rspec
else
  # Regular Mode
  bundle exec rake knapsack_pro:rspec
fi
Then, in our .travis.yml file, we replaced bundle exec rake knapsack_pro:rspec with bin/knapsack_pro_rspec. You can see the changes in this PR:
When the token is missing, the new script points the gem at a disabled KnapsackPro API endpoint, so the request fails. Upon failure, the gem falls back to grouping tests by directory name, and you will see output that looks like this:
W, [2020-06-17T08:45:14.412458 #8343] WARN -- : [knapsack_pro] Next request in 2s...
W, [2020-06-17T08:45:16.577365 #8343] WARN -- : [knapsack_pro] #<SocketError: Failed to open TCP connection to api-disabled-for-fork.knapsackpro.com:443 (getaddrinfo: Name or service not known)>
W, [2020-06-17T08:45:16.577552 #8343] WARN -- : [knapsack_pro] Fallback mode started. We could not connect with Knapsack Pro API. Your tests will be executed based on directory names. Read more about fallback mode at https://github.com/KnapsackPro/knapsack_pro-ruby#what-happens-when-knapsack-pro-api-is-not-availablenot-reachable-temporarily
Grouping by directory name is not quite as ideal as grouping by timing, but it beats not parallelizing things at all.
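Here is a hypothetical sketch of a directory-based split that needs no timing data. It is not the gem's actual fallback implementation. It keeps each spec directory together and hands whole directories to the node with the fewest files, largest directory first. Because the file list is sorted before anything else happens, every node computes the same split on its own. Any fallback needs that property, since there is no API to coordinate the nodes. It balances by file count rather than runtime, which is why it is less ideal than a timing-based split.

```ruby
# Hypothetical directory-based split: no timing data, just file paths.
def split_by_directory(files, nodes)
  # Sorting first makes the result identical on every node, whatever the input order.
  by_dir = files.sort.group_by { |file| File.dirname(file) }
  groups = Array.new(nodes) { [] }
  # Place the biggest directories first, each on the node with the fewest files so far.
  by_dir.sort_by { |dir, dir_files| [-dir_files.size, dir] }.each do |_dir, dir_files|
    groups.min_by(&:size).concat(dir_files)
  end
  groups
end

files = %w[
  spec/models/a_spec.rb
  spec/models/b_spec.rb
  spec/models/c_spec.rb
  spec/requests/d_spec.rb
  spec/requests/e_spec.rb
  spec/system/f_spec.rb
]
split_by_directory(files, 2).each_with_index do |group, index|
  puts "node #{index}: #{group.inspect}"
end
```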
I said it above and I will say it again: KnapsackPro is extremely well documented, which makes getting started with it very straightforward. There is a doc for just about every question or scenario you might run into.
One of the big benefits of KnapsackPro is that they give you the option to make your dashboard and test stats public. This is a huge deal for us at DEV because we have so many external contributors. It is amazing when those contributors can access the same data that the core team can.
KnapsackPro is a small company, which means when you send them a support email it goes straight to a real person! No automated response, no bouncing around between different support people with canned responses. You go straight to someone who will be able to help you.
At the start of this, DEV's test suite was in pretty rough shape in terms of flakiness and reliability. It's also worth mentioning that I am a Site Reliability Engineer, not a QA engineer, so I struggled quite a bit getting everything set up. What got me through was the support and help I received from KnapsackPro along the way. Email responses were quick (within 24 hours), and they not only answered my questions but also offered me tips on how to set things up more efficiently than I had.
The end result of all this work is that we now have a test suite that runs in about 10 min! In addition, when we come across a new flaky spec, we can simply retry the one job that failed, instead of having to run the entire suite.
The move also forced us to separate our testing process and our deploy process. Now when a deploy fails for some external reason, we can simply retry the deploy step. Before, we would have to restart the entire build and run the whole test suite again before we could deploy. It was not fun.
Devs are always looking for a better and faster CI, and KnapsackPro is a great tool that can help you get there.
I was not enticed or asked to write this blog post by anyone from KnapsackPro. I know many companies struggle with slow test suites, and I wanted to share how we tackled that problem at DEV in the hope that others can use what we did to solve their own challenges.