This article is a Japanese -> English translation (thank you DeepL!) of the following post (with some additional messages):
I'm a programmer and one of the full-time committers of the Ruby interpreter (Homepage of Koichi Sasada) at Cookpad Inc. We hope you enjoy the recently released Ruby 3.2.
The Ruby interpreter is a complex program, so it naturally has bugs, and the Ruby interpreter developers take various countermeasures against them. For example, we write tests and run them in CI environments (this depends on daily maintenance of the test infrastructure, such as RubyCI, chkbuild, ruby/spec (The Ruby Spec Suite), and the machines they run on).
In addition to this, I have a group of machines on which I personally run a large number of test iterations. The purpose is to improve the quality of the Ruby interpreter by running the tests as often as possible to find bugs that occur only occasionally. In this article, I'd like to introduce this uncommon test environment.
To prepare this test environment, I have received support from various people. In this article I would like to express my gratitude for their support.
In a typical CI/CD context, you run tests on each commit (PR), because if a test fails, you know the problem is in that commit. GitHub Actions and the like often target such tests.
For this purpose, the Ruby interpreter core team prepares and uses the following test environments.
1. Per-PR and per-push test environment with GitHub Actions
2. Periodic test environment with chkbuild (the results are available on rubyci).
Both 1 and 2 are basically mechanisms to check for problems with the last modification.
To get accurate test results, chkbuild builds a Ruby interpreter from scratch and runs the tests every time, under various operating systems and CPU architectures. However, since this takes time, it is executed about once every two hours on each machine.
Currently, most of the compute resources are built on AWS with support from the Ruby Association and other organizations. GitHub Actions is provided by GitHub.
Speaking of which, Shopify has been kind enough to test their (presumably huge) application with a development version of Ruby. That's very helpful.
Ufuk Kayserilioglu: "So, throughout this year we've been investing into the stability of Ruby. Starting early September 2022, we increased our efforts even more by running a Ruby HEAD checkout every hour on our Core monolith CI. We also setup an early access ARM cluster to test the new YJIT backend." (23:26 PM - 21 Dec 2022)
debug versions of setup-ruby
If you have projects whose tests run on GitHub Actions with setup-ruby, please consider adding the following Ruby versions:
- head: nightly build
- debug: nightly build with assertions
If you find any strange behavior, please feel free to report it at https://bugs.ruby-lang.org/issues/. This kind of contribution is very helpful for us.
When you have software as big as the Ruby interpreter, you may encounter tests that sometimes fail even though nothing has changed. Or there may have been a recent change, but the test fails for reasons unrelated to that change. This is sometimes called a flaky test. There are several possible reasons for this.
1. Bad tests.
2. Failures caused by external factors such as "time" and "system status".
3. Bugs that are already in the code but only appear when you're unlucky (or lucky) enough to hit them.
The most common case is "1. Bad tests". For example, a test that depends on the execution order of the tests can cause problems at any moment. Timing-sensitive tests can be slightly off and sometimes fail (or fail because of a change in machine spec).
Failures caused by external factors (#2) can sometimes also be regarded as bad tests. For example, in the case described in "ruby/zlib tests started to fail even though I didn't do anything" on @znz's blog (written in Japanese), a timestamp produced specific data at a specific time and the tests failed (the tests were fixed, which solved the problem).
Well, the above are in the category of "bad tests", so they are not directly related to the quality of the interpreter itself. However, if we leave them unfixed, we'll have a hard time checking the test results, so we need to fix them as soon as possible. It's the Broken Windows Theory.
Even a bug that appears only once in 10,000 runs may be hit once a day in software with 10,000 users a day. Or rather, it will be hit. In the worse case, it may become a source of vulnerabilities.
This kind of bug is likely to appear in the following situations:
- Automatic memory management (GC)
- Algorithms that use caches
- Parallel and concurrent executions
- Networks and other external systems
All of these are areas (and areas I work with a lot) where it's easy to bring in non-deterministic behavior, i.e., behavior that doesn't produce the same result even if you run it twice. There are many other things that can cause a "Hey, that's not the same result as before?" situation, such as memory address randomization by the system.
There are many ways to tackle this, but we use the method of "just try many times". It's simple: if a bug appears once in every 10,000 runs, then running the tests 10,000 times gives a good chance of reproducing it.
In other words: "Bugs that rarely appear are already in the code → the more test runs we do, the higher the probability of stepping on such a bug (and revealing it)."
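The intuition can be made slightly more precise. Assuming each run triggers the bug independently with probability p (an idealization, since real runs share machines and state), the chance of seeing it at least once in n runs is 1 - (1 - p)^n. A minimal sketch follows; the function names are made up for illustration.

```ruby
# Probability of hitting a bug at least once in `runs` independent runs,
# when each run triggers it with probability `p`.
def detection_probability(p, runs)
  1.0 - (1.0 - p)**runs
end

# Number of runs needed to reach a target detection probability.
def runs_needed(p, target)
  (Math.log(1.0 - target) / Math.log(1.0 - p)).ceil
end

p_bug = 1.0 / 10_000 # the "once in 10,000 runs" bug from the text

[1_000, 10_000, 30_000, 50_000].each do |n|
  printf("%6d runs: %.1f%%\n", n, detection_probability(p_bug, n) * 100)
end
puts "runs needed for 99%: #{runs_needed(p_bug, 0.99)}"
```

Under this model, 10,000 runs catch a 1-in-10,000 bug with a probability of roughly 63%, and about 46,000 runs are needed to reach 99%. This is why the number of runs per day matters so much.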
The chkbuild I mentioned earlier runs about 12 times a day (multiplied by the number of environments), and GitHub Actions runs only on each event, so neither can be said to "run a lot". So I made my own test environment and have been running it for about 5 years.
It originally started when I got fed up with "occasional" bugs while debugging a newly designed GC, and ran tests all the time on a single machine.
When I first started, I used the command `while make up all test-all; do date; done` to run the tests indefinitely (stopping if they failed). However, with this I had to look at the terminal to check the results, and I could not tell when the loop had stopped unintentionally. It is also hard to scale, so I had to build my own test environment.
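As an illustration of the first step beyond the bare shell loop, the sketch below repeats a command and appends one timestamped line per run to a log, so the history survives after the terminal is gone. It is not the code of the actual system; the method name and the log format are assumptions.

```ruby
require "time"

# Repeats the given block until it returns a falsy value or `max_runs` is
# reached, appending one timestamped line per run to `log`.
# Returns the number of successful runs.
def run_until_failure(log:, max_runs: Float::INFINITY)
  successes = 0
  while successes < max_runs
    started = Time.now
    ok = yield
    elapsed = Time.now - started
    log.puts format("%s run=%d %s %.1fs",
                    started.iso8601, successes + 1, ok ? "OK" : "FAIL", elapsed)
    log.flush
    break unless ok
    successes += 1
  end
  successes
end

# Usage (hypothetical log file name), equivalent to the shell loop above:
#   File.open("loop.log", "a") do |log|
#     run_until_failure(log: log) { system("make", "up", "all", "test-all") }
#   end
```

Because the log has one line per run, a stalled loop shows up as a log whose last timestamp is too old, which can be checked without watching the terminal.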
In order to run a lot of tests, we devised the following:
- Use multiple machines (scale-out)
- Use machines with better performance (scale-up)
- Run multiple tests simultaneously on a single machine to use up hardware resources
- Shorten the time of a single build and test run
Here is an introduction to each of them.
If we had enough money, the most reliable (and, for the savvy, easiest) way to scale out would be to prepare lots of machines in the cloud, but since this is a private activity, there is a limit to how much money I can provide. Also, workloads like this that use up computing resources are not well suited to cheap cloud services.
Since I have some space at home, I'm currently hosting the actual machines in a temporary location (this activity will likely end as the children grow older and need their own rooms. Please remember that Japanese homes don't have much space for multiple computers).
I've been staring at the price list of AWS and other services, but using real machines is the cheapest... (I wonder if it will be cheaper if I find a discount plan). I'm grateful that I can buy a nice little machine with 8 cores 16 threads for just under 100,000 JPY (about 750 USD). Currently, I'm running 4 machines.
- 1 machine: bought 6 years ago
- 1 machine: ThinkCentre M75q Gen 2, about 60,000 JPY (donated by Garnet Tech 373 Inc.; see "Garnet Tech 373 Inc. provides development machines to support Ruby interpreter development", written in Japanese)
- 2 machines: MINISFORUM EliteMini HX90, 2 x about 80,000 JPY (purchased with proceeds from GitHub Sponsors (Sponsor @ko1 on GitHub Sponsors))
All the new machines are small. I used to have a mid-tower machine in my lineup, but it was getting in the way... I bought the HX90s on the last Black Friday because the price was a bit lower.
Test run times correlated nicely with CPU frequency. The faster the better.
If you run one test suite, 2GB of memory seems to be enough even when the build and the tests run in parallel (this surprised me).
I have an electricity meter attached, and when I look at it, the total goes up and down around 400 W. According to the Tokyo Electric Power Company's Standard Plan rates, that's just under 10,000 JPY per month. I'm partially compensated by the earnings from GitHub Sponsors.
(By the way, this electricity bill includes three Mac minis used for rubyci/chkbuild, which I introduced earlier; the Mac minis were purchased with the support of Nihon Ruby-no-Kai).
It's fine now because it is winter, but during the hot months (since I didn't turn on the A/C) the fans were making a lot of noise, and I was worried about a fire. So far, the machines have run for more than a year, even in continuous operation. But after 5 years, 2 of the mid-tower machines broke. The smaller machines seem to have a shorter life span.
The machine cost (leaving aside the old one) is 220,000 JPY, and if it is depreciated over 3 years, that is about 70,000 JPY/year. Electricity costs roughly 120,000 JPY/year. In other words, about 200,000 JPY/year (about 1,500 USD/year). It is cheap because there is no cost for the location or for maintenance personnel. And I don't need an SLA, because it doesn't matter much if the system goes down. Well, arranging machines at home is a good hobby.
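The yearly figure can be checked with a few lines of arithmetic. The sketch below only uses numbers stated in the text (three machines at about 60,000 + 2 x 80,000 JPY, 3-year depreciation, roughly 10,000 JPY of electricity per month); the method name is made up, and the JPY to USD rate is the one implied by the text's "100,000 JPY (about 750 USD)".

```ruby
# Yearly running cost in JPY from hardware prices and monthly electricity.
def yearly_cost(machine_prices:, depreciation_years:, electricity_per_month:)
  hardware_total = machine_prices.sum
  hardware_per_year = hardware_total / depreciation_years.to_f
  electricity_per_year = electricity_per_month * 12
  {
    hardware_total: hardware_total,
    hardware_per_year: hardware_per_year,
    electricity_per_year: electricity_per_year,
    total_per_year: hardware_per_year + electricity_per_year
  }
end

cost = yearly_cost(
  machine_prices: [60_000, 80_000, 80_000], # M75q Gen 2 and two HX90s
  depreciation_years: 3,
  electricity_per_month: 10_000             # "just under 10,000 JPY", rounded up
)

usd_per_jpy = 750.0 / 100_000 # rate implied by the text, not a current rate

puts "hardware total:   #{cost[:hardware_total]} JPY"
puts "hardware / year:  #{cost[:hardware_per_year].round} JPY"
puts "electricity/year: #{cost[:electricity_per_year]} JPY"
puts "total / year:     #{cost[:total_per_year].round} JPY " \
     "(about #{(cost[:total_per_year] * usd_per_jpy).round} USD)"
```

The unrounded total comes out a little under 200,000 JPY per year, which matches the rounded figures in the paragraph above.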
As a side note, one of the reasons I run physical machines at home is benchmarking. If you use a machine in the cloud, you may have to mess up your instance, so I prefer to use a physical machine as much as possible. For example, the machine on https://rubybench.github.io/ is a machine hosted at my home (this machine is also provided by Ruby-no-kai Japan, thank you very much). When I need to do serious benchmarking of new features, I stop running tests and use these machines for benchmarking (because sometimes I need more than one machine for benchmarking).
One way to increase the number of test trials is to launch multiple processes running the test suite on a single machine.
When a test suite is executed, there are times when it consumes resources and times when it is idle, so the idea is to improve the overall throughput by running another test process while one test process is not busy. However, if the number of concurrent test processes is too large, the overall performance may deteriorate because of resource contention.
Simply running multiple test processes sometimes interfered with each other (e.g. filesystem or network ports), so through some trial and error I figured out that it was ok to tweak some settings in the Docker container. I now have 22 Docker containers running the test suite simultaneously on a single machine (build-ruby/run_sp2.rb at master ・ ko1/build-ruby). The memory is 32GB, which is enough (but I gave up the RAM disk, which will be described later).
In order to reduce the time it takes to build the latest version of Ruby and run through the entire test suite, we've devised the following:
- Reuse compile results etc.
- Using a RAM disk
- Concurrent build and test processing
The test runs on rubyci.org do not reuse any compile results or the like, to ensure accurate test results. In this environment, however, the goal is to get a larger number of test runs, so I reuse the compiled results aggressively. Since reuse sometimes causes problems of its own, if a run fails twice, the system erases all the compiled results and builds from scratch.
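The "reuse aggressively, but start over after two failures" rule can be sketched as a small state machine. This is an illustration rather than the code in ko1/build-ruby: the class and method names are hypothetical, and it assumes "fails twice" means two consecutive failures.

```ruby
# Tracks consecutive failures and decides when cached build results
# should be thrown away. Names are hypothetical.
class BuildCachePolicy
  def initialize(clean_after: 2)
    @clean_after = clean_after
    @consecutive_failures = 0
  end

  # Records the outcome of one build-and-test cycle.
  # Returns :clean_build if the next cycle should start from scratch,
  # otherwise :incremental_build.
  def record(success)
    if success
      @consecutive_failures = 0
      return :incremental_build
    end

    @consecutive_failures += 1
    if @consecutive_failures >= @clean_after
      @consecutive_failures = 0 # the cache is about to be wiped
      :clean_build
    else
      :incremental_build
    end
  end
end

policy = BuildCachePolicy.new
[true, false, true, false, false, true].each do |ok|
  puts "#{ok ? 'pass' : 'fail'} -> next: #{policy.record(ok)}"
end
```

A single failure keeps the cache, because it is more likely to be a real or flaky test failure than a stale build; only a repeated failure pays the cost of a full rebuild.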
In environments with relatively spare memory, I use RAM disks (tmpfs) for all build results to speed up the build a bit. However, it's not clear how well this works, because the OS already caches file data in memory on its own. It's just a feeling that it's faster.
The build is parallelized with `make -jN`. 10 years ago, there were quite a few bugs caused by this, but now we can build in parallel with almost no problem.
Running tests in parallel means splitting up the test suite and running the parts in parallel. There are roughly three groups of Ruby tests to run in this environment, one of which has long supported parallel execution (`make test-all`). I rewrote one more group (`make btest`) to make it parallelizable, in order to increase the number of runs.
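The general idea of splitting a suite across workers can be sketched with a shared queue. This is not how `make test-all` or `make btest` are implemented; it is a generic illustration with a made-up method name, in which each worker thread pulls the next job as soon as it is free.

```ruby
# Runs each item in `jobs` through the given block using `workers` threads
# that pull work from a shared queue. Returns a hash of job => result.
def run_in_parallel(jobs, workers:)
  queue = Queue.new
  jobs.each { |job| queue << job }
  queue.close # pop returns nil once the closed queue is drained

  results = {}
  lock = Mutex.new
  threads = Array.new(workers) do
    Thread.new do
      while (job = queue.pop)
        outcome = yield(job)
        lock.synchronize { results[job] = outcome }
      end
    end
  end
  threads.each(&:join)
  results
end

# Usage (hypothetical file layout): each job spawns its own test process,
# so the threads here only wait on child processes.
#   run_in_parallel(Dir["test/**/test_*.rb"], workers: 8) do |file|
#     system("ruby", file)
#   end
```

Pulling from a queue, rather than assigning a fixed slice of files to each worker up front, keeps all workers busy when individual test files take very different amounts of time.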
With these efforts, on a fast, dedicated machine, one cycle of "build the latest version -> run tests" can be done in less than 2 minutes. In other words, because we always fetch the latest version of Ruby from the repository for testing, if you make a problematic commit that causes the tests to fail, you'll get a failure notification in as little as two minutes (the result is sent to a Slack channel).
Even if a bug is introduced, it cannot be detected if there is no code that steps on that bug. This is why extensive testing is needed. Ruby already has a large set of tests.
Also, the Ruby interpreter source code contains a lot of assertions (statements of what should happen at this point in the program). You can think of this as a kind of test. In the parts I code, I try to increase the number of such assertions so that we can detect strange states.
Many of these assertions are only enabled for checking in debug builds. For this reason, we use debug builds to run them in some of the environments we run in.
For testing, ideally, it would be nice to bring a prominent app or library and run its tests on the latest development version of Ruby, but I haven't gotten around to that.
The tests we run are not all the same, but we try to find bugs by running the tests in different patterns.
- Run the test with a Ruby interpreter built with various parameters.
- Run tests with different versions of the build environment (compiler).
- Run the tests in a random order. For example, the method cache status changes depending on the order in which the tests are executed, so there may be bugs to be found there.
- Repeatedly run the test. Similarly, repeating the same test may change the status of the method cache.
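Random ordering is most useful when a failing order can be replayed. One common way to get that, shown here as a sketch and not as a description of Ruby's own test runner, is to derive the order from a seed and log the seed with every run. The test names below are hypothetical.

```ruby
# Returns the tests in a pseudo-random order determined entirely by `seed`,
# so that a failing order can be replayed later with the same seed.
def shuffled_order(tests, seed)
  tests.shuffle(random: Random.new(seed))
end

tests = %w[test_gc test_method_cache test_thread test_io test_string]
seed = Random.new_seed

puts "seed=#{seed}" # log the seed so the order can be reproduced
puts shuffled_order(tests, seed).inspect
```

Running `shuffled_order(tests, seed)` again with the logged seed yields the same order, which turns "it failed once in some random order" into a case that can be investigated.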
I wrote software to build Ruby and run tests according to a configuration (ko1/build-ruby: Build Ruby from source code). For example, here is a list of settings: https://github.com/ko1/build-ruby/blob/master/docker/ruby/targets.yaml
We have devised several ways to deal with unforeseen problems.
- Recording of all execution logs
- Allow configurable timeouts to prevent infinite stoppages
- When a timeout occurs, gdb dumps the backtrace of the related process
- In case of an abnormal exit that spits out core, you can download the core
- If the failures continue, delete all the data, increase the execution interval, and so on.
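The timeout handling in the list above can be sketched as follows. This is a minimal illustration, not the code of the actual system: the method name is hypothetical, and the backtrace dump is left as a comment because the exact gdb invocation is not given in the text.

```ruby
require "timeout"

# Runs `cmd` (an argv array) and returns :ok, :failed, or :timeout.
def run_with_timeout(cmd, timeout_sec)
  pid = Process.spawn(*cmd)
  begin
    _, status = Timeout.timeout(timeout_sec) { Process.wait2(pid) }
    status.success? ? :ok : :failed
  rescue Timeout::Error
    # A real harness would dump backtraces of the stuck process here
    # (the text uses gdb for this) before killing it.
    Process.kill("KILL", pid)
    Process.wait(pid) # reap the child so it does not linger as a zombie
    :timeout
  end
end

# Usage (the one-hour limit is an arbitrary example value):
#   result = run_with_timeout(%w[make test-all], 60 * 60)
#   warn "test run ended with #{result}" unless result == :ok
```

Distinguishing `:timeout` from `:failed` matters for triage: a timeout usually points at a hang or deadlock, where a backtrace is far more useful than the test output.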
However, when a test fails, we often don't know the cause after all. We would like to devise a little more.
I made a site ci.rvm.jp to aggregate the test results. The DB is SQLite3 because the number of viewers is limited (so it is slow). It's a really slow server, so I don't even link to it.
I'm trying to make the output to stderr visible on the summary page of the execution results so that it's easy to understand what went wrong when you look at a failure page (but other CI sites already seem to do this).
When a test fails, there is a Slack notification (for the Ruby committers to see) and an email notification (just to me). In the rare case where a commit breaks the tests, the flood of notifications is terrible.
There are many possible methods for testing non-deterministic behavior, especially in academic research.
For example, one approach is to make all external events deterministic (external events such as input/output and thread scheduling). In other words, you make sure that the same program (with the same external input) always returns the same result by using a variety of techniques. Once a problem is found, being able to reproduce it every time should make things easier. I've heard a lot about this at the research level, but I wonder how practical it is.
We can also think of methods such as using formal methods to automatically generate exhaustive tests and data that makes it easier to cover. It would be cool to be able to do this kind of thing.
At first I did a lot of hard work fixing problems, because with thousands of runs every day there was quite a flurry of failures. Much of that work went into the tests themselves, because many of the failures were flaky tests.
I was also able to fix some bugs caused by timing. Here are the patches I have in my notes.
- fix marking T_NONE object bug. · ruby/ruby@4c9f3ce
- fix passing wrong passed_bmethod_me. · ruby/ruby@3cb6952
In this article, I introduced my personal activity of finding rare bugs by increasing the number of test runs, in order to improve the quality of the Ruby interpreter.
Bugs are always present in programs, and it is hard to find bugs in large, complex programs. In this article, I introduced some of the trial-and-error process. I wish I could take a more scientific approach. If you know a good method, please let me know.
As I mentioned in the article, this system is made possible by a lot of support. Especially GitHub Sponsors was important to continue this activity. Thank you again.
As for the machines, a few years ago a certain company gave me three big rack-mount machines with 3-digit GB of memory that they no longer needed, and I installed them at another company, N. I have been operating my machines together with these (Mr. S of company N took care of the machine operation for a long time). The other day, these machines were retired because they were indeed old, so I wrote this article as a memorial and in gratitude. Thank you very much.
Well, I wish you a happy New Year, and enjoy the newly released Ruby 3.2!