Identifying unstable tests in Go

#go #testing #idempotency

Mature Go projects often have a lot of tests, and not all of them are implemented in the best way. Some tests exhibit
slightly different behavior at times and fail randomly.

Such unstable tests are annoying and kill productivity. Identifying and fixing them can have a huge impact on the
quality of the project and on developer experience.

In general, tests can flake in these ways.

Dependency on the environment

Tests would pass on one platform and fail on another. This is a relatively easy type of instability to deal with, because you can reliably reproduce and investigate the issue in a specific environment.
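
As an illustration, here is a minimal sketch of a platform-dependent test. The function configPath and the file names are hypothetical, made up for this example. The first test hard-codes forward slashes and so depends on the operating system, while the second states the same expectation in a portable way.

```go
package main

import (
	"path/filepath"
	"testing"
)

// configPath is a hypothetical function under test: it builds the path of
// a config file inside dir using the OS-specific separator.
func configPath(dir string) string {
	return filepath.Join(dir, "app", "config.yaml")
}

// TestConfigPathHardcoded passes on Linux and macOS but fails on Windows,
// where filepath.Join produces backslashes.
func TestConfigPathHardcoded(t *testing.T) {
	want := "data/app/config.yaml"
	if got := configPath("data"); got != want {
		t.Errorf("configPath() = %q, want %q", got, want)
	}
}

// TestConfigPathPortable converts the expectation to the OS-specific form,
// so it gives the same verdict on every platform.
func TestConfigPathPortable(t *testing.T) {
	want := filepath.FromSlash("data/app/config.yaml")
	if got := configPath("data"); got != want {
		t.Errorf("configPath() = %q, want %q", got, want)
	}
}
```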

Dependency on the initial state

Tests would pass on the first invocation and fail on subsequent invocations. This behavior violates idempotence, which is a desirable property of tests. Usually it is caused by global state that is populated during initialization and then broken by incremental changes from consecutive test runs.

For example, if you rely on a global in-memory cache and assert cache misses during the test, you would only see misses in the first test run.
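
The cache scenario can be sketched roughly as follows. The load function, the cache variable and the key are hypothetical names chosen for illustration. TestLoadMiss is expected to pass on the first iteration and fail on later ones when run with -count 2 or more, because all iterations share one process. TestLoadMissIsolated shows one possible fix: swap in fresh state for the duration of the test and restore the previous state afterwards.

```go
package main

import "testing"

// cache is package-level state shared by every test run in the process.
var cache = map[string]string{}

// load is a hypothetical function under test: it returns the value for key
// and reports whether it was served from the cache.
func load(key string) (value string, hit bool) {
	if v, ok := cache[key]; ok {
		return v, true
	}
	v := "value-of-" + key // stands in for an expensive computation
	cache[key] = v
	return v, false
}

// TestLoadMiss depends on the initial state: the first run leaves "user:1"
// in the global cache, so a repeated run in the same process sees a hit.
func TestLoadMiss(t *testing.T) {
	if _, hit := load("user:1"); hit {
		t.Error("expected a cache miss")
	}
}

// TestLoadMissIsolated replaces the global cache with an empty one and
// restores the previous cache when the test finishes, so repeated runs
// behave the same.
func TestLoadMissIsolated(t *testing.T) {
	saved := cache
	cache = map[string]string{}
	t.Cleanup(func() { cache = saved })

	if _, hit := load("user:1"); hit {
		t.Error("expected a cache miss")
	}
}
```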

Such tests can be identified by running the suite multiple times.

go test -count 5 .

In some cases, this failure on repeated runs is expected and even desirable, but it adds noise to the -count mode of the go test tool and makes it less useful. To identify cases that flake on the first invocation, you would then need to run the test suite multiple times with -count 1 and compare the results.

Dependency on undetermined or random factors

Tests would pass or fail randomly.

Such behavior almost always indicates a bug either in the test or in the code, and it needs a fix. Instability can be
caused by data races, which lead to subtle bugs and data corruption.

Running tests with the race detector may help to expose some issues. Beware that enabling the race detector incurs a noticeable penalty in CPU and memory usage. For example, if you have a test that asserts the number of allocations made by a particular operation, it may fail under the race detector because it allocates more than expected.

Also, the race detector can only find races that actually happen at run time. It works best if you explicitly use concurrency within your test, e.g. by running the same actions in separate concurrent goroutines.

go test -race -count 5 ./...
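
A minimal sketch of a test that exercises code concurrently on purpose is shown below. The counter type is hypothetical and exists only for this example. The test calls the same method from several goroutines so that the race detector has a chance to observe conflicting accesses.

```go
package main

import (
	"sync"
	"testing"
)

// counter is a hypothetical type under test.
type counter struct {
	mu sync.Mutex
	n  int
}

// Inc increments the counter under the mutex.
func (c *counter) Inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
}

// Value returns the current count under the mutex.
func (c *counter) Value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

// TestCounterConcurrent calls Inc from several goroutines at once and then
// checks that no increment was lost.
func TestCounterConcurrent(t *testing.T) {
	const workers = 16
	var (
		c  counter
		wg sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Inc()
		}()
	}
	wg.Wait()

	if got := c.Value(); got != workers {
		t.Errorf("Value() = %d, want %d", got, workers)
	}
}
```

If the mutex were removed from Inc, the concurrent writes to n would be a data race, which go test -race should be able to report for this test. A purely sequential test of the same type would give the detector nothing to observe.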

As with data races, there is no guaranteed way to expose flaky tests that depend on random factors. Running tests
multiple times increases the chance of hitting an abnormal condition, but one can never be sure that all issues have been found.

One data race can affect many test cases and "spam" the test output. This can be solved by grouping data races by the
tails of their stack traces. There is a small tool, teststat, that can group such related races using -race-depth int. The bigger the race depth value, the more race groups are reported.

The go test tool offers machine-readable output with the -json flag. The teststat tool can read such output and
identify racy, flaky or slow tests.

go test -race -json -count 5 ./... | teststat -race-depth 4 -

Another way of using teststat is to run the test suite multiple times and analyze the reports together.
This can help to expose tests that flake on the first invocation.

go test -json ./... > test1.jsonl
go test -json ./... > test2.jsonl
go test -json ./... > test3.jsonl
teststat test1.jsonl test2.jsonl test3.jsonl

The resulting report can be formatted with -markdown to make it more readable as an issue or PR comment.
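
To give a rough idea of what can be done with the -json stream, here is a small sketch that flags tests which both passed and failed across the events it was given. It is not how teststat is implemented. The names testEvent, readEvents and flakyTests are made up for this example, and testEvent declares only the three event fields the sketch needs.

```go
package main

import (
	"encoding/json"
	"io"
	"sort"
)

// testEvent holds the subset of fields of a `go test -json` event that
// this sketch uses.
type testEvent struct {
	Action  string
	Package string
	Test    string
}

// readEvents decodes a stream of JSON events, one object after another,
// as produced by `go test -json`.
func readEvents(r io.Reader) ([]testEvent, error) {
	var events []testEvent
	dec := json.NewDecoder(r)
	for {
		var ev testEvent
		if err := dec.Decode(&ev); err == io.EOF {
			return events, nil
		} else if err != nil {
			return nil, err
		}
		events = append(events, ev)
	}
}

// flakyTests returns the sorted names of tests that have at least one
// "pass" and at least one "fail" event.
func flakyTests(events []testEvent) []string {
	type outcome struct{ passed, failed bool }
	seen := map[string]*outcome{}

	for _, ev := range events {
		if ev.Test == "" {
			continue // package-level event, not tied to a single test
		}
		name := ev.Package + "." + ev.Test
		o := seen[name]
		if o == nil {
			o = &outcome{}
			seen[name] = o
		}
		switch ev.Action {
		case "pass":
			o.passed = true
		case "fail":
			o.failed = true
		}
	}

	var flaky []string
	for name, o := range seen {
		if o.passed && o.failed {
			flaky = append(flaky, name)
		}
	}
	sort.Strings(flaky)
	return flaky
}
```

A main function could pass os.Stdin to readEvents and print the result of flakyTests, with the output of several runs piped or concatenated into it.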

Slow tests

Tests that take a long time to run are not unstable, but they are annoying and wasteful in the long run.

Slow tests are often caused by timeout-driven orchestration of concurrent operations. Such an approach seems simpler to implement than the more performant channel- or event-based orchestration, but those sleeps add up and make the test suite slow.
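
The difference can be sketched as follows, assuming a hypothetical compute function that does its work in the background and delivers the result on a channel. The sleep-based test always pays for the full sleep, while the channel-based test finishes as soon as the result is ready and uses a timeout only as a safety net.

```go
package main

import (
	"testing"
	"time"
)

// compute is a hypothetical function under test: it does its work in a
// goroutine and delivers the result on the returned channel.
func compute() <-chan int {
	out := make(chan int, 1)
	go func() {
		time.Sleep(10 * time.Millisecond) // stands in for real work
		out <- 42
	}()
	return out
}

// TestComputeSleep waits a fixed amount of time and hopes the work is
// done by then. It always takes a full second, however fast compute is.
func TestComputeSleep(t *testing.T) {
	res := compute()
	time.Sleep(time.Second)

	select {
	case got := <-res:
		if got != 42 {
			t.Errorf("got %d, want 42", got)
		}
	default:
		t.Fatal("result was not ready after sleeping")
	}
}

// TestComputeEventDriven blocks on the channel, so it takes only as long
// as the work itself. The timeout fires only if the worker hangs.
func TestComputeEventDriven(t *testing.T) {
	select {
	case got := <-compute():
		if got != 42 {
			t.Errorf("got %d, want 42", got)
		}
	case <-time.After(5 * time.Second):
		t.Fatal("timed out waiting for result")
	}
}
```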

You can use teststat to report the slowest test cases.