Yarden Porat for Codux

Posted on Jan 26, 2023

Flaky Tests, and How to Deal with Them

#watercooler

Intro

Hey! My name is Yarden Porat, and in this article, I will explain what flaky tests are, what causes them, what they cost, and how they harm your work and organization. Once we have that figured out, I will share the strategy and tools we developed in-house for dealing with test flakiness at Wix, and how we avoid its costs.

What is a flaky test?

A flaky test is an automated test with a non-deterministic result. This is a way of saying that a test sometimes passes and sometimes doesn’t, inconsistently, without any code changes.

How often does it fail?

As beautifully depicted in this article, if a single test has a failure rate of 0.05% (0.0005) and you have 100 such tests in your test suite, the suite's success rate is 0.9995¹⁰⁰ ≈ 95.12%.

But what happens when you have a thousand of these tests? The success rate drops to about 60.65% (0.9995¹⁰⁰⁰). Even a low per-test failure rate has a significant impact on large, heavily tested applications.
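To play with these numbers yourself, here is a minimal sketch of the calculation. It assumes every test fails independently and with the same probability, which is a simplification; the function names are made up for this example, and the last digit may differ slightly from quoted figures depending on rounding.

```typescript
// Probability that a whole suite passes, assuming every test fails
// independently with the same probability (an assumption of this sketch).
function suiteSuccessRate(testFailureRate: number, testCount: number): number {
  return Math.pow(1 - testFailureRate, testCount);
}

// Formats a 0..1 rate as a percentage with two decimals.
function asPercent(rate: number): string {
  return (rate * 100).toFixed(2) + "%";
}

console.log(asPercent(suiteSuccessRate(0.0005, 100)));  // 95.12%
console.log(asPercent(suiteSuccessRate(0.0005, 1000))); // 60.65%
```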

But… What’s the problem? Just rerun the tests!

There are some really bad implications of ignoring flaky tests. Let's go over some of the most common ones, from least important to most.

1. Wasted CI minutes (Hours? Days? Weeks?)

Consider the following scenario:

You are a developer working in a team. There’s a new feature you’ve been developing for several days, and you opened a pull request wanting to merge it into the project.

Now, your company works in a modern development workflow and runs automated tests on your code changes using your CI system. All of the product’s tests ran and failed on a test entirely unrelated to the changes you introduced.

Since you know exactly what you changed, and you know that this project has issues with non-deterministic tests, you conclude that the failing test is not your fault.

So, you rerun the test, and it passes.

If you don’t know your project has an issue with non-deterministic tests, you’ll probably waste even more time investigating.

The problem is that this time accumulates. The longer the test workflows take, the more time is wasted—but how much time? This can be measured, assuming you track your CI test results.

You can easily calculate the CI time wasted due to flakiness by summing up the CI time of a workflow run that had a non-deterministic result. For example:

#	commit	workflow_name	os	result	duration (min)
1	9f3e679	test-part-1	linux	success	10
2	9f3e679	test-part-1	linux	fail	7

Run identifier: commit + workflow_name + os

That’s 7 minutes of CI time wasted!
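The calculation above can be sketched roughly as follows. This is one possible reading of it, not our actual tooling: failed runs are counted as wasted only when the same run identifier also has a successful run. The `WorkflowRun` shape and its field names are assumptions made for this example.

```typescript
// Hypothetical record shape; the field names are assumptions for this sketch.
interface WorkflowRun {
  commit: string;
  workflowName: string;
  os: string;
  result: "success" | "fail";
  durationMinutes: number;
}

// Sums the duration of failed runs whose run identifier
// (commit + workflow name + OS) also has a successful run.
function wastedCiMinutes(runs: WorkflowRun[]): number {
  const runsById: Record<string, WorkflowRun[]> = {};
  for (const run of runs) {
    const id = [run.commit, run.workflowName, run.os].join("|");
    (runsById[id] = runsById[id] || []).push(run);
  }

  let wasted = 0;
  for (const id of Object.keys(runsById)) {
    const group = runsById[id];
    const hasSuccess = group.some((run) => run.result === "success");
    if (!hasSuccess) continue; // consistently failing: a real failure, not flakiness
    for (const run of group) {
      if (run.result === "fail") wasted += run.durationMinutes;
    }
  }
  return wasted;
}

const exampleRuns: WorkflowRun[] = [
  { commit: "9f3e679", workflowName: "test-part-1", os: "linux", result: "success", durationMinutes: 10 },
  { commit: "9f3e679", workflowName: "test-part-1", os: "linux", result: "fail", durationMinutes: 7 },
];
console.log(wastedCiMinutes(exampleRuns)); // 7
```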

2. Wasted development time

When a developer reruns a test, they are forced to wait (again) for the build and test time. Precious developer time is being lost.

Even if we assume that a developer utilizes this wait time for other tasks, we still have 2 major drawbacks:
Loss of immediate feedback (long feedback loop).
Context switching—which eats away at focus and productivity.

Unfortunately, this wasted time is much harder to measure.

3. Flaky product behavior (or flaky implementation)

Sometimes a flaky test is only a symptom of a non-deterministic implementation.

The same race condition that can cause test flakiness can do the same in a feature’s implementation, thus causing flaky product behavior in production.

4. Alert fatigue

A common phenomenon in a flaky test workflow is the loss of trust in the feedback you are getting. Consider this scenario:

1. You push your new code.
2. The workflow tests run and fail.
3. “Oh, it's that annoying flakiness again; we should fix it sometime.”
4. You rerun the workflow; the tests run and fail.
5. “D*%N FLAKINESS”
6. You rerun the workflow; the tests run and fail.
7. You realize that it was actually your code changes that failed the tests.
8. Go back to step 1.

This harms development velocity and the developer experience. In an environment where tests are not required to pass before a pull request is merged, it is not uncommon for changes to be merged even though they break tested product behavior.

What's lost?

Money (Developer time, CI time)
Development velocity
Confidence in tests (regressing to manual testing)
Product quality
Developer experience

Causes of test flakiness

So now that we know the price and the pain, here are some of the causes of test flakiness.

1. Poorly written test code

For example, interacting with DOM elements that are not yet ready, or improper use of waitFor functions. These are the most common cases of tests being written incorrectly. Sometimes powerful development machines (a.k.a. your local computer) hide race conditions in a test, which then fails on CI machines.

2. Poorly written application code

As mentioned above, sometimes the application code itself introduces flaky behavior. These cases are much harder to detect and debug. The cause could be related to communications, asynchronous code, or many other factors.

3. Infrastructural causes

There are various environmental causes to blame, and they are usually the first thing blamed by those who write flaky tests. Such causes may be:

Network issues: loss of connectivity, slow connection, etc.
Hardware issues: low-performance shared virtual machines, which stress existing race conditions.
External dependencies: package managers (npm/yarn), runtime setup (e.g. Node), and other dependencies, which also suffer from some level of flakiness.

4. Test tools that are prone to flakiness

In our experience, tests that use a browser are more prone to flakiness. One reason is that the browser itself is a complex piece of software with many dependencies, and it can be affected by a variety of factors: its version, the operating system, and other specific configurations of the machine it is running on.

Key takeaways up to this point

Here are some points I think you should keep in mind:

Flaky tests can have many causes. Some are test related, some are production-code related, and others come from the infrastructure and development environment.
They have direct and indirect implications on the development process—both technical and psychological.
Flaky tests reduce development speed and quality if left untreated.

How to deal with flaky tests?

1. Collect data

It is much easier to communicate the costs of flakiness to your team or organization if you have data to back you up.

2. Analyze it

Workflow reruns per day bar graph

This bar graph represents the overall flakiness by displaying the total number of times a workflow has been rerun.
It helps us understand the scale of the flakiness problem and the lack of developer trust in the tests/CI.

At Codux we chose to count any case of workflow rerun, but you can also create a subset of this graph that shows reruns that never succeeded, which could better depict the lack of trust in your tests/CI.

This is a general index that tells you whether your data correlates with your general feeling of flakiness. We don’t derive tasks from it.

We count reruns by identifying the commit, branch, OS, and workflow name. We call this combination an “entity” and count its total occurrences minus one.
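As a rough illustration of that counting rule, here is a minimal sketch. The `RunRecord` shape is an assumption made for this example; to get a per-day bar, you would feed the function the runs of a single day.

```typescript
// Hypothetical record shape; the field names are assumptions for this sketch.
interface RunRecord {
  commit: string;
  branch: string;
  os: string;
  workflowName: string;
}

// Counts reruns: every occurrence of an entity after its first one.
// This equals "total occurrences minus one", summed over all entities.
function countReruns(runs: RunRecord[]): number {
  const seen: Record<string, true> = {};
  let reruns = 0;
  for (const run of runs) {
    const entity = [run.commit, run.branch, run.os, run.workflowName].join("|");
    if (seen[entity]) {
      reruns += 1;
    } else {
      seen[entity] = true;
    }
  }
  return reruns;
}
```

Counting only the reruns that never succeeded would need the run result as an extra field and a filter on entities without any successful run.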

Fail rate table

This is a table that calculates a test’s fail rate out of its total runs. We collect data from all branches, including development branches, and present only tests that have failed on 3 branches or more, with a minimum number of total runs.

This table helps us find the current culprit tests. A flaky test whose fail rate exceeds a threshold of your choice (we chose 5%) is skipped, documented, and assigned to the relevant developer. This process occurs 1-2 times a week.

This process requires reasoning and shouldn’t, in our opinion, be done automatically — for example:

Some features have a small number of tests, so you probably wouldn’t want to lose coverage by skipping them; you might prefer to add a retry to those tests instead.
Some tests, such as end-to-end tests, are more prone to failure during development, so the table might show a higher fail rate than they actually have.
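The fail rate table described in this section could be computed along these lines. This is only a sketch: the `TestResult` shape is an assumption, and the default of 20 minimum runs is a placeholder picked for the example (only the "3 branches or more" rule comes from the text above). Deciding what to do with each row is still left to a human.

```typescript
// Hypothetical record shape; the field names are assumptions for this sketch.
interface TestResult {
  testName: string;
  branch: string;
  passed: boolean;
}

interface FailRateRow {
  testName: string;
  totalRuns: number;
  failures: number;
  failingBranches: number;
  failRate: number; // failures / totalRuns, between 0 and 1
}

interface TestStats {
  totalRuns: number;
  failures: number;
  branches: string[]; // distinct branches on which the test failed
}

function failRateTable(
  results: TestResult[],
  minFailingBranches = 3,
  minTotalRuns = 20 // placeholder value, tune it to your own data
): FailRateRow[] {
  const statsByTest: Record<string, TestStats | undefined> = Object.create(null);
  for (const result of results) {
    let stats = statsByTest[result.testName];
    if (!stats) {
      stats = { totalRuns: 0, failures: 0, branches: [] };
      statsByTest[result.testName] = stats;
    }
    stats.totalRuns += 1;
    if (!result.passed) {
      stats.failures += 1;
      if (stats.branches.indexOf(result.branch) === -1) stats.branches.push(result.branch);
    }
  }

  const rows: FailRateRow[] = [];
  for (const testName of Object.keys(statsByTest)) {
    const stats = statsByTest[testName];
    if (!stats) continue;
    if (stats.branches.length < minFailingBranches) continue;
    if (stats.totalRuns < minTotalRuns) continue;
    rows.push({
      testName,
      totalRuns: stats.totalRuns,
      failures: stats.failures,
      failingBranches: stats.branches.length,
      failRate: stats.failures / stats.totalRuns,
    });
  }
  // Worst offenders first.
  return rows.sort((a, b) => b.failRate - a.failRate);
}
```

Rows with a `failRate` above your chosen threshold (0.05 in our case) are the candidates to review.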

Fail by test scatter plot (Environmental factors)

We’ve created a plot similar to Spotify’s Odeneye. This plot helps us see whether there are environmental or infrastructural problems. If you suspect your infrastructure is causing flakiness, try creating this dashboard.

Horizontal lines indicate that a test is flaky. Vertical lines indicate an issue external to the test, because they show multiple test failures in the same timeframe.
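If you don't have a dashboard yet, the "vertical lines" can also be approximated in code. The sketch below is an illustration rather than what our plot does: it groups failures into fixed time windows and reports the windows in which many different tests failed. The `TestFailure` shape, the window size, and the threshold are all assumptions.

```typescript
// Hypothetical record shape; the field names are assumptions for this sketch.
interface TestFailure {
  testName: string;
  timestampMs: number;
}

// Returns the start times of windows in which at least `minDistinctTests`
// different tests failed, which hints at a cause external to any single test.
function windowsWithManyFailingTests(
  failures: TestFailure[],
  windowMs: number,
  minDistinctTests: number
): number[] {
  const testsByWindow: Record<string, string[]> = {};
  for (const failure of failures) {
    const windowStart = String(Math.floor(failure.timestampMs / windowMs) * windowMs);
    const tests = (testsByWindow[windowStart] = testsByWindow[windowStart] || []);
    if (tests.indexOf(failure.testName) === -1) tests.push(failure.testName);
  }
  return Object.keys(testsByWindow)
    .filter((windowStart) => testsByWindow[windowStart].length >= minDistinctTests)
    .map(Number)
    .sort((a, b) => a - b);
}
```

A test that shows up alone in many different windows corresponds to a "horizontal line" instead.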

3. Run new tests multiple times

After noticing that newly created tests were often flaky and required adjustments, we decided to raise the bar for new tests and created “check-new-flaky”, a CLI tool that detects new tests and runs them multiple times.

It detects new tests by running our test runner (mocha) programmatically, recursively extracting test names on the branch, and comparing them to master.
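The detection step can be sketched as a recursive walk over a suite tree followed by a name comparison. This is not the check-new-flaky source: the `SuiteLike` shape is a stand-in loosely modelled on a Mocha suite (`title`, `tests`, `suites`), and loading the suites for each branch is left out.

```typescript
// Minimal stand-in for a test-runner suite tree (an assumption of this sketch).
interface SuiteLike {
  title: string;
  tests: { title: string }[];
  suites: SuiteLike[];
}

// Recursively collects full test names such as "editor saves a file".
function collectTestNames(suite: SuiteLike, parentTitle = ""): string[] {
  const suiteTitle = [parentTitle, suite.title].filter((part) => part !== "").join(" ");
  const names = suite.tests.map((test) =>
    [suiteTitle, test.title].filter((part) => part !== "").join(" ")
  );
  for (const child of suite.suites) {
    names.push(...collectTestNames(child, suiteTitle));
  }
  return names;
}

// A test counts as new when its full name exists on the branch but not on master.
function findNewTests(branchRoot: SuiteLike, masterRoot: SuiteLike): string[] {
  const masterNames = collectTestNames(masterRoot);
  return collectTestNames(branchRoot).filter((name) => masterNames.indexOf(name) === -1);
}
```

Note that with this approach a renamed test is treated as a new one, which is probably what you want anyway.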

Checking newly created tests significantly reduced both the number of new flaky tests added to the application and the need to refactor them.

Some more benefits that we got:

Faster feedback loop: this workflow runs your new tests immediately, letting you know whether they pass without waiting for their turn within the entire test suite.
Another OS is running your test: all our tests run on Linux, while tests and features considered operating-system sensitive also run on Windows. Using the check-new-flaky CLI, we sometimes discover that a test we thought wasn’t OS sensitive is actually sensitive to, or broken on, the other operating system.

4. Set a bar for when a test isn’t flaky

At first, it wasn't really clear to developers when a flaky test was actually fixed. They would usually run a test 2-10 times before labeling it as not flaky and merging it to master.

Once we declared war on test flakiness, we set the bar at 100 consecutive passing runs.

There are many ways to run a test multiple times — we used parts from the above CLI (check-new-flaky) and made it accessible via our GitHub bot.
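One of those ways, sketched in its simplest form: a helper that runs a test body repeatedly and stops at the first failure. The helper name and return shape are made up for this example, and `runTest` stands for whatever invokes a single test in your setup.

```typescript
interface ConsecutiveRunResult {
  passed: boolean;
  completedRuns: number; // number of successful runs before stopping
  error?: unknown;       // the first failure, if any
}

// Runs `runTest` up to `requiredRuns` times in sequence and stops at the first failure.
async function passesConsecutively(
  runTest: () => Promise<void>,
  requiredRuns = 100
): Promise<ConsecutiveRunResult> {
  for (let run = 0; run < requiredRuns; run++) {
    try {
      await runTest();
    } catch (error) {
      return { passed: false, completedRuns: run, error };
    }
  }
  return { passed: true, completedRuns: requiredRuns };
}
```

The runs are sequential on purpose: running them in parallel changes the load on the machine and can hide or create race conditions of its own.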

Does your test only fail when running on CI?

CI machines usually have lower performance than your local development machine, so most race conditions only show up once the tests run on CI.

Helping tests fail on your local machine

One tool that we have found to be helpful is CPU throttling emulation.
We use Playwright for integration and end-to-end browser tests. It can emulate slow CPUs using the Chrome DevTools Protocol (an experimental feature).

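  import type { ChromiumBrowserContext } from 'playwright-core';
  ...
  // Open a Chrome DevTools Protocol session for the page (Chromium only).
  const client = await (page.context() as ChromiumBrowserContext).newCDPSession(page);
  // Slow the CPU down by a factor of 2 to expose timing-dependent failures.
  await client.send('Emulation.setCPUThrottlingRate', { rate: 2 });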

Here, rate is the slowdown factor (1 means no throttling, 2 means a 2x slowdown).

Find out what’s going on with a test on the CI

Many testing tools today let you record your test runs.
Playwright released a tracing feature in version 1.12, which records the test flow and provides screenshots and DOM snapshots. Since we had a significant issue with flaky tests, we immediately integrated this feature into our testing utils, allowing developers to record runs.

We send CI tracing to a dedicated Slack channel for ease of use.

This feature is super helpful when you have no clue why the test is failing on CI. Tracing helped us catch some unimaginable bugs that we wouldn't have caught otherwise.
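For reference, here is a minimal sketch of wrapping a test body with tracing so that a trace file is kept only when the test fails. It is not our actual testing util: `withTraceOnFailure` is a hypothetical helper, and `TracingLike` is a small stand-in for the `start` and `stop` methods of Playwright's `context.tracing`, so the snippet stays self-contained.

```typescript
// Stand-in for the parts of Playwright's `context.tracing` used here.
interface TracingLike {
  start(options: { screenshots: boolean; snapshots: boolean }): Promise<void>;
  stop(options?: { path?: string }): Promise<void>;
}

// Records a trace while `testBody` runs and saves it only if the test fails.
async function withTraceOnFailure(
  tracing: TracingLike,
  tracePath: string,
  testBody: () => Promise<void>
): Promise<void> {
  await tracing.start({ screenshots: true, snapshots: true });
  try {
    await testBody();
  } catch (error) {
    // Keep the evidence, then let the test fail as usual.
    await tracing.stop({ path: tracePath });
    throw error;
  }
  // The test passed: stop without a path so no trace file is written.
  await tracing.stop();
}
```

In a real test you would pass `context.tracing` and a path such as `trace.zip`, then open the saved file with Playwright's trace viewer. Uploading it somewhere convenient (a Slack channel, in our case) is a separate step.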

Stop using ElementHandles. Start using Playwright Locators

Following the release of Playwright Locators, and with the use of ElementHandle now discouraged, we decided to migrate our test kits and test drivers to Locators. The new API gives us actionability checks, more strictness (detailed below), and, in our React application, reduced flakiness.

From our experience, we can say that simply replacing ElementHandles with Locators in a test can resolve flakiness by itself.

What's wrong with ElementHandles?

Each ElementHandle refers to one specific DOM node. When React reconciles changes, it might replace that node, either because of the changes themselves or because components are unmounted and remounted, which leaves the ElementHandle pointing at a stale node. Keeping references to specific DOM nodes is rarely needed anyway, because we usually obtain those references through selectors, which are agnostic to specific DOM nodes.

How Locators help us to get the correct DOM node

Locators keep the selector itself rather than a reference to a specific DOM node.
Upon an action (e.g. .click()), the locator:
- Uses the selector to query the DOM node relevant for that attempt.
- Verifies it is actionable (attached, clickable, etc.).
- Validates that there is no single-multiple mismatch (for example, a single-element action whose selector matches several elements).
- Repeats the process until it succeeds or the timeout expires.

The actionability validation is batched along with the action itself as an atomic action.
For example, an atomic action could be: check if the button is available, visible, and clickable, and only then click it. This means less communication between Node and the browser.

By doing the query and validation alongside the action, we prevent possible race conditions that could occur between waitFor -> client re-render -> action.
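To make the difference concrete, here is a before/after sketch of the same click. `PageLike` and `ClickableLike` are minimal stand-ins for the parts of Playwright's `Page` used here (`waitForSelector` and `locator`), kept so the snippet is self-contained; the selector is a hypothetical example.

```typescript
// Minimal stand-ins for the parts of Playwright's Page API used below.
interface ClickableLike {
  click(): Promise<void>;
}
interface PageLike {
  waitForSelector(selector: string): Promise<ClickableLike>;
  locator(selector: string): ClickableLike;
}

const SAVE_BUTTON = '[data-testid="save"]'; // hypothetical selector

// Before: the handle points at one specific DOM node. If the app re-renders
// between the wait and the click, the click may hit a stale node.
async function clickSaveWithHandle(page: PageLike): Promise<void> {
  const button = await page.waitForSelector(SAVE_BUTTON);
  await button.click();
}

// After: the locator only stores the selector and resolves the node at click
// time, so there is no separate wait step for a re-render to slip into.
async function clickSaveWithLocator(page: PageLike): Promise<void> {
  await page.locator(SAVE_BUTTON).click();
}
```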

Some more benefits

Increased strictness: by default, a locator throws an exception if the selector matches more than one element.
More readable errors: the error describes exactly why an action could not be performed, instead of a failing assertion or a generic timeout.

Final words

Battling flakiness isn’t a short-term thing.

It requires developers' awareness and care, writing tests more carefully, keeping in mind possible race conditions, and a conscience that tells them it isn’t okay to just rerun tests.

It requires assistive tooling for testing the test itself and monitoring it.

It requires priority, time, and guidelines — things you should receive from the technical management, thus requiring them to be aware of this issue.

A single developer cannot change the state of flakiness — a group effort is needed.

Flakiness is a manageable long-term battle. Empower yourself with the right tools to not only increase your development velocity, but also elevate your overall experience.


Top comments (4)

10x learner • Jan 26 '23 • Edited

Really interesting article !! I didn't know about the term Flaky test until now... I will definitely use it from now on 😄

Your tool check-new-flaky must be so useful ! Is there any chance to see it become available to the public. Personally I don't do a lot of web development, but I know that it could interest some people in the community ! 😉

cloutierjo • Jan 29 '23

Very great and detail article. I was hit by the notification fatigue last week, one of our end to end test do fail about 10% of the time, we know the issue lie in some bad test precondition that was solve with a sleep, but sometime would need a longer sleep... And then last week we had to quickly solve a production issue and didn't bother confirming if the test failed for that issue or for a real new issue. Turn out i had completely disabled a feature and a whole test suite was failing. Fortunately our qa found it, but it mean we had to delay that initial major fix to revalidate the release!

Flaky test are dangerous in the fact that it make test failure normal.