
Split Software · Originally published at split.io

A/A Testing for Product Experimentation You Can Trust

If you have never heard of A/A testing before, or have never seen an articulation of the goals of A/A testing, you might ask, “Why should I take the time to run a test where success is defined as finding nothing significant?” Far from being a waste of time, A/A testing is actually an essential foundation for conducting trustworthy online controlled experiments.

Don’t just take my word for it. One of the liveliest topics at Split’s Flagship 21, a recent online gathering of development, product, and experimentation practitioners, was A/A testing:

Shout-out to the value of A-A tests…you should trust your tools, and A-A tests help instill trust in your exp platform (and/or help you identify issues before you run that really important test!)

Anthony Rindone

You absolutely need to do it the first time you build a new experimentation infrastructure. Then, it can be helpful to have a long-running A vs A, but it’s typically more just when you’re building out new infra.

Jean Steiner

I typically suggest another A/A test if a new test targets cohorts in a new way just to make sure that there’s nothing askew with the targeting.

Channing Benson

So, what exactly is A/A testing, then?

A/A Testing Defined

An A/A test is similar to an A/B test, except that after the population is divided up into different cohorts, the cohorts are given the same, not different, experience. In other words, all cohorts are given the “control” experience and none are given a “treatment” experience. Every other aspect of the test, including the means of assigning test subjects to cohorts, ingesting event telemetry, attributing those events to test subjects, and calculating metrics, is kept identical to the planned A/B test.
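To make the definition concrete, here is a minimal sketch, assuming a hash-based 50/50 assignment; the get_treatment and render_current_ranking helpers are hypothetical stand-ins for your own assignment SDK and page code, not any particular platform's API. The only difference from a real A/B test is that both branches deliberately return the same control experience.

```python
import hashlib

def get_treatment(user_id: str, experiment: str) -> str:
    """Stand-in 50/50 hash-based assignment (your platform's SDK would do this)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "A_prime"

def render_current_ranking(user_id: str) -> str:
    return f"current search ranking for {user_id}"  # placeholder for the real page

def render_search_page(user_id: str) -> str:
    cohort = get_treatment(user_id, experiment="search-ranking-aa")
    if cohort == "A":
        return render_current_ranking(user_id)   # control experience
    return render_current_ranking(user_id)       # identical experience, on purpose

print(render_search_page("user-42"))
```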

Why Perform A/A Testing?

A/A Testing to Prove Trustworthiness

The primary goal of A/A testing is to determine whether the tooling that splits test subjects up into cohorts, gathers telemetry, and calculates the metric results is trustworthy. This area must be addressed before we venture into the secondary goals, so let’s jump right in.

Sample Ratio Check: Are You Getting What You Asked For?

If you plan to split users up in a 50/50 ratio (or 90/10, or any other ratio), does the actual number of users seen during the A/A test come close to that ratio? If not, there is either an issue with how you are calling the testing infrastructure from within your code that makes it “leaky” on one side or the other, or an issue with the assignment mechanism in the test infrastructure itself. This problem is called a sample ratio mismatch (SRM) error, and it can’t hide from an A/A test. If you see 65/35 coming back, you need to dig into the problem before wasting your time on an A/B test that uses the same targeting mechanism.
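As one hedged illustration, a chi-square goodness-of-fit test against the configured split is a common way to flag an SRM. The counts below are made up, and many experimentation platforms run an equivalent check for you.

```python
from scipy.stats import chisquare

# Illustrative counts only: users actually observed in each cohort
observed = [50_912, 49_088]
expected_ratio = [0.5, 0.5]          # the split you configured
total = sum(observed)
expected = [r * total for r in expected_ratio]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:
    print(f"Possible SRM: p = {p_value:.2e}; investigate before running the A/B test")
else:
    print(f"No evidence of SRM (p = {p_value:.3f})")
```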

Randomization Check: Is it Eliminating Bias?

For the result of an A/B test to be meaningful, we need to ensure that there will be no bias between the assignment of treatment and control users. If randomization is not effective, we may see a carry-over or residual effect where previous experiments on the same users bias their behavior in the current experiment. Introducing a new randomization seed for each experiment should reduce the risk of this. Re-using hard-coded lists of sample populations all but ensures it will happen. An A/A test is the most effective way to surface this problem before it spoils the value of an A/B test.
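One way to sanity-check this, assuming a hash-based assignment scheme with a per-experiment salt (the scheme below is illustrative, not any platform's actual implementation), is to confirm that two different experiments do not assign users identically.

```python
import hashlib

def assign_cohort(user_id: str, experiment_salt: str) -> str:
    """Illustrative hash-based 50/50 bucketing keyed on a per-experiment salt."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 100 < 50 else "B"

users = [f"user-{i}" for i in range(10_000)]
same_bucket = sum(
    assign_cohort(u, "experiment-1") == assign_cohort(u, "experiment-2")
    for u in users
)
# With independent randomization this hovers around 50%; a value near 100%
# means the second experiment inherits the first experiment's population split,
# which is exactly the carry-over risk described above.
print(f"identically assigned across experiments: {same_bucket / len(users):.1%}")
```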

Telemetry and Attribution Check: Do We See The Data?

In an A/B test, the point of assigning users into cohorts is to expose them to different experiences and then observe the differences in behavior. This may sound obvious, but before we begin an A/B test, we need to verify that we can reliably ingest an event stream of the behaviors we want to observe. We also need to ensure that the events are effectively attributed to the test subjects. In other words, if we plan to test a new search result algorithm, have we plumbed a data stream to the A/B testing infrastructure that tells us, at the user level, each time a user makes a query, and how many results from each of those queries each user clicks on? For this, the A/A test is as much a forcing-function/readiness check as it is a validation step.
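A rough sketch of such a readiness check, assuming you can export the A/A test’s assignment table and the raw event stream (the field names here are hypothetical), is simply to measure what fraction of incoming events can be tied back to an assigned test subject.

```python
# Hypothetical exports: a user_id -> cohort assignment table and a raw event stream
assignments = {"u1": "A", "u2": "A_prime", "u3": "A"}
events = [
    {"user_id": "u1", "event_type": "search"},
    {"user_id": "u1", "event_type": "result_click"},
    {"user_id": "u9", "event_type": "search"},   # no assignment on record
]

attributed = [e for e in events if e["user_id"] in assignments]
coverage = len(attributed) / len(events)
# If coverage is well below 100%, events are arriving that can't be tied back
# to a test subject, and per-user metrics will be computed from partial data.
print(f"{coverage:.0%} of events could be attributed to an assigned user")
```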

Metric Calculation Check: Is Anything Showing as Significant?

This is the most intuitively straightforward aspect of an A/A test: if we give two similar populations the same experience and yet our test infrastructure declares that there is a significant difference between the behavior of the two populations, that’s a concern!

Bear in mind that if the odds are 5% that a metric may show significance in the face of random behavior (a common configuration threshold for false positive or Type I errors), then a test with 20 observed metrics is more likely than not to produce at least one false-positive result (roughly a 64% chance), even if all is well.
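Here is that arithmetic worked out:

```python
# With a 5% per-metric false positive rate and 20 independent metrics, the
# chance of at least one spurious "significant" result is well over half.
alpha = 0.05
n_metrics = 20
p_at_least_one = 1 - (1 - alpha) ** n_metrics
print(f"P(at least one false positive) = {p_at_least_one:.0%}")  # ~64%
```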

A/A testing helps us see what “results” appear when there is supposedly no difference, so that we can either tighten up the configuration parameters, employ multiple comparison correction to reduce the chance of false positives leading to false discoveries, or adjust our success criteria (e.g. insisting that at least three independent metrics confirm success instead of just one).
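As one hedged example of a multiple comparison correction, a simple Bonferroni adjustment divides the significance threshold by the number of metrics. The p-values below are invented, and your platform may apply a different correction (such as Benjamini-Hochberg) instead.

```python
alpha = 0.05
# Hypothetical per-metric p-values from an experiment readout
p_values = {"conversion": 0.012, "search_ctr": 0.048, "latency_p95": 0.300}

adjusted_alpha = alpha / len(p_values)          # 0.05 / 3 ≈ 0.0167
significant = {m: p for m, p in p_values.items() if p < adjusted_alpha}
print(f"Per-metric threshold after correction: {adjusted_alpha:.4f}")
print(f"Metrics still significant: {list(significant)}")  # only 'conversion'
```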

A/A Testing To Calibrate Business Expectations

If you have passed the trustworthiness tests above, congratulations! You now have the ability to observe users passing through an experience, gather event data attributed to those users, and calculate trustworthy metrics from those events.

This brings us to the secondary aspect of A/A testing: calibrating business expectations for A/B tests.

Baselines: Where Are You Starting From?

Before you can establish success criteria for your A/B tests, you need to know where you are starting from. What are the “current” values of the metrics, before you run any experiments? By running A/A tests, you will have the baselines for your metrics in the same system that you will later look to observe changes in. These “before” numbers anchor the expectations of your stakeholders. If there is any debate about whether these metrics are valuable to the business, it’s better to resolve that before moving on to A/B testing.

If you are looking to run a rigorous experimentation practice, consider going beyond baselines for the metrics you intend to improve and also establishing baselines for metrics you are committed to not degrading. A/B tests that focus on the latter are called “do no harm” tests, and those metrics, whether they appear in a growth-focused or a do-no-harm test, are often referred to as guardrail metrics.

Power: The Role of Volume and Variability

The final thing I’ll cover here is the role of A/A testing in revealing how small a change in your metrics you can realistically expect to detect. This isn’t so much about the system you are using as it is about the volume of test subjects (i.e. users) you can send through an experiment in a given amount of time, and how much the test subjects’ behavior varies across the population. These things matter because our goal in the A/B tests will be to determine whether the differences in metrics between the A and B populations exceed the random noise of typical behavior.

If your A/A test reveals that the portion of your application you wish to test is blessed with a very large volume of traffic exhibiting a narrow range of behaviors (i.e. not a lot of outliers or unevenly distributed behavior patterns), you live in fortunate circumstances and will be able to reliably detect even very small changes to your metrics relatively quickly. Less volume and more variability of behavior mean the changes to metrics must be larger to be credibly detected and those changes will take more time to prove out.
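As a rough sketch of how A/A baselines feed into this, the standard two-sided difference-of-means approximation estimates the users needed per cohort from the standard deviation observed during the A/A test and the smallest change you care to detect. The metric and numbers below are hypothetical.

```python
from scipy.stats import norm

def users_per_cohort(baseline_std: float, min_detectable_effect: float,
                     alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per cohort for a two-sided difference-of-means test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n = 2 * ((z_alpha + z_beta) ** 2) * (baseline_std ** 2) / (min_detectable_effect ** 2)
    return int(round(n))

# e.g. clicks-per-session with std dev 2.0 (from the A/A test), hoping to
# detect a lift of 0.05 clicks: roughly 25,000 users per cohort
print(users_per_cohort(baseline_std=2.0, min_detectable_effect=0.05))
```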

Now You Know Why A/A Testing is Essential

Hopefully by now, I have succeeded in convincing you that A/A tests are indeed essential to any experimentation practice. From here, you may be asking some more general questions about experimentation platforms or more specific follow-up questions about when and how to conduct A/A tests. Here are my suggestions for further reading/viewing:

Whether you are exploring experimentation practices or well along the journey and looking to advance and ramp up experimentation in your organization, you can count on us to continue creating new content and events to support you on your path. Follow us on Twitter @splitsoftware and subscribe to our YouTube channel to stay in the loop!
