Daniel Macák

When NOT to use A/B tests

A/B testing is the de facto standard for determining which experience (treatment or control) is better. If you can run them, I definitely recommend it, as it grounds your launch/no-launch decision in data. But even with enough users and a proper setup, there are cases where A/B testing is not the answer. Let's explore why that might be and what the alternatives are.

When A/B is not the best idea

Sometimes it's better to use a different experimental method; other times an experiment isn't needed, or isn't even possible at all. Let's look at the cases where developers like you and me should question the need for an A/B test.

Not enough units

As I have highlighted in my previous article, one doesn't need that many users to run A/B tests. However, circumstances can lower the number of experiment randomization units (usually users) so much that an A/B test can't be run.

Picture that you have enough users, but a certain experiment requires that you randomize by accounts instead, say because the experience must remain consistent between account members. Depending on the average size of the accounts, this can dramatically reduce the number of units you can experiment on, and it could take too long to gather any significant results.
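
To get a feel for how much account-level randomization hurts, here is a minimal sketch that computes the per-group sample size for a classic two-proportion test and the resulting experiment duration. The traffic numbers, account size and the 5% → 6% conversion lift are made-up assumptions:

```python
from statistics import NormalDist

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Classic two-proportion z-test sample size per group."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = abs(p_treatment - p_control)
    return (z_alpha + z_beta) ** 2 * variance / effect ** 2

# Made-up numbers: detecting a 5% -> 6% conversion lift.
n = sample_size_per_group(0.05, 0.06)

users_per_week = 20_000      # hypothetical traffic
avg_account_size = 8         # hypothetical members per account
accounts_per_week = users_per_week / avg_account_size

print(f"~{n:,.0f} units needed per group")
print(f"user-level randomization:    ~{2 * n / users_per_week:.1f} weeks")
print(f"account-level randomization: ~{2 * n / accounts_per_week:.1f} weeks")
```

This simple calculation even ignores the extra variance that clustered, account-level metrics introduce, so in practice the account-level duration would be longer still.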

Just launch already!

Here we are talking about a situation where it's clear that the treatment experience is superior.

Imagine you have an online product which users can purchase on your payment site. The product is great, but the payment site not so much, because when a payment fails, it gives the user no explanation why. This is frustrating since users don't know what to do next. Should they call the bank, recheck their billing details, or something else?

Now let's say you decided to solve this problem by displaying a nice dialog with a useful explanatory message, and your first instinct is to A/B test it. You might think that users hate dialogs, so you would like to be sure it performs well. Or you've just gotten into the habit of A/B testing everything and you love the dopamine rush when you've improved key metrics. No worries, it's addictive alright.

But the thing you might not realize is that while the experiment is running, say for 2 weeks, the control group is exposed to an objectively worse experience. Not only are your users frustrated, but your business is losing them every day. While a dialog might not be the best UI widget to communicate payment failure, it's miles better than no failure communication at all. In other words, you make the UX worse only to receive test results that are clearly expected. And since the launch/no-launch decision is clear upfront and there is no learning to be made, running an A/B test is a waste here.

Large UI redesigns

Unless such redesigns are approached in a step-wise, widget-by-widget fashion, a big-bang introduction of a new design probably doesn't make sense to A/B test.

Design phase

That doesn't mean you shouldn't keep control in other ways. First of all, you should have a clear goal in mind. Is it just a design refresh, or a larger overhaul of the UX to reduce friction? And is a redesign needed at all? User research will help you clear up these questions.

Whichever way you decide to go, it's always a good idea to start testing your designs as soon as possible to uncover problems and friction, usually with qualitative methods. At this stage, no quantitative experimentation is generally needed, unless you'd like to target specific areas of the UI and compare pre- and post-redesign components. That is not always possible, as newly designed components might feel jarring in the context of the old design, but where it works it can provide early learning and confidence in the redesign's direction.

Roll out phase

Generally it makes sense to roll the redesign out gradually by traffic percentage and by segments, e.g. first to beta users, then to new users only. So now is the time to launch the A/B test, right? Not really. Consider this:

  • The treatment UI can be so different from the old one that you would be comparing apples to oranges
  • For that reason, your findings might not be actionable. If a certain widget underperforms, is it due to the widget itself, its surroundings, or something the user has or hasn't seen before?
  • If the treatment underperforms as a whole, what do you do? Revert the roll-out until you fix it? How are users going to react to their UI switching from new to old and back? And isn't it too late for such findings anyway?

That brings us to an important point - A/B tests actually align with the Agile philosophy: they should make it easy, cheap and quick to test ideas and throw them away if they don't perform. A/B testing a whole new UI after months of work is neither cheap nor quick, and a lot of the time the interested parties know that throwing away the shiny new UI is not an option anyway.

Therefore, don't let it come this far without knowing your redesign is a good idea. Validate it and test your concepts and designs early to keep the cost down should your approach be wrong. Once you start rolling out, by all means track both the old and new UI and even compare them, for example to discover seasonality effects on your metrics.
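
As for the roll-out mechanics, the assignment to the old or new UI is usually done with deterministic hashing rather than random coin flips, so a user always lands in the same variant and you can ramp the percentage up over time. A minimal sketch; the salt, segment names and percentages are hypothetical:

```python
import hashlib

def rollout_bucket(user_id: str, salt: str = "redesign-rollout") -> float:
    """Map a user deterministically onto the range [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 10_000 / 100  # 0.00 .. 99.99

def sees_new_ui(user_id: str, segment: str, rollout_percent: float) -> bool:
    # Hypothetical ramp: beta users first, then everyone else by percentage.
    if segment == "beta":
        return True
    return rollout_bucket(user_id) < rollout_percent

# Ramp from 5% to 50% just by changing the config value.
print(sees_new_ui("user-42", "regular", rollout_percent=5.0))
print(sees_new_ui("user-42", "regular", rollout_percent=50.0))
```

Because the bucket is stable for a given user, ramping the percentage up only ever moves users from the old UI to the new one, never back and forth.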

If no A/B, what then?

Let me start by saying that even though an A/B test might not be possible in certain cases, you still have to retain control. It goes without saying that you should have solid monitoring in place to detect failures of all kinds. But equally important, you need comprehensive tracking to keep an eye on the important metrics. If the number of purchases drops below an acceptable threshold, you need to know about it as soon as possible in order to react properly.
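
In its simplest form, such tracking can be a check of today's value of a key metric against a trailing baseline, alerting when it drops too far. A minimal sketch; the 14-day baseline and the 20% drop threshold are assumptions, and a real setup would live in your monitoring stack rather than a script:

```python
from statistics import mean

def metric_dropped(history, today, max_drop=0.20):
    """Return True if today's value fell more than `max_drop`
    below the trailing 14-day average - i.e. someone should look at it."""
    baseline = mean(history[-14:])
    return today < baseline * (1 - max_drop)

daily_purchases = [510, 498, 530, 505, 490, 515, 502,
                   508, 512, 495, 520, 501, 497, 509]
if metric_dropped(daily_purchases, today=350):
    print("ALERT: purchases dropped more than 20% below the 14-day baseline")
```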

But besides that, how do you make that launch/no-launch decision without an A/B test?

Interrupted time series (ITS)

Enter ITS. It is a quasi-experimental design, one of the more reliable and universal ones in this class. I say quasi because control and treatment can't be properly randomized with it.

It works by measuring the period before the intervention (control), projecting it into the future (the counterfactual) and comparing that projection with measurements after the intervention (treatment). The difference between the projection and the actual post-intervention measurements is the outcome of this method. The intervention is performed for all your users, which means control and treatment never exist at the same time.

Interrupted time series chart

The forecast should be produced with a proper forecasting model - statistical or ML - to get the best precision.
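
To make the mechanics concrete, here is a minimal sketch of the ITS idea on synthetic data, using plain least squares with a linear trend and day-of-week dummies as the forecasting model. In practice you'd reach for a proper forecasting library, and all the numbers below are made up:

```python
import numpy as np

def design_matrix(days: np.ndarray) -> np.ndarray:
    """Linear trend plus one-hot day-of-week dummies."""
    dow = np.eye(7)[days % 7]
    return np.column_stack([np.ones(len(days)), days, dow])

# Synthetic daily metric: 60 days before the intervention, 21 after.
rng = np.random.default_rng(0)
pre_days, post_days = np.arange(60), np.arange(60, 81)
pre = 100 + 0.3 * pre_days + 5 * np.sin(2 * np.pi * pre_days / 7) + rng.normal(0, 2, 60)
post = 110 + 0.3 * post_days + 5 * np.sin(2 * np.pi * post_days / 7) + rng.normal(0, 2, 21)

# Fit on the pre-intervention period (control)...
coef, *_ = np.linalg.lstsq(design_matrix(pre_days), pre, rcond=None)
# ...project it into the post-intervention period (the counterfactual)...
counterfactual = design_matrix(post_days) @ coef
# ...and compare with what actually happened (treatment).
lift = post - counterfactual
print(f"Estimated average effect of the intervention: {lift.mean():+.1f} per day")
```

The weak spot is exactly the forecast: if the pre-period model misses something, the "effect" you read off is partly just forecast error.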

Using ITS, you can work around cases where A/B is not possible or desirable and still compare control and treatment, yay! However, as with most things in life, ITS is not a silver bullet, far from it actually. All quasi-experimental designs share a common weakness: it's very easy to make a mistake and completely invalidate the results. In fact, it's not uncommon for quasi-experimental methods to have their results completely refuted by a proper experimental design later on.

In the case of ITS, you have to account for confounding effects, seasonality being the most obvious one. To mitigate it, the intervention can be toggled on and off multiple times to explore seasonality's impact on the results. Still, there can be other confounders at play, and as such ITS requires great care at design time to discover them early on.

Long term holdouts

This term describes a situation where the new experience is rolled out to most users, typically around 90%, while the rest are left with the old experience for a longer time.

As you have probably guessed, this is in no way an experimental approach, but rather a verification step. It is usually used to spot novelty or other long-term effects (and as such can be used after an A/B test), but in my opinion it can also give you peace of mind when rolling out the treatment experience to almost all users without prior verification in an A/B test. For example, if you launched to 100% and were suddenly met with a dip in important metrics, you might be unsure whether the treatment experience or other factors like seasonality are to blame. Having a small long-term holdout can clear this question up.
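
A sketch of how the holdout clears that up: normalize the metric per user in both groups and compare their relative changes. If both groups dip together, look for external factors like seasonality; if only the launched group dips, the treatment is the prime suspect. The group sizes, per-user numbers and the 5% threshold below are made up:

```python
def relative_change(before: float, after: float) -> float:
    return (after - before) / before

# Made-up per-user daily purchase rates, before vs. after the dip appeared.
launched = {"users": 90_000, "before": 0.051, "after": 0.045}
holdout = {"users": 10_000, "before": 0.050, "after": 0.049}

launched_drop = relative_change(launched["before"], launched["after"])
holdout_drop = relative_change(holdout["before"], holdout["after"])

if launched_drop < -0.05 and holdout_drop >= -0.05:
    print("Only the launched group dropped -> suspect the treatment")
elif launched_drop < -0.05 and holdout_drop < -0.05:
    print("Both groups dropped -> suspect seasonality or other external factors")
```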

However, it's important to say that withholding the treatment from the holdout users, knowing it is superior, can be unethical, so use it wisely.

Conclusion

There are many cases where A/B tests are not possible or desirable. While it's definitely possible to keep control over launching new experiences without them, A/B tests are still superior to the alternative methods of experimentation and should be preferred where possible.

Let me know what you think and I'll see you in the next one ❤️.

Sources

Kohavi R, Tang D, Xu Y. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press; 2020.
https://towardsdatascience.com/sample-size-planning-for-interrupted-time-series-design-in-health-care-e16d22bba13f
