Adrian B.G.

Originally published at coder.today

A/B tests for developers

This article is only the first part of the main story from my blog: A/B tests developers manual.

Best case scenario: your [product owner, boss, producer] found out about A/B tests and you are here to learn how to implement them.

Worst case scenario: Your product is already a mess because of A/B tests, and you want to clean it up.

Either way, I’m writing this article so you do not have to repeat our mistakes.

For the last 5 years I have worked mostly in the gaming industry. I had to implement hundreds of A/B tests, and I learned that they are a powerful 💪🏾 tool. At the same time I learned that if you do not pay enough attention, your code turns into a spaghetti 🍝 restaurant.

I wish there were a single, simple 🎯 way to implement A/B tests without making a mess in your code, but I don’t know of any. By definition, your code needs to contain multiple versions of the same behavior.

Intro ⚓

Skip this block if you are already familiar with A/B testing.

A/B testing is also called multivariate testing, A/B/C/D testing, split testing, or bucket testing. It is an iterative process of experimentation that helps you find out what is better for your product. More formal definitions: here and here.

Your product (game, app, website, shop …) can grow in 2 ways:

  • a person says “feature X will improve the Y KPI by 30%”, and you implement this feature. We mortal humans cannot predict the future; we can only guess.
  • a person says “feature X is the best”, another says “feature Y is the best”. You implement the X, Y, and Z variants and measure exactly which one is better. It may be none, one, or several of them. You keep the best versions and improve them with further split tests.

The tests are done on smaller, but representative, samples of users. This way you can test multiple versions in parallel and mitigate the effects: you do not know how a version will affect user behavior, so you want to minimize the possible damage.

You start by distributing your users into buckets; each bucket provides a different user experience. You collect the data and analyze the impact of each version, then choose the best bucket and roll it out to all the users. The process is more complex than this, but that is the main idea.

Basic example: you want to find the right price point for a new product, so you run a test. 50% of the users are left out of the test and keep the $5 price. The remaining 50% are split equally into 5 A/B test versions ($5 control group, $10 version 1, $15 version 2, $20 version 3, $25 version 4).
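A minimal sketch of how such a deterministic split could be implemented (the test name, hash function, and price values below are illustrative assumptions, not a specific library's API): hashing the user ID together with the test name maps every user to a stable point, so the same user always lands in the same bucket.

```typescript
// A sketch of deterministic bucket assignment for the pricing example above.
const PRICE_TEST = {
  name: "price_point",              // hypothetical test name
  holdout: 0.5,                     // 50% of users stay out of the test
  variants: [5, 10, 15, 20, 25],    // $5 = control group, then versions 1-4
};

// A tiny stable string hash (FNV-1a); any stable hash would do.
function hash(s: string): number {
  let h = 2166136261;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0; // force unsigned 32-bit
}

// Hashing userId + test name gives every user a stable position in [0, 1),
// so the same user always lands in the same bucket, in every session.
function priceFor(userId: string): number {
  const p = hash(userId + PRICE_TEST.name) / 2 ** 32;
  if (p < PRICE_TEST.holdout) return 5; // out of the test, default $5 price
  const slice = (1 - PRICE_TEST.holdout) / PRICE_TEST.variants.length;
  const idx = Math.floor((p - PRICE_TEST.holdout) / slice);
  return PRICE_TEST.variants[idx];
}

console.log(priceFor("user-42")); // always the same price for user-42
```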

Professional commitment ✍

I, the developer, swear not to be biased. We cannot allow any personal or technical difference or issue to affect the A/B test result (user behavior). For example:

  • loading: make sure all the resources are loaded from the same source (CDN/hosting), so the network times are similar for all users
  • size: make sure the file sizes are similar across all versions; do not ship one variant’s button with a 1 MB background while the others have 50 KB ones

…you get the picture. All the users must start from the same premise. If a technical irregularity appears, please let the team know and repeat the test.

We, the developers, are in charge of the implementation and technical details; we must guarantee that the test is technically unbiased and act as a firewall.

"I, the developer, swear to collect the right data from the users and do not mess with the tracking." Easy to say, hard to do. The main idea is that the A/B test result is based on the KPI events, by observing users behavior during the split test. Data anomalies can mean “bugs” or a clear “winner”.

Usually, when something is too good to be true, it is a bug.
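What “collecting the right data” can look like, as a minimal sketch (the event shape and the `track` function are assumptions for illustration, not a real analytics SDK): every KPI event carries the test name and the assigned variant, so the results can be segmented per bucket.

```typescript
// Sketch: every KPI event records which variant the user saw,
// so results can be segmented per bucket during analysis.
interface KpiEvent {
  userId: string;
  name: string;                     // e.g. "purchase", "level_completed"
  abTests: Record<string, string>;  // test name -> assigned variant
  payload?: Record<string, unknown>;
}

function track(event: KpiEvent): void {
  // In a real system this would go to your analytics pipeline;
  // here we just log it.
  console.log(JSON.stringify(event));
}

// The variant must come from the same assignment used to render the UI;
// never re-compute it differently at tracking time, or the data lies.
track({
  userId: "user-42",
  name: "purchase",
  abTests: { price_point: "version_2" },
  payload: { price: 15 },
});
```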

A. Good to have 🛠

It’s easier to work with a better codebase. If your code is already a mess, the A/B tests will multiply that mess.

Modules. If the code is already split into modules (packages, files, classes), most likely you will only need to modify one portion of your code; if not… then you must like spaghetti.

Parameters and configs. Everything in an A/B test is reduced to a parameter value, usually a string. If your business logic already has the tested parameter as a variable or config value, it’s very easy to implement the test.

No magic values. If the parameter you are testing is hard coded in multiple places, let this be your lesson: do not use magic values (magic numbers are the most common mistake).
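A minimal sketch of the last two points combined (all names and values below are hypothetical): the tested value lives in exactly one config entry, and the A/B test variant only overrides that entry, so no magic value is hard coded anywhere.

```typescript
// Hypothetical config: the default value lives in exactly one place.
const defaults = { checkoutButtonColor: "#3478f6" };

// A/B test overrides, keyed by variant; everything is "just a parameter value".
const overrides: Record<string, Partial<typeof defaults>> = {
  control:   {},
  version_1: { checkoutButtonColor: "#e0245e" },
};

// All business logic reads through one accessor instead of magic values.
function config(variant: string) {
  return { ...defaults, ...(overrides[variant] ?? {}) };
}

const { checkoutButtonColor } = config("version_1"); // "#e0245e"
```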

...

To read the full story you can continue on my website: A/B tests developers manual.

Remember to share and subscribe if you learned something new. Thanks!

Top comments (12)

Scott Tadman

Don't forget basic statistics here. If you're developing an A/B test for a case that will be used millions of times per day then a hundred tests is not going to be conclusive. You need to test a statistically significant number of times relative to your actual use case.

Adrian B.G.

Yes, of course. Besides statistical relevance there are other factors, like user acquisition: the cohorts have to be very similar, and there is also an error margin of at least 5% which should be taken into consideration when comparing the stats.

That is the business side of doing tests; I tried to cover only the technical details.

The web is full of articles on how to do proper testing, but it was lacking in implementation details, so I wrote this story.

Ben Halpern

Yep. This is a conversation I've definitely had numerous times.

Timkor

We use load balancing to distribute users over different Git branches. That actually works quite nicely. And it works for both clean and spaghetti code!

The nicest advantage here is that you can change anything in a variant, including backend parts (if, of course, your project contains both a frontend and a backend).

Not possible with cross-variant testing, though.

Adrian B.G.

“We use load balancing to distribute users over different Git branches. That actually works quite nicely.”

I don't see how you can achieve a good user experience and a relevant result based on that; I mean, you cannot guarantee that the user will end up in that variant for their entire lifetime (across sessions) with just an LB.

Also, you can introduce technical bias: for example, a backend or set of servers can have different latency, which will skew your business results. The owners will think that feature A is better, but the users actually responded better because of the lower latency.

So I would not recommend this approach for a complex project. For small stuff or landing pages where the users have only 1 visit, sure, nothing can go wrong.

Timkor

We use cookies to set the variant within the load balancer. The instances run on the same server, so the only latency difference should be the result of changes to the code. Which is exactly what you want to test.

Also, how are cookies different from your approach?

Adrian B.G.

I said users’ lifetime (across sessions). Cookies are session-based/volatile; if the user uses another browser or clears their cookies, they will see a different version, which results in the issues I mentioned above.

I did not present my approach because I don't know all the details, but all the tests we did in gaming used a database for persistence. We had the luxury of having all our visitors be authenticated users, so we knew who was in which test.

Depending on what is tested, persistence may or may not be required. Usually our features required a few weeks of measuring their impact on user behavior; without authentication this cannot be done properly.
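Something like this minimal sketch, with a hypothetical key-value store (not our actual system): look up the stored assignment first, and only assign and save a new bucket if none exists.

```typescript
// Sketch of persistent bucket assignment for authenticated users.
// `Db` is a hypothetical key-value store, not a specific library.
interface Db {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string): Promise<void>;
}

async function variantFor(db: Db, userId: string, test: string,
                          variants: string[]): Promise<string> {
  const key = `abtest:${test}:${userId}`;
  const stored = await db.get(key);
  if (stored !== undefined) return stored; // same bucket across sessions/devices

  // First time we see this user in this test: assign and persist.
  const variant = variants[Math.floor(Math.random() * variants.length)];
  await db.set(key, variant);
  return variant;
}
```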

Timkor

Nevertheless, this would also be possible using load balancing, although it might require some customization.

Anyway, I use it for webshops, targeting new users. Most e-commerce websites do not have the luxury of using authentication before any actual conversion. In that case this approach works perfectly.

Good article though.

Adrian B.G.

For testing an action like conversion that sounds great; you do not even care about the user’s lifetime, but rather about which version reacted better/got converted.

Besides round robin, the LB/proxy can also be used to build the cohorts, for example based on country, or to limit an entire test based on a property (for example country, region, language, device, mobile vs web).

David J Eddy

Nice article; especially the A/B vs Blue/Green part.

Could you share where you found the 'Journey Prep' image?

Looking forward to the next article, Adrian. Thanks.

Adrian B.G.

Thanks!

I think it is from canva.com’s free images, but I would have to search again; it’s been more than a year since I wrote this.

PavanDixit

Hi, great information! Can anyone help me with tips? I have an interview for the role of A/B testing developer!