Revolutionize Your Product with AB Testing in R

Introduction

What is AB testing?

When it comes to your typical product or engineering org, team members are often left wondering whether the thing they shipped had an impact, or whether the option they chose among many different design options was actually the best. As these organizations move toward data-informed design decisions, AB testing is first in line.

AB testing is a methodology for comparing multiple versions of a feature, a page, a button, etc. by showing the different versions to customers or prospective customers and assessing the quality of interaction by some metric (click-through, purchase, following any call to action, etc.). Any time you want to test multiple variations of something is a great use case for AB testing.

How to get started

It’s important to break the process down into its different steps.

Design, plan & run your experiment

The first part of this is specifying your hypothesis, or what you believe will happen. (More specifically, what you believe will happen is the Alternative Hypothesis, while the NULL Hypothesis assumes no difference between variants.)

Let’s run through the things that you need to know or formulate before anything else:

  • Alternative Hypothesis: This is what you believe is going to happen. For example: Variant B is going to perform 20% better than Variant A.
  • NULL Hypothesis: This assumes that there is no difference between the variants.
  • Specify the dependent variable: Decide what your feature needs to accomplish, whether that’s getting someone to click through to the next page, put more items in their cart, purchase goods, or follow any other call to action. Whatever that call-to-action metric is, that’s what we’ll use to interpret the performance of our variations.
  • Specify the independent variable: In a given experiment you might have a variety of independent variables you want to use to predict Y, your dependent variable; in the case of an AB test, the explanatory variable is simply which variant a customer was shown, and we measure which version leads to what outcome.

One other thought when conducting your experiment: roll the variants out to customers during the same time period, rather than staging different variants across different time periods. The more non-random variation there is in your sample, the less dependable your experiment will be. Time is a good example of something to standardize on rather than letting seasonality play a role in the outcome of your experiment. When you launch an experiment, launch it for all variants at once.
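To make that concrete, here’s a minimal sketch of randomly assigning every customer to a variant at launch, so both versions run over the same period. (The customers data frame here is hypothetical; the source doesn’t show its assignment code.)

set.seed(2020)

# Randomly split all customers across variants at the same time,
# rather than showing A for a few weeks and then B
# (`customers` is a hypothetical data frame of customers)
customers$variant <- sample(c("A", "B"), size = nrow(customers), replace = TRUE)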

From here, one of the most common questions people have is: how many samples of each variant do I need to get statistically significant results?

In order to determine this we perform something called a power analysis. The idea of a power analysis is to identify the requisite sample size based on a series of parameters: things like statistical power, the significance level, the number of variants, and the size of the difference between the groups’ measurements. The reason we do this is to make sure we don’t run the experiment so long that a ton of our customers have to see the worse version, but still long enough to justify our results.

I’ll break those down for you right here.

  • k – number of variants: at least two and as many as you want. One thing to keep in mind is that the more variants, the more data you’ll need.
  • n – sample size per group: We’ll leave this as NULL; that’s what we’re solving for.
  • f – effect size, the difference between the groups that we want to be able to detect: The bigger the difference, the smaller the required sample, and the smaller the difference, the greater the required sample to validate it. Using the sample data I generated, we have a minimum detectable difference of 18.7%.
  • sig.level – significance level: If the results validate your hypothesis, this is the likelihood of a purely random (false positive) result we can live with; that’s typically .05.
  • power – statistical power: If your hypothesis is true, this is the probability that the test would detect it. The standard is typically .8.

Let’s load the pwr package and use pwr.anova.test to identify our requisite sample size.

I’ve created a couple of dummy datasets to play with: one for the experiment and one from before the experiment.
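As a rough sketch, the dummy data might be simulated something like this. The column names and conversion rates mirror the ones used below, but the exact generation code is an assumption on my part, not the author’s original.

library(dplyr)

set.seed(123)

# Pre-experiment data: only the current experience (variant A), ~50% conversion
pre_experiment_data <- data.frame(
  call_to_action = rbinom(n = 500, size = 1, prob = 0.5)
)

# Experiment data: customers randomly split between variants A and B over a month
experiment_data <- data.frame(
  variant = sample(c("A", "B"), size = 250, replace = TRUE),
  date = sample(seq(as.Date("2020-01-01"), as.Date("2020-01-30"), by = "day"),
                size = 250, replace = TRUE)
) %>%
  mutate(call_to_action = rbinom(n(), size = 1,
                                 prob = ifelse(variant == "A", 0.5, 0.67)))

With those dummy datasets in place, the power analysis looks like this: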

library(pwr)

pwr.anova.test(k = 2,           # number of variants
               n = NULL,        # sample size per group -- what we're solving for
               f = .202,        # effect size
               sig.level = 0.05,
               power = .8)

As you can see in the output above, n, the required sample size per variant, is 98.

As a reference, I’ll also include a code snippet of what you might do to understand the baseline, the baseline being variant A:

library(dplyr)

pre_experiment_data %>% 
  summarize(conversion_rate = mean(call_to_action))

This gave me a result of 0.502, which served as our baseline to compare with variant B’s 0.673.
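One last note on the effect size parameter before moving on: as a quick, hedged sketch (reusing the same pwr.anova.test call), you can see how strongly the required sample size per group depends on f.

library(pwr)

# Required n per group shrinks quickly as the detectable effect size grows
sapply(c(0.1, 0.2, 0.3), function(f) {
  ceiling(pwr.anova.test(k = 2, f = f, sig.level = 0.05, power = 0.8)$n)
})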

Validate & analyze your results

Once you’ve collected the roughly 98 samples per variant that the power analysis called for, it’s time to validate your results.
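As a quick sanity check (a small sketch, assuming the experiment data is structured as described above), you can confirm how many observations each variant has accumulated:

experiment_data %>%
  count(variant)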

It’s worthwhile to jump into your experiment data and review the conversion rate of your metric grouped by each variant yourself before piping it through a GLM.

experiment_data %>%
  group_by(variant) %>%
  summarize(conv_rate = mean(call_to_action))

That’s just about what we had seen previously, which is a good sign. Let’s validate this outcome using a GLM.

Another option is to chart the performance of the different variants over the course of the experiment. Below, we group the experiment data by date and variant, aggregate the average conversion, and then plot the average conversion per group over time. As this was sample data, I adjusted it for the sake of the visualization.

library(ggplot2)

experiment_data %>%
  group_by(date, variant) %>%
  summarize(conv_rate = mean(call_to_action)) %>%
  ggplot(aes(x = date,
             y = conv_rate,
             color = variant,
             group = variant)) +
  geom_point() +
  geom_line() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

As you can see in the statement below, we are trying to explain call_to_action by which variant of the experience each customer had. The tidy() call on the end just cleans up your output; to use it, make sure to install the broom package.

library(broom)

glm(call_to_action ~ variant,
    family = "binomial",
    data = experiment_data) %>%
  tidy()

First, looking at the row for variantB, you can see that the p.value is lower than our .05 significance threshold.

Now, comparing the estimate for the intercept (our control, variant A) with the estimate for variantB, we can see that our test variant had a far higher conversion rate than the control. Keep in mind that, because we fit a binomial GLM, these estimates are on the log-odds scale.
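If you want to translate those log-odds estimates back into conversion rates, a short sketch like the following (refitting the same model) does the trick:

model <- glm(call_to_action ~ variant,
             family = "binomial",
             data = experiment_data)

coefs <- coef(model)

# Convert log-odds back to probabilities with the inverse logit
plogis(coefs["(Intercept)"])                      # estimated conversion rate, variant A
plogis(coefs["(Intercept)"] + coefs["variantB"])  # estimated conversion rate, variant B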

Conclusion

From here we can reject the NULL hypothesis in favor of our alternative! Now it’s time to implement your results and continue to iterate accordingly!

If you found this helpful, come check out the rest of my lessons at datasciencelessons.com!

Happy Data Science-ing!
