A/B and multivariate testing is one of the most powerful and frequently used tools of a web analyst. But what the heck goes on behind the scenes, and why does it work? This three-part series will discuss the data, the hypothesis tests, regression models and how to model different types of data.
You should probably have some knowledge of statistics or probability theory to fully appreciate this post. If you find yourself scratching your head about something, feel free to e-mail me your questions or write a comment. This post also assumes you are familiar with web analytics and the purpose of multivariate testing in a web environment.
One of the problems with multivariate testing in the web analytics community is that it has become synonymous with “optimizing conversion rate”. This is probably one of the reasons why conversion rate hysteria flourishes among web analysts and e-commerce CEOs. Even though we’ll be talking about testing conversion rate in this series (go figure), we’ll move on to testing other responses such as # products in a basket, average order value, loading time of page, # uploaded pictures to a gallery etc. in later posts. But before we begin testing and modeling the data, we need to understand it…
Contingency tables
Throughout this series I will be using an example where the web analyst Klas wants to run an A/B test to see if there is a difference in the conversion rate when changing the color of the buy button from red to blue. He sets up the test by randomly showing the blue button to 50% of the visitors and the red button to the other 50%. It doesn’t matter what proportions you choose, but it is important to keep it random. For example, you don’t want to show the red button 80/20 to females and 20/80 to males, since this gives us an association between button occurrence and gender which could cause problems if we don’t factor in gender in the model.
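Here is a rough sketch of what "keeping it random" looks like in Python. The traffic, the gender attribute and the `assign_button` helper are all made up for illustration; the only thing taken from the example is the 50/50 coin flip that ignores who the visitor is.

```python
import random

def assign_button(rng, share_blue=0.5):
    """Pick a button color for one visitor, independently of anything else."""
    return "blue" if rng.random() < share_blue else "red"

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical traffic: 10,000 visitors with a made-up gender attribute.
visitors = [rng.choice(["female", "male"]) for _ in range(10_000)]

shown = {"female": {"blue": 0, "red": 0}, "male": {"blue": 0, "red": 0}}
for gender in visitors:
    shown[gender][assign_button(rng)] += 1

# Assignment ignores gender, so the blue share is roughly equal in both groups.
blue_share = {
    g: counts["blue"] / (counts["blue"] + counts["red"])
    for g, counts in shown.items()
}
print(blue_share)
```

If the assignment had depended on gender, the two shares would drift apart, which is exactly the association the example warns about.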
We can show the results from this test in a contingency table

To discuss contingency tables more efficiently we should go over some basic statistics lingo. We denote the column variable (converted: yes/no) as Y and the row variable (button:blue/red) as X. These are called categorical variables and differ from numerical variables in that their values have no meaningful numerical interpretation. A categorical variable can take on several levels, which are represented by the rows and columns in the contingency table.
The cell frequencies are denoted as nij where i represents the row number and j the column number. To further clarify, n11 is the number of visitors who saw the blue button and converted, while n22 is the number of visitors who saw the red button and didn’t convert. We write the total count of visitors seeing the blue button as n1+, which means “the sum over all columns for row 1”.
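The table with these notations looks like this:

                Converted: yes   Converted: no   Total
Button: blue    n11              n12             n1+
Button: red     n21              n22             n2+
Total           n+1              n+2             n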

Still with me? Let’s start filling this bad boy with real data!
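To make the notation concrete, here is a small Python sketch that stores a 2x2 table and computes the margins. The counts are invented for illustration and are not results from Klas's test.

```python
# Hypothetical counts, laid out as in the notation above:
# rows = button (blue, red), columns = converted (yes, no).
table = [
    [24, 976],   # n11, n12
    [18, 982],   # n21, n22
]

row_totals = [sum(row) for row in table]        # n1+, n2+
col_totals = [sum(col) for col in zip(*table)]  # n+1, n+2
n = sum(row_totals)                             # total sample size

print(row_totals, col_totals, n)
```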
When a visitor arrives at your page one of two things can happen: she converts or she doesn’t convert. A single “experiment” like this is called a Bernoulli trial, and the data, which can only take two values such as true or false, yes or no, converted or didn’t convert, is called binary. After any given sample size you will have a count of how many of these visitors converted. This count has an uncertainty to it even if the true conversion rate is known, and can be modeled by a probability distribution, the binomial distribution. For those of you who don’t know what a probability distribution is I think Wikipedia can explain it as well as anyone so I’ll shamelessly send you there!
The binomial distribution
The binomial distribution lets us model count data and determine the probability of a certain outcome; for example the probability of us seeing 10 conversions after 1000 visits with a 2% conversion rate.
The binomial distribution’s probability mass function (pmf, which is basically the equivalent of a continuous distribution’s density function) has the parameters n=sample size and p=probability of success (conversion rate), and gives the probability of seeing exactly k conversions:
$$P(K = k) = \binom{n}{k} p^{k} (1-p)^{n-k}$$

We can plot the function with parameters n=1000 and p=0.01 against k=1,2,…,20 which will show you how probable different values of k will be for the given parameters
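If you want to play with the numbers yourself, here is a minimal Python sketch of the binomial pmf using only the standard library. The function name `binom_pmf` and the text bar chart are my own choices for illustration; the parameters are the ones mentioned above.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k conversions in n visits) for conversion rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# The earlier example: 10 conversions after 1000 visits at a 2% rate.
p_ten = binom_pmf(10, 1000, 0.02)
print(f"P(10 conversions | n=1000, p=0.02) = {p_ten:.5f}")

# pmf for n=1000, p=0.01 over k=1..20, drawn as a crude text bar chart.
pmf = {k: binom_pmf(k, 1000, 0.01) for k in range(1, 21)}
for k, prob in pmf.items():
    print(f"{k:2d} {prob:.4f} {'#' * round(prob * 200)}")

# The k with the highest probability in this range.
most_likely_k = max(pmf, key=pmf.get)
print("most likely k:", most_likely_k)
```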

Quiz: How do you calculate the conversion rate?
Silly question really, I bet every one of you knows that you divide the number of conversions by the total number of visits right? But what you are doing here is something much cooler: you’re maximizing the binomial pmf, given sample size and conversions! It makes sense doesn’t it, choosing the p that maximizes the probability of us observing the outcome we observed?
Maximum-Likelihood
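To find the p that maximizes the pmf we can set the derivative of the mass function equal to 0 and solve for p. Since the p that maximizes the mass function also maximizes the logarithm of the function, and since the logarithm is simpler to differentiate, we’ll maximize the log of the pmf instead:
$$f(p) = \log\binom{n}{k} + k\log p + (n-k)\log(1-p)$$
$$f'(p) = \frac{k}{p} - \frac{n-k}{1-p} = 0 \quad\Rightarrow\quad \hat{p} = \frac{k}{n}$$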

By plotting the function f(p) for n=1000 and k=20 you can visually confirm this
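You can also confirm it numerically. The sketch below evaluates the log of the pmf on a grid of candidate values for p and picks the largest. The grid resolution is an arbitrary choice of mine, and the constant term log C(n, k) is left out because it doesn't depend on p.

```python
from math import log

def log_likelihood(p, n, k):
    """Log of the binomial pmf as a function of p, dropping the constant log C(n, k)."""
    return k * log(p) + (n - k) * log(1 - p)

n, k = 1000, 20

# Candidate conversion rates 0.0001, 0.0002, ..., 0.9999.
grid = [i / 10_000 for i in range(1, 10_000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, n, k))

print(p_hat, k / n)  # the grid maximum should coincide with k/n
```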

This estimation method is, not entirely surprisingly, called maximum-likelihood estimation. The estimate of the parameter p will of course also be a random variable, but one of the great properties of the ML method is that, for large samples, the resulting estimator will be approximately normally distributed. The normal is great to work with for several reasons but one of the things I love about it is that most of you are familiar with its shape and have probably encountered it at some point.
We can get a visualization of the ML estimation of p by constructing a binomial random variable with a given n and p, then drawing 10000 samples from it and estimating p with the ML method for each sample. You will then get 10000 estimated values of p which we can plot in a histogram to get an idea of how the estimate is distributed
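A rough Python version of that simulation, using only the standard library. I'm assuming n=1000 and p=0.02 here, and I draw 2000 samples instead of 10000 just to keep the pure-Python loop quick; bump `draws` up if you have the patience.

```python
import random
import statistics
from collections import Counter

def simulate_p_hats(n, p, draws, seed=0):
    """Simulate `draws` tests of n visits each and return the ML estimate k/n from each."""
    rng = random.Random(seed)
    p_hats = []
    for _ in range(draws):
        conversions = sum(rng.random() < p for _ in range(n))
        p_hats.append(conversions / n)
    return p_hats

p_hats = simulate_p_hats(n=1000, p=0.02, draws=2000)
mean_p_hat = statistics.mean(p_hats)
sd_p_hat = statistics.stdev(p_hats)
print(f"mean={mean_p_hat:.4f} sd={sd_p_hat:.4f}")

# Crude text histogram: one row per distinct estimate.
for value, count in sorted(Counter(p_hats).items()):
    print(f"{value:.3f} {'#' * (count // 10)}")
```

The standard deviation should land close to sqrt(p(1-p)/n), which is about 0.0044 for these parameters.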

Looking at this histogram you can ask yourself what happens to the symmetry of the normal distribution when p is even smaller, like 0.5-1%? The short answer is that it gets increasingly skewed as p approaches 0, and the normal approximation doesn’t work all that well. This is really not uncommon for the type of conversion rates you see when working with ecommerce purchases and even at 2% like the histogram above you clearly see that the distribution is skewed. You need to keep this in mind when working with the estimation of conversion rates, but you can still generally apply the normal approximation unless your conversion rate is really low.
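One way to put a number on that skew is the skewness of the binomial distribution, (1 - 2p) / sqrt(np(1 - p)). The sketch below tabulates it for a few conversion rates, with n=1000 as an assumed sample size. A value of 0 means perfectly symmetric.

```python
from math import sqrt

def binomial_skewness(n, p):
    """Skewness of a binomial count; the estimate k/n has the same skewness."""
    return (1 - 2 * p) / sqrt(n * p * (1 - p))

for p in (0.005, 0.01, 0.02, 0.1, 0.5):
    print(f"p={p:<6} skewness={binomial_skewness(1000, p):.3f}")
```

The skewness also shrinks as n grows, which is why the normal approximation can still be usable for low conversion rates if you have enough traffic.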
All from one – joint probabilities
Until now we’ve looked at individual cell frequencies, but we may also want to look at the joint probability distribution of multiple cell frequencies. This models multiple cell frequencies as generated from a single random variable.
The binomial distribution only handles response variables which are binary, and now we want 4 different values to be generated from it. So we need a more general distribution which doesn’t limit the number of possible outcomes to 2.
Let pij denote the probability of an event resulting in the cell at row i and column j, in other words
$$p_{ij} = P(X = i, Y = j)$$
With the restriction
$$\sum_{i}\sum_{j} p_{ij} = 1$$

The cell frequencies {nij} for i=1,2 and j=1,2 follow a multinomial distribution with parameters {pij}. You can think of a multinomial distribution as a binomial distribution, but instead of only having 2 possible outcomes the multinomial has k possible outcomes. So for example if you flip a coin 100 times, the number of heads will be binomially distributed. If you roll a die, which has 6 possible outcomes, 100 times the number of sixes will be…. binomially distributed as well! But the total outcome {n1, n2,…,n6 } will follow a multinomial distribution with the parameters n=100 and p={1/6, 1/6, 1/6, 1/6, 1/6}. Wait what? There are 6 possible outcomes but only 5 probability parameters? This is because we have a restriction on these probabilities, in that they must sum to 1. This makes one of the parameters redundant, since it is entirely dependent on the rest of the parameters.
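Here is a small Python sketch of the die example. To keep the enumeration tiny I use 4 rolls instead of 100, and I pass all six probabilities to the function for convenience even though the sixth is implied by the other five. The helper names are my own.

```python
from math import comb, factorial, prod

def multinomial_pmf(counts, probs):
    """P(observing exactly these counts) when each trial lands in category i with probs[i]."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)  # exact at every step
    return coef * prod(p**c for p, c in zip(probs, counts))

def count_vectors(n, k):
    """Yield every way to split n rolls across k faces."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in count_vectors(n - first, k - 1):
            yield (first,) + rest

rolls = 4
fair_die = [1 / 6] * 6

# All possible outcomes together must have probability 1.
total = sum(multinomial_pmf(c, fair_die) for c in count_vectors(rolls, 6))

# Summing out the other five faces leaves a binomial for the number of sixes.
sixes = 1
marginal = sum(
    multinomial_pmf(c, fair_die) for c in count_vectors(rolls, 6) if c[-1] == sixes
)
binomial = comb(rolls, sixes) * (1 / 6) ** sixes * (5 / 6) ** (rolls - sixes)

print(total, marginal, binomial)
```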
The multinomial distribution is one of those distributions which you should know a little bit about, if nothing else to be able to impress girls by giving them the probability of certain events when rolling a die multiple times. That will surely set the mood.

Model selection
Alright cool, so we model the individual cell frequencies as binomial random variables and the joint probabilities as multinomial? Nope, not necessarily. One thing you need to know about statistics is that there is usually more than one way to model the data. The probability distribution of the cell frequencies will depend on how you set up the test, what you want to model and how you sample the data. Yay for uncertainty!
Say we’re setting up the test and we know which factors we are interested in and what the response is (conversion rate) and we will be logging the results in a contingency table as shown in figure 1. At this point the sample size n is unknown, and is thus a random variable. In this scenario we can’t model the cell frequencies as binomial yet; instead we can model them as independently Poisson distributed! The Poisson model views the sample size n as random rather than fixed, while the binomial and multinomial distributions require n to be fixed. We’ll get back to the Poisson distribution in more detail in later posts, but it can be good to know that if the cell counts {nij} are independent Poisson random variables, then their joint distribution conditioned on the total sample size N, P({nij} | N=n), is multinomial.
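If you don't want to take my word for the Poisson-to-multinomial link, you can check it numerically. The sketch below uses made-up Poisson means and cell counts for a 2x2 table (kept small so the arithmetic stays well within floating point range) and compares the conditional probability with the multinomial pmf using pij = μij / Σμ.

```python
from math import exp, factorial, prod

def poisson_pmf(k, mu):
    """P(count = k) for a Poisson variable with mean mu."""
    return exp(-mu) * mu**k / factorial(k)

# Hypothetical Poisson means and observed counts for the four cells.
mus = [3.0, 5.0, 2.0, 6.0]
counts = [2, 4, 1, 5]
n = sum(counts)

# Independent cells: the joint probability is the product of the four pmfs.
joint = prod(poisson_pmf(c, mu) for c, mu in zip(counts, mus))

# A sum of independent Poissons is Poisson with the summed mean.
conditional = joint / poisson_pmf(n, sum(mus))

# Multinomial pmf with n trials and cell probabilities mu / sum(mus).
coef = factorial(n)
for c in counts:
    coef //= factorial(c)
multinomial = coef * prod((mu / sum(mus)) ** c for c, mu in zip(counts, mus))

print(conditional, multinomial)
```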
Now consider the case where we have already run the test and have a full contingency table as in figure 2. The number of visits, n, is now known and considered fixed. We can choose to view the row and column totals as random, while the sample size n is fixed. We can then model this data as coming from a multinomial distribution with I*J parameters (in our case 2*2=4 parameters).
We can also choose to view the row or column totals as fixed. In this case we look at the conditional probability at different levels of the factors. So for example if we decide to dismiss the randomness of the explanatory variable X and view it as fixed, we will have I multinomial distributions with J parameters. Again, in our case we would have I=2 distributions with J=2 parameters, so we would be looking at 2 binomial distributions. The joint probability in this model would be the product of the 2 binomial distributions.
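As a sketch of the fixed-row-totals model, the Python below writes the joint log-likelihood as the sum of two binomial log-pmfs (the log of their product) and estimates each conversion rate separately. The counts are invented for illustration.

```python
from math import comb, log

def binom_log_pmf(k, n, p):
    """Log of the binomial pmf."""
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)

# Hypothetical results with the row totals (visits per button) treated as fixed.
blue = {"conversions": 24, "visits": 1000}
red = {"conversions": 18, "visits": 1000}

def joint_log_likelihood(p_blue, p_red):
    """Log of the product of the two binomial pmfs, one per button."""
    return (
        binom_log_pmf(blue["conversions"], blue["visits"], p_blue)
        + binom_log_pmf(red["conversions"], red["visits"], p_red)
    )

# Each row is its own binomial, so each rate is estimated from its own row.
p_blue_hat = blue["conversions"] / blue["visits"]
p_red_hat = red["conversions"] / red["visits"]

print(p_blue_hat, p_red_hat, joint_log_likelihood(p_blue_hat, p_red_hat))
```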
Another way to do it is to run the test for say a week. We then sample 1000 visits that converted and 1000 visits that didn’t convert and look at the number of times the blue and red buttons were shown for each segment. This turns things around a bit: instead of looking at the conversion as the response variable we’re actually looking at the button color as the response. By the sampling design the column totals are fixed, so we have J multinomial distributions with I parameters. Again, in our case I=2 and J=2 so we have 2 binomial distributions but this time with a different response. The problem with this sampling design is that without the overall conversion rate of the population we can only give information on the probability of X given Y and not the more useful case of Y given X. However if we know the overall probability of converting, P(Y) unconditioned, we can actually get the reverse conditional probability as well by Bayes’ theorem
$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}, \qquad P(X) = \sum_{y} P(X \mid Y = y)\,P(Y = y)$$
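A quick Python sketch of that reversal. The sampled shares (540 of 1000 converters and 500 of 1000 non-converters saw the blue button) and the 2% overall conversion rate are hypothetical numbers I picked for illustration.

```python
def p_convert_given_button(p_button_given_yes, p_button_given_no, p_yes):
    """Bayes' theorem: P(converted | button) from P(button | converted or not) and P(converted)."""
    numerator = p_button_given_yes * p_yes
    evidence = numerator + p_button_given_no * (1 - p_yes)  # P(button)
    return numerator / evidence

# Hypothetical sample: share of each segment that was shown the blue button.
p_blue_given_yes = 540 / 1000
p_blue_given_no = 500 / 1000
p_yes = 0.02  # assumed known overall conversion rate

p_yes_given_blue = p_convert_given_button(p_blue_given_yes, p_blue_given_no, p_yes)
print(f"P(converted | blue) = {p_yes_given_blue:.4f}")
```

Note that if the blue button is equally common among converters and non-converters, the function just returns the overall conversion rate, as it should.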

As you can see there are several ways that we can conduct this experiment and several sampling models to choose from, so which one should we choose?
For our case of examining the difference in conversion rates between the blue and red buttons I would choose a model where we regard the row totals as fixed and the total sample size as fixed. This way we get two separate binomial distributions, one for the red and one for the blue button which is also very intuitive. Besides, since we control the button color the randomness of X is really not meaningful for us. This is the most common model to use when working with experiments like this.
You should also know that you don’t have to set it up like a contingency table, but what I’ve said still applies. Contingency tables are just a good way of summarizing results, and are easy and intuitive to work with. It’s also easy to do inference with a contingency table, but I’ll save that for the next post!
I bet you can’t wait…