Content aggregation

August 15th, 2011 § 0 comments

I was recently working with a website which had TONS of articles in a rather unordered fashion. The marketers wanted a way to be able to get a quick overview of which types of articles were being consumed. The problem was that they didn’t order the articles in categories and didn’t tag their articles with keywords. At the same time, the number of articles was in the thousands, which is why normal un-aggregated content drilldowns give little insight.

The way we solved this was pretty cool, which is why I want to write a small post about it. We built an algorithm that uses statistical/data mining/machine learning (pick the one which is currently most hyped) techniques to mathematically model the underlying semantic structure of the articles. We assume that each article consists of T topics (the number of topics, T, is chosen more or less arbitrarily) and we try to find the distribution of topics for each article. Articles which are similar to each other should have similar topic distributions.

We can then group together articles in k (k is a number also chosen by us) groups using clustering methods with either distribution similarity measures, or by representing the distribution as a vector in T dimensions and using a standard cosine similarity.

For each group, we can then find which words are associated with the group of articles. The idea behind this is that choosing the top 5-10 keywords will summarize the group’s content. We can then send the group information to Google Analytics or whatever data gathering tool you use as a virtual pageview.

As an example, say we have 1000 articles. We choose to model these as T=100 topics and k=30 groups. We assign every article to a group, and find 10 keywords which explain the group. When an article is read, send an extra pageview with the top 10 keywords (as a single string) as the page URL, and set up profiles and filters to correctly report the number of pageviews.

There, we’ve just managed to reduce the dimension of the content drilldown from 1000 to 30. Much more manageable!
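The grouping step can be sketched roughly as follows. This is only an illustration: it assumes the topic distributions have already been estimated by some topic model, and the vectors, the centroids and the helper names (`cosine_similarity`, `assign_to_groups`) are made up for the example, not taken from the actual implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_to_groups(topic_dists, centroids):
    """Assign each article to the index of its most similar centroid."""
    return [
        max(range(len(centroids)),
            key=lambda g: cosine_similarity(dist, centroids[g]))
        for dist in topic_dists
    ]

# Hypothetical topic distributions for four articles over T=3 topics.
articles = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
]
# Hypothetical group centroids (k=2), e.g. from an earlier clustering pass.
centroids = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

groups = assign_to_groups(articles, centroids)
print(groups)
```

A full clustering run would alternate this assignment step with re-computing the centroids; only the assignment is shown here.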

This technique can also be used for smaller websites where you, for whatever reason, want to represent a page in Google Analytics as a list of keywords instead of by page title (or URI *shrug*).

If there is enough interest, I can write a post on the process with all the math involved. I have a funny feeling my mail box will be empty the coming days…

Aggregated data rules

June 9th, 2011 § 3 comments

If I hear someone preach “stop looking at aggregated data” again at a web analytics seminar/conference I will throw a normal distribution with really really low variance at him.

At what point does data go from “aggregated” to “segmented”? Is all the collected data on the internet aggregated? Is your website just a segment of all the data out there? Is your converted traffic “segmented data” even though there are sub-groups (different sources, countries, genders)? What value does a full segmentation have?

You need to be looking at aggregate data in order to make any kind of inference. Perfectly segmented data is useless. I know as well as you what analysts mean when they say “don’t look at aggregate data”, but I think the discussion is annoying for two main reasons:

1) It’s wrong, as I’ve stated. To make inference and draw business decisions from the data, you need aggregated data.

2) It’s so blatantly obvious (the point of the statement). If you were a statistician, data miner or any other kind of data analyst do you think seminars would ever tell you that “it’s important to include variables in your models”? This is equivalent to saying “segment your data”. You can add variables to a model to hopefully describe the data better, without increasing the complexity too much. When you add too many variables (too many segments) the model becomes too complex and loses its intuition, which makes it harder to do inference and come to any valuable insights.

So to sum it up: Segment your data, but keep it aggregated!

MVT part 4: Testing on a broken website

May 31st, 2011 § 0 comments


This is not a part of the series, more of a footnote.

I want to highlight that A/B and multivariate testing is a small part of web analytics, and of near negligible importance to running a successful website. If you have a website that doesn’t look horrible but it’s still not converting, no “button color” in the world will save it. Re-evaluating and analysing your marketing strategy (are you driving the correct audience?) and content (do we deliver the content that they want?) will push you further towards where you want to be.

Multivariate testing will never save you, just make your life a little bit better!

Book Tip: Categorical Data Analysis

May 24th, 2011 § 0 comments

After the series on multivariate testing, I thought it would be appropriate to give you a book tip if you want to delve deeper into categorical variables (variables that lack a meaningful numerical interpretation).

Check out Agresti’s book Categorical Data Analysis. In my opinion it has a near perfect balance between theory, examples and intuition, and is relatively easy to read for a statistics book.

Oh and no I don’t have any affiliation with Wiley, the author or Amazon.

Uncertainty in web optimization

April 16th, 2011 § 0 comments

In a previous post we talked about the data we collect during multivariate testing. This post will delve deeper into one source of uncertainty in this data that causes our assumptions to be invalid. If you don’t have basic knowledge about the binomial distribution you should first read the post Multivariate testing part I: The data.

What is a statistical model?
A statistical model is a model that simplifies and maps the real world down to a few factors. A statistician tries to factor in relevant variables to explain the real world in a simple form, while keeping the model decently accurate. This becomes a balance between model complexity and usefulness, and the statistician wants to find a sufficient number of variables to explain a real world situation well enough.
What “good enough” means is subjective and depends on the application. For web analysts “good enough” is far less accurate than it may be for a biostatistician, for example!

While variable reduction is essential in any statistical model, failing to introduce important variables in a model can put the validity and the usefulness of the model at risk; and some variables are damn hard to spot!

Let’s look at a contingency table (columns are conversion and rows are button color):

[contingency table: button color (rows) by conversion (columns)]
What are the sources of randomness in this data?

First of all, how did we select which visitor sees what button? You probably have a testing environment where there is a probability p of showing the blue button and a probability 1-p of showing the red button. So one source of randomness in the data comes from how the groups were selected.
How silly, we forced this randomness into the data ourselves by the experiment design! Why didn’t we just show the blue button to every other visitor and the red to the rest to avoid another source of randomness?

The reason is simple: We want to protect ourselves from systematic selection of test subjects! Even if you can’t intuitively explain why a sampling model would cause a systematic selection you:

a) can never be sure that you haven’t missed something. Relying on human intuition to find complex and underlying associations can sometimes be troublesome since our intuition is limited.

b) Even if you are correct in your assumption that your sampling choice has no systematic effect you will be at a weak position when defending your experiment to a critic. This may play a smaller role when optimizing a web site than it does for medical trials, but since it’s not harder to implement this randomness it should be best practice for web analysts as well.

So should we take this randomness into consideration in our model? Since the purpose of this randomness is to eliminate any systematic selection and to protect ourselves, it has little value to us and should not contain any information that is useful. Adding this would add complexity to the model without gaining usefulness so we choose not to model it.
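The random assignment discussed above can be sketched in a few lines. This is a minimal illustration, assuming a simple per-visitor coin flip; the function name `assign_button` and the fixed seed are choices made for the example, not a description of any particular testing tool.

```python
import random

def assign_button(p_blue=0.5, rng=random):
    """Return 'blue' with probability p_blue, otherwise 'red'."""
    return "blue" if rng.random() < p_blue else "red"

# A fixed seed is used only to make this sketch reproducible.
rng = random.Random(42)
assignments = [assign_button(0.5, rng) for _ in range(10_000)]

share_blue = assignments.count("blue") / len(assignments)
print(round(share_blue, 3))
```

Because the choice depends only on the random draw, it cannot line up systematically with anything about the visitor, which is exactly the protection we are after.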

The second source of randomness I want to talk about is the number of conversions for each button. This, on the other hand, is a randomness that we want to model since it gives us essential information. As we saw in the multivariate testing series a common way to model this would be to view each conversion count for the two buttons as coming from independent binomial distributions. The conversion rates can then be estimated through ML-estimation, which results in a conversion rate estimator which is asymptotically normally distributed.

Is this really sensible though? The binomial distribution is constructed as the sum of Bernoulli trials. If you’ve forgotten, a Bernoulli trial is a single experiment with a binary response, so in our case each visitor can be seen as a Bernoulli trial with probability p of converting.

The problem is that in today’s internet culture visitors come to a website for very different reasons. Some want to browse for product information, some want to find your physical stores, some want to buy a product and some… well you get the point. Intuitively, the propensity to convert is polarized; on one side we have those who never intend to buy something, while on the other side we have those who are determined to buy.

However, in our binomial model for the conversion count we are making a huge assumption: all visitors are equally likely to convert. This is far from the real world, and potentially this assumption can cause huge errors in risk assessment and decisions based on our tests. Since there is an additional source of variation in our data that we can’t model, we almost always underestimate the variance, and the confidence intervals we construct around our tests end up too narrow.

A model which is closer to reality would be that each individual has his or her own binomial distribution with n=1 and some conversion rate p (a binomial distribution with n=1 is equivalent to a Bernoulli distribution). Modelling each individual not only results in an extremely complex model, but it also gives us no information about which button performs the best! That is basically the definition of a sucky model!

So what can we do? If we can predict visitor intent we would largely dispose of this problem. We could then segment users on their intent and factor this into the model to gain a ton of information on the variability in the data.

Implementing segmentation on user intent is costly and complex and I honestly don’t know if there are any good standard solutions for this today. So the financial risk of a faulty decision will probably have to be substantial in order to motivate such an implementation for just a test. However if you have the means to make such a segmentation you have still created yourself a gold mine of business value.

A more realistic solution is to choose a different modelling distribution for the response. A commonly used distribution for modelling a varying conversion rate is the beta-binomial distribution, which models the conversion rate as unknown or random.
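To get a feel for how much the variance can be underestimated, here is a small sketch comparing the binomial variance with the beta-binomial variance for the same mean conversion rate. It assumes the conversion rate is itself a draw from a Beta(a, b) distribution; the values of n, a and b are made up for illustration.

```python
def binomial_variance(n, p):
    """Variance of a Binomial(n, p) conversion count."""
    return n * p * (1 - p)

def beta_binomial_variance(n, a, b):
    """Variance of a conversion count whose rate follows Beta(a, b)."""
    return n * a * b * (a + b + n) / ((a + b) ** 2 * (a + b + 1))

n = 1000           # hypothetical number of visitors
a, b = 2.0, 98.0   # hypothetical Beta parameters, mean rate a/(a+b) = 2%
p = a / (a + b)

var_bin = binomial_variance(n, p)
var_bb = beta_binomial_variance(n, a, b)

# The inflation factor is (a + b + n) / (a + b + 1), always >= 1.
print(var_bin, var_bb, var_bb / var_bin)
```

The more the conversion rate is allowed to vary (small a + b), the larger the gap between the two variances becomes.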

Whether this is a real problem or not depends on the financial risk associated with the action you recommend based on the test. The bigger the risk, the more certain you and the company have to be that your test is accurate.

Final comments:
Segmenting on user intent will for most of us not be an option. The point of this post is more to highlight that we make a lot of assumptions in many of our tests, and that they influence the validity of them. In this case we assume a binomial distribution, which turns out not to hold. Whether this is a real problem or not is entirely dependent on the application.

You have probably at some point used a spreadsheet like the one explained in a post by Avinash Kaushik to test for differences between two conversion rates. This spreadsheet uses the same method as I described in Multivariate testing part II: Associations where we assume that the conversion rates are independently normally distributed with the variance taken from the binomial distribution.
Use these simplified tests because they still give a good pointer to correct decisions, but keep in mind that statistical significance in these tests may be a result of underestimating the variance. So the actual margin of error will likely be higher than what you think.

Multivariate testing part III: Regression

March 23rd, 2011 § 1 comment

In part II we looked at how we can find simple associations between an explanatory variable (button color) and a response (conversion). This quickly becomes a royal pain in the behind and is unsuitable for general cases where we can have a lot of explanatory variables with several levels (blue, red…) each.

Regression models are used to model the relationship between a response and one or more explanatory variables. Essentially we’re constructing a function of the explanatory variables, with the response as an output, and our goal is to find the function that best describes the data!

The ordinary least squares regression is usually the first type of regression model you encounter when taking a basic statistics course. This is a linear model; but note that the linearity of linear models simply means that we assume linearity in the parameters, not in the data! For example y=a+b*x^2+e is still a linear regression, even though y grows quadratically as x increases. This is a common misconception.
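A quick way to convince yourself of this is to fit y=a+b*x^2 with plain least squares by treating z=x^2 as the explanatory variable. The sketch below uses made-up, noise-free data and a hypothetical helper `ols_simple`; it is only meant to illustrate the point about linearity in the parameters.

```python
def ols_simple(z, y):
    """Least-squares intercept and slope for the model y = a + b*z."""
    n = len(z)
    mean_z = sum(z) / n
    mean_y = sum(y) / n
    s_zz = sum((zi - mean_z) ** 2 for zi in z)
    s_zy = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y))
    b = s_zy / s_zz
    a = mean_y - b * mean_z
    return a, b

# Made-up data generated from y = 1 + 2*x^2 without any noise.
x = [0, 1, 2, 3, 4]
y = [1 + 2 * xi ** 2 for xi in x]

# The model is curved in x but linear in the parameters a and b.
z = [xi ** 2 for xi in x]
a, b = ols_simple(z, y)
print(a, b)
```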

So both of the regression lines (the blue lines) below are fitted with ordinary least squares regression, which is a linear model:

(fig 1: two plots, "linear regression" and "linear model")

Ordinary linear regression assumes a normally distributed response variable with constant variance. As you may remember, the conversion rate estimate is asymptotically normally distributed, so can we use that to model our data? Nope, unfortunately the constant variance assumption fails, since the variance of the conversion rate is a function of the conversion rate itself. There is also a structural problem with the ordinary regression in that it allows the predictions to take any value, and we need to limit our response to take on values between 0 and 1.

Luckily, some smart people have come up with Generalized Linear Models (GLM) which is just what we are looking for; a way to model different types of responses and loosen the variance assumption. The GLM consists of 3 main components. The random component is the response (conversion), the systematic component is the set of explanatory variables that explain the random component, and the link function is what connects the random and systematic components. One assumption that the GLM makes is that the response distribution should be from a natural exponential family, but this family includes most of the common distributions such as the normal distribution, binomial, multinomial, Poisson etc.

We now need to find a way to model the conversion rate. We need a link function that does not limit the values of the systematic components, but that does limit the values of the fitted model between 0 and 1. We also need to be able to account for non-linear behaviour, since when we are close to 0 or 1 it is not entirely crazy to assume that effects of the explanatory variable diminish. For example, while a change of layout of your product page may push you from a 1% to a 2% conversion rate, you would probably not convert 100% of your customers with the new design if you were previously at 99% conversion rate.

One of the simplest functions that meets these requirements is:

$$p(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$

(eq 1)

And this is the logistic regression, where alpha and beta are the parameters we want to estimate!
This function is far from linear though, since the parameters are in the exponent of e.
We can however transform p(x) to create linearity in the parameters by:

$$\log\left(\frac{p(x)}{1 - p(x)}\right) = \alpha + \beta x$$

(eq 2)

This is the most common link function for the logistic regression model, and is called the logit.
Remember in Part II when we talked about odds? Look at the left hand side of the equation in eq 2 again; doesn’t it look awfully much like a log odds? That’s probably because it is one!
By exponentiating both sides of the equation we see that the odds are an exponential function of x and this gives us one way to interpret beta: a one unit increase in x has a multiplicative effect of exp(beta) on the odds of converting! We see this by

$$\frac{p(x)}{1 - p(x)} = e^{\alpha + \beta x} = e^{\alpha}\left(e^{\beta}\right)^{x}$$

(eq 3)

Equivalently, a one unit increase in x has an additive effect of beta on the logarithm of the odds as shown in eq 2.
While not being the most intuitive interpretation, the magnitude and sign (+ or -) of beta gives us a lot of information about how the variable affects the response.

There are near infinite ways to transform the data, but the logit gives us a decently easy interpretation of the results and also has a non-restricted range (the untransformed probability is restricted between 0 and 1) which makes it easier to model.
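The interpretation of beta can be checked numerically with a few lines of code. This is just a sketch; the values of alpha and beta are made up, and the helper names `logistic` and `odds` are chosen for the example.

```python
import math

def logistic(x, alpha, beta):
    """Conversion probability p(x) under the logistic model."""
    eta = alpha + beta * x
    return math.exp(eta) / (1 + math.exp(eta))

def odds(p):
    """Odds corresponding to the probability p."""
    return p / (1 - p)

alpha, beta = -3.0, 0.5  # made-up parameter values

p1 = logistic(1, alpha, beta)
p2 = logistic(2, alpha, beta)

# A one unit increase in x multiplies the odds by exp(beta) ...
odds_multiplier = odds(p2) / odds(p1)
# ... and adds beta to the log odds.
log_odds_step = math.log(odds(p2)) - math.log(odds(p1))

print(odds_multiplier, math.exp(beta), log_odds_step)
```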

When dealing with categorical variables (such as color, which lacks numerical meaning) like we are, we have one parameter for each level of a factor. So in the categorical case we use dummy variables where xi=1 for row i and xi=0 otherwise. Because of constraints in parameter estimation one parameter for each factor will be redundant. This means that for our case of one predictor (button color) and two colors we would have 2-1=1 parameter + an intercept.
You don’t have to use only categorical variables; you can just as well include continuous variables like the loading time of the page.

The parameters in the model are found with our trusty ML-method (Maximum-Likelihood).
Let yi represent the conversion count for combination i of the explanatory variables. In the case of our button color example, we would have y1 and y2 representing the number of conversions for each color. The counts {Y1,Y2,…,YN} are then independent binomials, and their joint probability distribution is proportional to the product of the N binomial functions. To keep it simple, I’ll focus on the case where we only have one explanatory variable. The likelihood function is then given by:

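$$L(\alpha, \beta) \propto \prod_{i=1}^{N} p(x_i)^{y_i} \left(1 - p(x_i)\right)^{n_i - y_i}$$

(eq 4)

Here ni is the number of visitors who saw combination i. This is the function we want to maximize with respect to alpha and beta. To find the maximum we start by using the definition of p(x) given above and take the log of the likelihood function (dropping the constant that does not depend on alpha and beta):

$$\log L(\alpha, \beta) = \sum_{i=1}^{N} y_i (\alpha + \beta x_i) - \sum_{i=1}^{N} n_i \log\left(1 + e^{\alpha + \beta x_i}\right)$$

(eq 5)

Starting to look a bit scary… So let’s take the derivatives of this with respect to alpha and beta!

$$\frac{\partial \log L}{\partial \alpha} = \sum_{i=1}^{N} \left(y_i - n_i\, p(x_i)\right), \qquad \frac{\partial \log L}{\partial \beta} = \sum_{i=1}^{N} x_i \left(y_i - n_i\, p(x_i)\right)$$

(eq 6)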

There we go, much more manageable! Now we can clearly see that the estimates for alpha and beta are… no I’m kidding, it’s still too complex to solve by hand. In fact, we won’t be able to solve this by hand no matter what we do with it. Kind of an anti-climax but I DID say it was theory…
To find the parameters you can use numeric methods such as Newton-Raphson (the usual choice for GLMs) or gradient descent on the negative log likelihood.
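As a rough illustration of one such numeric method, here is a small Newton-Raphson sketch for the one-variable case. The counts are hypothetical (they are not Klas’s data), and `fit_logistic` is a name made up for this example rather than a function from any library.

```python
import math

def fit_logistic(x, n, y, iterations=25):
    """Newton-Raphson ML fit of logit(p) = alpha + beta*x for grouped data.

    x: explanatory value per group, n: visitors per group,
    y: conversions per group.
    """
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        p = [1 / (1 + math.exp(-(alpha + beta * xi))) for xi in x]
        w = [ni * pi * (1 - pi) for ni, pi in zip(n, p)]
        # First derivatives of the log likelihood.
        g_a = sum(yi - ni * pi for yi, ni, pi in zip(y, n, p))
        g_b = sum(xi * (yi - ni * pi) for xi, yi, ni, pi in zip(x, y, n, p))
        # Negative second derivatives (the 2x2 information matrix).
        i_aa = sum(w)
        i_ab = sum(xi * wi for xi, wi in zip(x, w))
        i_bb = sum(xi * xi * wi for xi, wi in zip(x, w))
        det = i_aa * i_bb - i_ab ** 2
        # Newton step: parameters += inverse(information) * gradient.
        alpha += (i_bb * g_a - i_ab * g_b) / det
        beta += (i_aa * g_b - i_ab * g_a) / det
    return alpha, beta

# Hypothetical test: x=1 is the blue button, x=0 the red button.
x = [1, 0]
n = [1000, 1000]
y = [60, 40]

alpha_hat, beta_hat = fit_logistic(x, n, y)
print(alpha_hat, beta_hat, math.exp(beta_hat))
```

With a single two-level factor, exp(beta) from the fit coincides with the sample odds ratio of the 2x2 table, which is a handy sanity check.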

To calculate the covariance matrix of the parameters (the matrix containing all the parameters’ variances and their covariances) we take the inverse of the matrix whose elements consist of the negative of the second derivative of the log likelihood function with respect to both alpha and beta for each combination of the parameters. My own head exploded a little bit by reading that, so I don’t fault you if you feel slightly confused. To show you, each cell (a,b) of the information matrix is calculated by:

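$$X_{a,b} = -\frac{\partial^2 \log L(\theta)}{\partial \theta_a \, \partial \theta_b}, \qquad \theta = (\alpha, \beta)$$

(eq 7)

Note that Xa,b=Xb,a
The covariance matrix, evaluated at the ML estimates, is then given by:

$$\mathrm{Cov}(\hat{\theta}) = X^{-1}$$

(eq 8)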

We can use the assumption that the parameters are asymptotically normally distributed, with means equal to their ML-estimates and variances given by the covariance matrix. This lets us do inference on each individual parameter using the same methods as in Part II, and also on the model as a whole (or parts of the model).
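To see the covariance matrix in action, the sketch below builds the 2x2 information matrix for a two-button test, inverts it and uses the result for a 95% interval on beta. The counts are hypothetical, and for this simple two-group case the ML estimates are plugged in from their closed form instead of being fitted numerically.

```python
import math
from statistics import NormalDist

# Hypothetical grouped data: x=1 is the blue button, x=0 the red button.
x = [1, 0]
n = [1000, 1000]
y = [60, 40]

# Closed-form ML estimates for this two-group model.
alpha_hat = math.log(y[1] / (n[1] - y[1]))
beta_hat = math.log(y[0] / (n[0] - y[0])) - alpha_hat

# Information matrix: negative second derivatives of the log likelihood.
p = [1 / (1 + math.exp(-(alpha_hat + beta_hat * xi))) for xi in x]
w = [ni * pi * (1 - pi) for ni, pi in zip(n, p)]
i_aa = sum(w)
i_ab = sum(xi * wi for xi, wi in zip(x, w))
i_bb = sum(xi * xi * wi for xi, wi in zip(x, w))

# Covariance matrix = inverse of the 2x2 information matrix.
det = i_aa * i_bb - i_ab ** 2
cov = [[i_bb / det, -i_ab / det],
       [-i_ab / det, i_aa / det]]

se_beta = math.sqrt(cov[1][1])
z = NormalDist().inv_cdf(0.975)
ci_beta = (beta_hat - z * se_beta, beta_hat + z * se_beta)
print(se_beta, ci_beta)
```

In this special case the standard error of beta equals the standard error of the log odds ratio from Part II, so the two approaches agree.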

Until now we’ve only been looking at main effects, not any interaction effects between variables. It’s not far fetched to think that some combinations of factors help or break each other. For example both the pink text and red button may individually perform well, but together they perform pretty poorly. This is highly relevant for web testing, since the combinations of text/font/color/layout or whatever you’re changing can have major effects.

The model with interaction term is written as:

$$\log\left(\frac{p(x, z)}{1 - p(x, z)}\right) = \alpha + \beta_1 x + \beta_2 z + \beta_3 x z$$

(eq 9)

If x=1 for red button and 0 otherwise, and if z=1 for pink text and z=0 otherwise, exp(beta3) would be the additional multiplicative effect on the odds of converting when both the red button and pink text are active at the same time, on top of their individual effects.
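A small numeric sketch may help with the interpretation of the interaction parameter. All coefficient values below are made up, and `odds_of_converting` is just a helper name for this example.

```python
import math

def odds_of_converting(x, z, alpha, b1, b2, b3):
    """Odds under logit(p) = alpha + b1*x + b2*z + b3*x*z."""
    return math.exp(alpha + b1 * x + b2 * z + b3 * x * z)

# Made-up coefficients: both changes help alone, but clash together.
alpha, b1, b2, b3 = -3.0, 0.4, 0.3, -0.9

base = odds_of_converting(0, 0, alpha, b1, b2, b3)
both = odds_of_converting(1, 1, alpha, b1, b2, b3)

# What the odds would be if the two main effects simply multiplied.
main_effects_only = base * math.exp(b1) * math.exp(b2)

# The interaction scales that product by exp(b3).
interaction_factor = both / main_effects_only
print(interaction_factor, math.exp(b3))
```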

This concludes a basic overview of the theory behind logistic regression. I’m aware that the concepts here are probably at too high a level for complete beginners to grasp everything, but once you have read a book on GLMs or categorical data this post could be used as reference for your web based work in the future.

Multivariate testing part II: Associations

March 7th, 2011 § 0 comments

In the first post of this series, Multivariate testing part I: The data, we constructed a contingency table that summarized our test. In this post we will look at how to test for associations between variables and differences in conversion rates.

While this post may require some basic statistics knowledge the main focus will be on deriving an easy, easy-to-remember way to get a quick feel for the association between variables. This post will not be highly technical or advanced, and many of the concepts discussed should be able to be used and interpreted without being a math geek. If you have any questions or need anything explained further don’t hesitate to throw me an email or leave a comment!

The last part of this series, Part III, will handle more complex tests and general cases, which is why I leave multi-dimensional and large contingency tables out of this post.

Klas the analyst has just finished running a test on his website and summarized the results in a contingency table:

Odds and odds ratio
You’ve probably heard of odds being thrown around in betting. Statisticians talking about odds are not necessarily losing their money on horses, but may just as well be working with the odds in a contingency table.

Usually you see odds in the form of 1:3 but when working with odds in statistics you usually just represent the odds as a single numeric value. For the probability p of a conversion we define the odds as:

$$\text{odds} = \frac{p}{1 - p}$$

Turning it around we can extract the probability of success from the odds by

$$p = \frac{\text{odds}}{1 + \text{odds}}$$

So if you have the odds 1/3 (corresponding to the betting representation 1:3) the probability of success would be (1/3)/(1+1/3)=0.25.
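The two conversions are easy to express in code. The function names below are made up for this sketch.

```python
def odds_from_probability(p):
    """Odds of success for a probability p (0 <= p < 1)."""
    return p / (1 - p)

def probability_from_odds(odds):
    """Probability of success for the given odds."""
    return odds / (1 + odds)

# The example from the text: odds of 1/3 give a probability of 0.25.
p = probability_from_odds(1 / 3)
print(p)

# Converting back and forth returns the value we started with.
round_trip = odds_from_probability(p)
print(round_trip)
```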

With this knowledge we can already construct a simple association statistic between X and Y by looking at the ratios of the odds of converting given button color! The odds ratio is defined as:

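$$\theta = \frac{p_1 / (1 - p_1)}{p_2 / (1 - p_2)}$$

Which after some crunching can be written as:

$$\theta = \frac{\pi_{11}\,\pi_{22}}{\pi_{12}\,\pi_{21}}$$

Here p1 and p2 are the probabilities of converting given the blue and the red button, and πij is the probability that a visitor ends up in cell (i,j) of the table.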

Note that this is the true, unknown, odds ratio! When working with parameters we only estimate this true value using our limited sample, and it’s of (theoretical) importance to distinguish true values from estimates. Using cell frequencies instead of probabilities the sample odds ratio is:

$$\hat{\theta} = \frac{n_{11}\,n_{22}}{n_{12}\,n_{21}}$$

Which is easy to remember since it’s just the cross product of 4 cells!
For Klas’s test the odds ratio would be:

odds ratio example

Which means that the odds of converting are 52% higher for the blue button than for the red button.
Important: This is not the same thing as a 52% higher chance of converting!
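The difference between "higher odds" and "higher chance" is easy to see with numbers. The table below is hypothetical (it is not Klas’s actual data), with rows blue/red and columns converted/not converted.

```python
def sample_odds_ratio(n11, n12, n21, n22):
    """Cross-product ratio of a 2x2 table."""
    return (n11 * n22) / (n12 * n21)

# Hypothetical counts: blue converted/not, red converted/not.
n11, n12 = 60, 940
n21, n22 = 40, 960

odds_ratio = sample_odds_ratio(n11, n12, n21, n22)

# Ratio of the conversion rates themselves, for comparison.
rate_blue = n11 / (n11 + n12)
rate_red = n21 / (n21 + n22)
rate_ratio = rate_blue / rate_red

print(odds_ratio, rate_ratio)
```

For low conversion rates the two ratios end up close to each other, but they drift apart as the rates grow.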

An odds ratio of 1 signals independence between the variables, higher values correspond to a higher odds of converting for the blue button compared to the red. The odds ratio can of course never be negative but it can go to infinity when the odds of converting for the red button equals 0 which happens when n21=0. This scenario is unlikely when working with multivariate testing on the web and if you encounter it you either have too small of a sample or you should double check your tracking and implementation of the test. The odds ratio is undefined if both column values or row values are 0, but this should never matter for us except for at the very early stages of the test or if you have way too many combinations in proportion to the traffic volume.

For large n the distribution of the sample odds ratio is approximately normal, but for smaller values of n it is highly skewed, so we need a fairly large sample size in order to use the normal approximation. Another problem is that the odds ratio can not take negative values, so for values near 0 its distribution becomes even more skewed and the normal approximation even worse.
The logarithm of the odds ratio follows an additive structure rather than a multiplicative one like the un-transformed odds ratio, and thus converges faster towards the normal approximation. It also does not suffer from the problem of limitations in what values it can take. Therefore the log odds ratio is more sensible to use when doing inference.

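The standard error of the sample log odds ratio is estimated by:

$$\mathrm{S.E.}\left(\log\hat{\theta}\right) = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}$$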

With an estimate of the standard error, the odds ratio and the normal approximation we have a recipe for an easy to use confidence interval of the association between X and Y!

To start constructing the confidence interval we first need the mean of the distribution, which is simply the estimation of the log odds ratio: log(1.52)=0.42.
We then need the appropriate quantile of the normal distribution. If we want a 95% confidence interval we want to find the points where 2.5% of the distribution’s mass is below the interval and 2.5% of the total mass is above the interval. These points lie 1.96*S.E (1.96 times the standard error) above and below the mean.
If you want a custom confidence level of your interval you can integrate the normal density function and find the correct values, which is a pain in the butt, or you can look it up in a standard normal table.
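The recipe above can be sketched in code, including the lookup of the quantile for any confidence level. The counts are hypothetical (not Klas’s data) and `odds_ratio_ci` is a name made up for this example; `NormalDist` comes from Python’s standard `statistics` module.

```python
import math
from statistics import NormalDist

def odds_ratio_ci(n11, n12, n21, n22, level=0.95):
    """Normal-approximation confidence interval for the odds ratio."""
    log_or = math.log((n11 * n22) / (n12 * n21))
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    # Quantile leaving (1 - level)/2 of the mass in each tail.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Build the interval on the log scale, then transform back.
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

# Hypothetical counts: blue converted/not, red converted/not.
lower, upper = odds_ratio_ci(60, 940, 40, 960)
print(lower, upper)

# A wider interval for a 99% confidence level.
lower99, upper99 = odds_ratio_ci(60, 940, 40, 960, level=0.99)
print(lower99, upper99)
```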

Constructing a 95% confidence interval for Klas we then get:

confidence interval example

So what can we make of this? We see that the interval includes the ratio 1, so we can’t definitely say that there is a statistically significant difference between the two conversion rates yet. From a business standpoint though we can say that “the test suggests that the blue button has a higher conversion rate, but there is a non-negligible probability that the conversion rate is actually higher for the red button”.

You don’t need statistical significance to implement a change, but it helps to assure you that the button doesn’t end up producing worse results. So remember that even though you have a sample conversion rate for a new button that is higher than your original conversion rate, you also need to look at the interval of the odds ratio in order to make a correct risk assessment.

Single point tests, yuck. Confidence intervals, yay!
Remember in the first post how I mentioned that the ML estimation of the binomial parameter follows a normal distribution? Couldn’t we just have used those estimates to run a simple hypothesis test to see if there is a significant difference between two or more conversion rates? You sure could have!

If you have taken a basic statistics course you have probably done standard significance tests, like the t-test, where you end up with a p-value which tells you if you should reject your null hypothesis or not. In our case a fitting null hypothesis would be that the difference between the two conversion rates is 0.

The problem with tests like this is that we’re testing for a difference which we know is there! As the sample size grows, it becomes increasingly improbable that we won’t find a difference between the conversion rates. Without testing the magnitude of the difference this type of test becomes rather meaningless for us.

My advice to you would be to always construct confidence intervals for your associations between variables or differences between conversion rates. It gives you much more ground to stand on when making business decisions based on the test, rather than just running a simple significance test.
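As a sketch of what such an interval could look like for the difference between two conversion rates, using the normal approximation with binomial variances (the counts are hypothetical and `rate_difference_ci` is a made-up helper name):

```python
import math
from statistics import NormalDist

def rate_difference_ci(conv_a, n_a, conv_b, n_b, level=0.95):
    """Normal-approximation interval for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Variance of each estimated rate taken from the binomial distribution.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = p_a - p_b
    return diff - z * se, diff + z * se

# Hypothetical test: 60 of 1000 converted on blue, 40 of 1000 on red.
low, high = rate_difference_ci(60, 1000, 40, 1000)
print(low, high)
```

The width of the interval tells you how large or small the difference could plausibly be, which a bare p-value does not.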

If you’re interested in more accurate tests and confidence intervals you could read up on the likelihood ratio test. This test looks at the ratio of the likelihood function with our estimated p and the likelihood function under the null hypothesis and draws conclusions from that statistic. It uses more information and makes for a better test, but its added complexity makes it harder to use. Throw me an email if you’re interested in learning how it works!

The odds ratio is great for small tables but getting an overview over larger tables quickly becomes troublesome, and even more so when we add a few more dimensions to our tables. For a two-way table (a table with 2 variables/dimensions, such as ours) with I rows and J columns the number of odds ratios needed to explain the entire data is (I-1)(J-1). If we had 5 different buttons and 2 responses this would mean that we need 4 odds ratios to explain all the associations. Add a few more variables with a few levels each and you’ll have a lot of odds ratios to take into consideration.

2x2xK tables
Speaking of which, what if we have more than one element we wanted to vary in our test? Say we also wanted to vary the text size in the product description. We could represent this as a 3-dimensional matrix, but things get tricky when you have 4 or 5 or 20 variables.

Instead we partition the tables and speak of marginal vs conditional odds ratios. If we for instance had X=button, Z=textsize and Y=converted we could decide to investigate the XY association while controlling for Z. We then get one 2×2 table for each level of Z.
The XY conditional odds ratios would then look like

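$$\theta_{XY(k)} = \frac{\pi_{11k}\,\pi_{22k}}{\pi_{12k}\,\pi_{21k}}$$

The marginal odds ratios would be

$$\theta_{XY} = \frac{\pi_{11+}\,\pi_{22+}}{\pi_{12+}\,\pi_{21+}}$$

Here πijk is the probability of cell (i,j) at level k of Z, and a + means that we sum over that index.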

If X and Y are independent at level k of Z they are said to be conditionally independent at level k. When X and Y are independent for all levels k of Z they are said to be conditionally independent given Z. That is, X and Y have no association given the value of Z. You may be tempted to draw the conclusion that this naturally must lead to X and Y being independent in the marginal case as well, but this conclusion does not hold. X and Y can be marginally dependent while they are conditionally independent for all values of Z, and vice versa. This may seem counter-intuitive but I will talk more about this in a later post where I will talk about the importance of segmentation in web analytics and statistics.
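A contrived numeric example shows how this can happen. The counts below are entirely made up: within each text size the button has no effect at all, yet the collapsed table suggests a strong association because the blue button was mostly shown together with the text size that converts well.

```python
def odds_ratio(table):
    """Sample odds ratio of a 2x2 table [[n11, n12], [n21, n22]]."""
    (n11, n12), (n21, n22) = table
    return (n11 * n22) / (n12 * n21)

# One 2x2 table per level of Z (text size).
# Rows: blue, red. Columns: converted, not converted.
tables_by_textsize = [
    [[90, 10], [9, 1]],    # level 1: both buttons convert at 90%
    [[1, 9], [10, 90]],    # level 2: both buttons convert at 10%
]

conditional = [odds_ratio(t) for t in tables_by_textsize]

# Collapse over Z by summing the tables cell by cell.
marginal_table = [
    [sum(t[i][j] for t in tables_by_textsize) for j in range(2)]
    for i in range(2)
]
marginal = odds_ratio(marginal_table)

print(conditional, marginal_table, marginal)
```

Note that the imbalance here comes from the button not being assigned independently of text size, which is one more argument for proper randomization.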

Final comments:
This post has given you a quick and easy way to find associations between two variables, but for larger and more complex tests these methods become cumbersome and impractical.

If there are two things I really want you to take away from this post they are:

1) Confidence intervals: Make love to them
2) Significance without magnitude is meaningless

In the next post I will go through how to handle more general cases of binary modelling using logistic regression models. These models make it easy for us to model the effect of many variables, and the interaction between them!

Multivariate testing part I: The data

February 21st, 2011 § 1 comment

A/B and multivariate testing; one of the most powerful and frequently used tools of a web analyst. But what the heck goes on behind the scenes and why does it work? This three part series will discuss the data, the hypothesis tests, regression models and how to model different types of data.

You should probably have some knowledge of statistics or probability theory to fully appreciate this post. If you find yourself scratching your head about something, feel free to e-mail me your questions or write a comment. This post also assumes you are familiar with web analytics and the purpose of multivariate testing in a web environment.

One of the problems with multivariate testing in the web analytics community is that it has become synonymous with “optimizing conversion rate”. This is probably one of the reasons why conversion rate hysteria flourishes among web analysts and e-commerce CEOs. Even though we’ll be talking about testing conversion rate in this series (go figure) we’ll move on to testing other responses such as # products in a basket, average order value, loading time of page, # uploaded pictures to a gallery etc. in later posts. But before we begin testing and modeling the data, we need to understand it…

Contingency tables
Throughout this series I will be using an example where the web analyst Klas wants to run an A/B test to see if there is a difference in the conversion rate by changing the color of the buy button from red to blue. He sets up the test by randomly showing the blue button to 50% of the visitors and the red button to the other 50%. It doesn’t matter what proportions you choose, but it is important to keep the assignment random. For example, you don’t want to show the red button 80/20 to females and 20/80 to males, since this creates an association between button occurrence and gender which could cause problems if we don’t factor gender into the model.

We can show the results from this test in a contingency table

To discuss contingency tables more efficiently we should go over some basic statistics lingo. We denote the column variable (converted: yes/no) as Y and the row variable (button:blue/red) as X. These are called categorical variables and differ from numerical variables in that their values have no meaningful numerical interpretation. A categorical variable can take on several levels, which are represented by the rows and columns in the contingency table.

The cell frequencies are denoted as nij where i represents the row number and j the column number. To clarify, n11 is the number of visitors who saw the blue button and converted, while n22 is the number of visitors who saw the red button and didn’t convert. We write the total count of visitors seeing the blue button as n1+ , which means “the sum over all columns for row 1”.

The table with these notations looks like

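Contingency table

                 Converted: yes   Converted: no   Total
Button: blue     n11              n12             n1+
Button: red      n21              n22             n2+
Total            n+1              n+2             n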
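As a minimal sketch of this notation in Python, the cell counts can be stored as a nested list and the totals computed from it. The counts below are made up for illustration and are not results from Klas’s test:

```python
# Hypothetical cell counts n_ij: rows are button colors (X), columns are outcomes (Y).
row_labels = ["blue", "red"]
col_labels = ["converted", "not converted"]
counts = [
    [30, 970],   # n11, n12
    [20, 980],   # n21, n22
]

# Row totals n_i+ : the sum over all columns for row i.
row_totals = [sum(row) for row in counts]
# Column totals n_+j : the sum over all rows for column j.
col_totals = [sum(col) for col in zip(*counts)]
# Total sample size n.
total = sum(row_totals)

for label, row, row_total in zip(row_labels, counts, row_totals):
    print(f"{label:>5}: {row}  row total = {row_total}")
print(f"column totals = {col_totals}, n = {total}")
```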

Still with me? Let’s start filling this bad boy with real data!
When a visitor arrives at your page one of two things can happen: she converts or she doesn’t convert. A single “experiment” like this is called a Bernoulli trial, and the data, which can only take two values such as true or false, yes or no, converted or didn’t convert, is called binary. After any given number of visits you will have a count of how many of these visitors converted. This count has an uncertainty to it even if the true conversion rate is known, and can be modeled by a probability distribution, the binomial distribution. For those of you who don’t know what a probability distribution is I think Wikipedia can explain it as well as anyone so I’ll shamelessly send you there!

The binomial distribution
The binomial distribution lets us model count data and determine the probability of a certain outcome; for example the probability of us seeing 10 conversions after 1000 visits with a 2% conversion rate.

The binomial distribution’s probability mass function (pmf, which is basically the discrete equivalent of a continuous distribution’s density function) has the parameters n = sample size and p = probability of success (conversion rate)

binomial pmf
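For reference, the standard textbook form of the binomial pmf is written out below in LaTeX. Here k is the number of conversions, and K is just a label introduced here for the random count:

```latex
P(K = k) = \binom{n}{k} \, p^{k} (1 - p)^{n - k}, \qquad k = 0, 1, \ldots, n
```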

We can plot the function with parameters n=1000 and p=0.01 against k=1,2,…,20 which will show you how probable different values of k will be for the given parameters
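A rough way to reproduce these numbers without a plotting library is sketched below. The function name `binomial_pmf` and the text bars are illustrative choices, not part of the original post:

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# The parameters mentioned above: n=1000, p=0.01, k=1..20.
n, p = 1000, 0.01
for k in range(1, 21):
    prob = binomial_pmf(k, n, p)
    bar = "#" * round(prob * 200)  # crude text bar, scaled only for display
    print(f"k={k:2d}  P={prob:.4f}  {bar}")

# The earlier example: 10 conversions after 1000 visits at a 2% conversion rate.
print(f"P(10 conversions | n=1000, p=0.02) = {binomial_pmf(10, 1000, 0.02):.4f}")
```

The longest bar marks the most probable number of conversions for the given n and p.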

Quiz: How do you calculate the conversion rate?
Silly question really, I bet every one of you knows that you divide the number of conversions by the total number of visits, right? But what you are doing here is something much cooler: you’re picking the p that maximizes the binomial pmf, given the sample size and the number of conversions! It makes sense, doesn’t it, choosing the p that maximizes the probability of observing the outcome we observed?

Maximum-Likelihood
To find the p that maximizes the pmf we can set the derivative of the mass function equal to 0 and solve for p. Since the p that maximizes the mass function also maximizes the logarithm of the function, and since the logarithm is simpler to compute, we’ll maximize the log of the pmf instead:
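Spelled out, with f(p) standing for the log of the pmf and k for the observed number of conversions, the standard derivation goes roughly like this (the hat on p is just notation for the estimate):

```latex
\begin{aligned}
f(p) &= \log\binom{n}{k} + k \log p + (n - k)\log(1 - p) \\
f'(p) &= \frac{k}{p} - \frac{n - k}{1 - p} = 0 \\
\hat{p} &= \frac{k}{n}
\end{aligned}
```

So the maximizing p is exactly the familiar conversions-divided-by-visits.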

By plotting the function f(p) for n=1000 and k=20 you can visually confirm this

log of binomial function

This estimation method is, not entirely surprisingly, called maximum-likelihood estimation. The estimate of the parameter p is of course also a random variable, but one of the great properties of the ML method is that, for large samples, the resulting estimator is approximately normally distributed. The normal is great to work with for several reasons but one of the things I love about it is that most of you are familiar with its shape and have probably encountered it at some point.

We can get a visualization of the ML estimation of p by constructing a binomial random variable with a given n and p, then drawing 10000 samples from it and estimating p with the ML method for each sample. You will then get 10000 estimated values of p which we can plot in a histogram to get an idea of how the estimate is distributed
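A minimal sketch of this simulation using only the Python standard library follows. The values n=1000 and p=0.02 are assumptions for illustration, and the text histogram stands in for a proper plot:

```python
import random
from statistics import mean, stdev


def simulate_ml_estimates(n, p, num_samples, seed=0):
    """Draw num_samples binomial(n, p) counts and return the ML estimate k/n for each."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(num_samples):
        # One binomial draw = number of successes in n Bernoulli trials.
        k = sum(1 for _ in range(n) if rng.random() < p)
        estimates.append(k / n)
    return estimates


# Assumed parameters: n=1000 visits per sample and a true conversion rate of 2%.
estimates = simulate_ml_estimates(n=1000, p=0.02, num_samples=10_000)
print(f"mean of estimates:  {mean(estimates):.4f}")
print(f"stdev of estimates: {stdev(estimates):.4f}")

# Crude text histogram: one row per distinct estimate, one '#' per 20 samples.
histogram = {}
for estimate in estimates:
    histogram[estimate] = histogram.get(estimate, 0) + 1
for estimate in sorted(histogram):
    print(f"{estimate:.3f} {'#' * (histogram[estimate] // 20)}")
```

Lowering p in the call makes the skew discussed next easier to see.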

Looking at this histogram you can ask yourself what happens to the symmetry of the normal distribution when p is even smaller, like 0.5-1%? The short answer is that it gets increasingly skewed as p approaches 0, and the normal approximation doesn’t work all that well. This is really not uncommon for the type of conversion rates you see when working with ecommerce purchases and even at 2% like the histogram above you clearly see that the distribution is skewed. You need to keep this in mind when working with the estimation of conversion rates, but you can still generally apply the normal approximation unless your conversion rate is really low.

All from one – joint probabilities
Until now we’ve looked at individual cell frequencies, but we may also want to look at the joint probability distribution of multiple cell frequencies. This models multiple cell frequencies as generated from a single random variable.

The binomial distribution only handles response variables which are binary, and now we want 4 different values to be generated from it. So we need a more general distribution which doesn’t limit the number of possible outcomes to 2.

Let pij denote the probability of an event resulting in the cell at row i and column j, in other words

$$p_{ij} = P(X = i, Y = j)$$

With the restriction

$$\sum_{i} \sum_{j} p_{ij} = 1$$

The cell frequencies {nij} for i=1,2 and j=1,2 follow a multinomial distribution with parameters {pij}. You can think of a multinomial distribution as a binomial distribution, but instead of only having 2 possible outcomes the multinomial has k possible outcomes. So for example if you flip a coin 100 times, the number of heads will be binomially distributed. If you roll a dice, which has 6 possible outcomes, 100 times the number of sixes will be… binomially distributed as well! But the total outcome {n1, n2,…,n6} will follow a multinomial distribution with the parameters n=100 and p={1/6, 1/6, 1/6, 1/6, 1/6}. Wait what? There are 6 possible outcomes but only 5 probability parameters? This is because we have a restriction on these probabilities, in that they must sum to 1. This makes one of the parameters redundant, since it is entirely determined by the rest of the parameters.
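The dice example can be sketched in a few lines of Python. The helper `multinomial_pmf` is a hypothetical name for a straightforward implementation of the textbook multinomial formula, not a library function:

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Probability of observing exactly these counts, one per outcome, in sum(counts) trials."""
    n = sum(counts)
    # Multinomial coefficient n! / (n1! * n2! * ... * nk!), kept as an exact integer.
    coefficient = factorial(n)
    for count in counts:
        coefficient //= factorial(count)
    return coefficient * prod(p**c for p, c in zip(probs, counts))


# Only 5 free parameters for a fair dice: the sixth follows from the sum-to-one restriction.
free_probs = [1 / 6] * 5
probs = free_probs + [1 - sum(free_probs)]

# Probability that 6 rolls show each face exactly once.
print(multinomial_pmf([1, 1, 1, 1, 1, 1], probs))

# With only 2 outcomes the multinomial reduces to the binomial.
print(multinomial_pmf([3, 7], [0.2, 0.8]))
```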

The multinomial distribution is one of those distributions which you should know a little bit about, if nothing else to be able to impress girls by giving them the probability of certain events when rolling a dice multiple times. That will surely set the mood.


Model selection
Alright cool, so we model the individual cell frequencies as binomial random variables and the joint probabilities as multinomial? Nope, not necessarily. One thing you need to know about statistics is that there is usually more than one way to model the data. The probability distribution of the cell frequencies will depend on how you set up the test, what you want to model and how you sample the data. Yay for uncertainty!

Say we’re setting up the test and we know which factors we are interested in and what the response is (conversion rate), and we will be logging the results in a contingency table as shown in figure 1. At this point the sample size n is unknown, and is thus a random variable. In this scenario we can’t model the cell frequencies as binomial yet; instead we can model them as independent Poisson random variables! The Poisson model views the sample size n as random rather than fixed, while the binomial and multinomial distributions require n to be fixed. We’ll get back to the Poisson distribution in more detail in later posts, but it is good to know that if the cell counts Y are independent Poisson random variables, then Y conditioned on the total sample size N, P(Y | N=n), is multinomial.
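This relationship can be checked numerically for the simplest case of two cells, where the multinomial is just a binomial. The rates below are hypothetical expected counts, picked only for the check:

```python
from math import comb, exp, factorial


def poisson_pmf(k, rate):
    """Probability that a Poisson variable with the given rate equals k."""
    return rate**k * exp(-rate) / factorial(k)


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical expected counts for two cells, e.g. converted / not converted.
rate_1, rate_2 = 3.0, 7.0
n = 10  # condition on the two counts adding up to n

# P(Y1 + Y2 = n): a sum of independent Poisson variables is Poisson with the summed rate.
p_total = poisson_pmf(n, rate_1 + rate_2)

for k in range(n + 1):
    # P(Y1 = k | Y1 + Y2 = n) computed from the independent Poisson model.
    conditional = poisson_pmf(k, rate_1) * poisson_pmf(n - k, rate_2) / p_total
    # The same probability under a binomial with p = rate_1 / (rate_1 + rate_2).
    binomial = binomial_pmf(k, n, rate_1 / (rate_1 + rate_2))
    print(f"k={k:2d}  conditional={conditional:.6f}  binomial={binomial:.6f}")
```

The two columns agree up to floating-point rounding, which is the two-cell version of the statement above.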

Now consider the case where we have already run the test and have a full contingency table as in figure 2. The number of visits, n, is now known and considered fixed. We can choose to view the row and column totals as random, while the sample size n is fixed. We can then model this data as coming from a multinomial distribution with I*J parameters (in our case 2*2=4 parameters).

We can also choose to view the row or column totals as fixed. In this case we look at the conditional probability at different levels of the factors. So for example if we decide to dismiss the randomness of the explanatory variable X and view it as fixed, we will have I multinomial distributions with J parameters. Again, in our case we would have I=2 distributions with J=2 parameters, so we would be looking at 2 binomial distributions. The joint probability in this model would be the product of the 2 binomial distributions.
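As a small sketch of that product, assuming fixed row totals and one conversion rate per button (all counts and rates below are hypothetical):

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def joint_probability(conversions, row_totals, rates):
    """Product of one binomial per row; rows are treated as independent given fixed row totals."""
    result = 1.0
    for k, n, p in zip(conversions, row_totals, rates):
        result *= binomial_pmf(k, n, p)
    return result


# Hypothetical data: blue converts 30 of 1000, red converts 20 of 1000,
# evaluated at assumed conversion rates of 3% and 2%.
print(joint_probability([30, 20], [1000, 1000], [0.03, 0.02]))
```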

Another way to do it is to run the test for say a week. We then sample 1000 visits that converted and 1000 visits that didn’t convert and look at the number of times the blue and red buttons were shown for each segment. This turns things around a bit: instead of looking at the conversion as the response variable we’re actually looking at the button color as the response. By the sampling design the column totals are fixed, so we have J multinomial distributions with I parameters. Again, in our case I=2 and J=2 so we have 2 binomial distributions but this time with a different response. The problem with this sampling design is that without the overall conversion rate of the population we can only give information on the probability of X given Y and not the more useful case of Y given X. However if we know the overall probability of converting, P(Y) unconditioned, we can actually get the reverse conditional probability as well by Bayes’ theorem

bayes theorem
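In the notation used so far, the standard statement of Bayes’ theorem for this case reads as follows in LaTeX, with the denominator expanded over the levels of Y (the index j' is just a summation label):

```latex
P(Y = j \mid X = i)
  = \frac{P(X = i \mid Y = j)\, P(Y = j)}
         {\sum_{j'} P(X = i \mid Y = j')\, P(Y = j')}
```

The sampling design gives us P(X | Y), and the overall conversion rate supplies P(Y), which is everything the right-hand side needs.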

As you can see there are several ways that we can conduct this experiment and several sampling models to choose from, so which one should we choose?

For our case of examining the difference in conversion rates between the blue and red buttons I would choose a model where we regard the row totals as fixed and the total sample size as fixed. This way we get two separate binomial distributions, one for the red and one for the blue button which is also very intuitive. Besides, since we control the button color the randomness of X is really not meaningful for us. This is the most common model to use when working with experiments like this.

You should also know that you don’t have to set it up like a contingency table, but what I’ve said still applies. Contingency tables are just a good way of summarizing results, and are easy and intuitive to work with. It’s also easy to do inference with a contingency table, but I’ll save that for the next post!

I bet you can’t wait…

Yet another web analytics blog

February 19th, 2011 § 0 comments

This will, without a doubt, be one of the most boring blogs on the internet. No really, I’m serious. I dare you to actively read this blog for 6 months. Your soul will explode.

There are a lot of great blogs about web analytics, one of my personal favorites being the very inspiring Occam’s Razor by Avinash Kaushik. Most of these blogs are great at capturing the business side of analytics and will help you evolve your qualitative web analytic skills and creative thinking, and get you a long way in web analytics.

This blog won’t be about the fancy front-end side of web analytics. No siree bob, this blog will be about the gritty behind-the-scenes theory of the data and the methods we use. And that’s my bounce rate skyrocketing right there!

In all seriousness, the aim of this blog is to cater to web analysts who wish to learn more about statistics relating to web analytics and to get ideas of more statistically complex methods to use when creating value for web companies and their users. Some of the posts and methods I will talk about don’t necessarily fall into the category of your typical web analytics, but that’s perfectly fine. You can take whatever you want from this blog and incorporate it in your web analytics process.


Enjoy… At least try to!

My twitter

Contact

mike @ dorado.se