- Posted by Intent Media 16 Jan
Most A/B testing happens within the classical statistical framework of p-values, statistical significance, and confidence intervals. More recently, though, there has been much interest in using Bayesian inference techniques to set up and interpret A/B tests. Terms such as "Bayesian A/B tests" and "Bayesian bandits" have been used to describe these techniques. I will not reinvent the wheel here; as an introduction to the topic, I refer you to other sources (linked at the end of this post). In particular, I would recommend Sergey's post as a minimum prerequisite for following the rest of this post.
The most common example discussed in the context of Bayesian A/B tests is that of binary metrics such as click-through rate (CTR) and conversion rate, which is natural given how ubiquitous these metrics are. The case where ads are either clicked or not, or where users either convert or don't, can be modelled as independent Bernoulli trials, and the likelihood of the data under this model follows a binomial distribution. The beta distribution comes in handy as a conjugate prior, and the story is soon wrapped up, as demonstrated in Sergey's post.
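To make the beta-binomial machinery concrete, here is a minimal sketch of that standard setup (the click and page view counts are made-up illustrative numbers, not real data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: clicks out of page views for two variants.
clicks_a, views_a = 120, 1000
clicks_b, views_b = 100, 1000

# Uniform Beta(1, 1) prior; by conjugacy the posterior over CTR is
# Beta(1 + clicks, 1 + non-clicks) for each variant.
post_a = rng.beta(1 + clicks_a, 1 + views_a - clicks_a, size=100_000)
post_b = rng.beta(1 + clicks_b, 1 + views_b - clicks_b, size=100_000)

# Monte Carlo estimate of P(CTR_A > CTR_B): pair the samples and count.
p_a_better = (post_a > post_b).mean()
print(round(p_a_better, 2))
```

With 12% versus 10% observed CTR on 1,000 page views each, the posterior probability that A beats B comes out around 0.9; the same pair-and-count trick reappears below for the CPPV case.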
Metrics, however, do not exist in a vacuum and must be tailored to the way users interact with the product in question. As a result, one often ends up with a mix of standardized metrics that everyone understands, such as conversion rate and ROI, and a few custom metrics that are unique to the product.
In the rest of this post, I introduce one of the key metrics here at Intent Media – clicks per page view (CPPV) – which does not have a Bernoulli distribution. I show how the Bayesian A/B testing framework can be adapted to this metric.
Our Metric: Clicks per page view (CPPV)
A typical Intent Media page view in response to a user's search for a flight might look like this.
Each page view has “impressions” from multiple advertisers and a user is encouraged to comparison shop between any combination of advertisers. So a given page view could result in anything from 0 up to N clicks, where N is the number of ads in each page view (typically N ranges from 3 to 5). Since the advertiser pays for each click received, the number of clicks per page view (CPPV) is one of our key metrics.
Clearly CPPV does not follow a Bernoulli distribution. One can certainly derive a transformation that does. For instance, if we instead measured whether a page view received at least 1 click, that would have a Bernoulli distribution. While useful, this metric fails to capture the number of clicks received conditional on receiving at least 1 click, and is therefore inadequate. So we are back to working with CPPV. But is having to work with CPPV really that bad? Not really.
CPPV as a categorical distribution
Consider a die roll. Receiving up to N clicks can be thought of as a roll of a loaded die: just as each roll results in one of 6 faces, each page view results in one of N + 1 outcomes (0 through N clicks). Given K rolls (page views), the distribution of outcomes follows the well-known and widely studied multinomial distribution. This means that the likelihood function, i.e. Prob(Data | model parameters), can be modelled as coming from a multinomial distribution. Note that for this to work we must make the (very reasonable) assumption that at most one click per ad in a page view is considered valid.
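The sufficient statistic for this multinomial likelihood is simply the count of page views in each click category. A tiny sketch, using made-up click data with N = 3 ads:

```python
import numpy as np

# Hypothetical clicks observed on 10 page views, each showing N = 3 ads,
# so each page view falls into one of N + 1 = 4 categories (0..3 clicks).
clicks_per_view = np.array([0, 1, 0, 2, 0, 0, 3, 1, 0, 1])

# Category counts: how many page views received 0, 1, 2, 3 clicks.
counts = np.bincount(clicks_per_view, minlength=4)
print(counts)  # [5 3 1 1]
```

These per-category counts are all the inference below needs; the individual page views can be discarded.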
As I mentioned in passing earlier, working with Bernoulli-distributed metrics is convenient since the resulting likelihood function – a binomial distribution – has a conjugate prior in the beta distribution. The resulting posterior, also a beta distribution, is analytically tractable and easy to sample from. Fortunately, the picture doesn't change much with a multinomial likelihood.
Enter the Dirichlet distribution. Much as the beta distribution is the conjugate prior for the binomial, the Dirichlet distribution is the conjugate prior for the multinomial. And this is not at all incidental, since the Dirichlet is itself the multivariate generalization of the beta distribution.
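Conjugacy makes the posterior update trivial: starting from a Dirichlet prior, you just add the observed category counts to the prior's parameters. A sketch, reusing the hypothetical counts from above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical category counts: page views with 0, 1, 2, 3 clicks.
counts = np.array([5, 3, 1, 1])

# Symmetric Dirichlet(1, ..., 1) prior; by conjugacy the posterior is
# Dirichlet(prior + counts) -- the update is just elementwise addition.
alpha_post = np.ones(4) + counts

# Each posterior sample is a probability vector over the click categories;
# the CPPV it implies is the dot product with the click values 0..3.
samples = rng.dirichlet(alpha_post, size=50_000)
cppv = samples @ np.arange(4)
print(round(cppv.mean(), 2))  # ~1.0, the posterior mean CPPV
```

Sampling from the posterior is one line, which is exactly what makes the pair-and-count estimate of P(A > B) so cheap.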
Bayesian inference for CPPV
And the rest, as they (never) say, works by analogy to the beta-Bernoulli case. Here is some Python code that demonstrates this. Imagine we have two variants, A and B. The goal of inference is to estimate the probability that A is better than B, i.e. P(A > B). One can sometimes derive an analytical closed-form solution from first principles: if A and B have known functional forms, P(A > B) is the area under the joint density in the region where A > B. More often than not, though, analytical solutions are not available (no closed functional form, or messy integrals), and sampling methods save the day. So I instead sample from A and B, pair the samples, and empirically count the fraction of pairs where A > B. (How exactly you pair them doesn't matter, since the samples are independent.)
The code is set up as follows.
- I track the performance of A and B through (N + 1)-dimensional vectors (one per variant) that contain the number of page views with 0 up to N clicks.
- To start with, the page view counts for A and B are identical, and as one would expect, inference gives us P(A > B) = 0.5, i.e. A and B are evenly placed.
- To illustrate how P(A > B) changes, I keep A constant and incrementally make B worse. At each increment, I estimate P(A > B) and observe that it converges to 1: once B is "different enough" from A, we are all but certain that A is better.
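The steps above can be sketched as follows (the page view counts, the shift applied to B, and the number of posterior samples are illustrative assumptions, not Intent Media data):

```python
import numpy as np

rng = np.random.default_rng(0)

CLICK_VALUES = np.arange(4)  # page views fall into 0..3 click categories
SAMPLES = 50_000

def p_a_better(counts_a, counts_b):
    """Monte Carlo estimate of P(CPPV_A > CPPV_B) under Dirichlet posteriors."""
    a = rng.dirichlet(1 + counts_a, SAMPLES) @ CLICK_VALUES
    b = rng.dirichlet(1 + counts_b, SAMPLES) @ CLICK_VALUES
    return (a > b).mean()

counts_a = np.array([500.0, 300.0, 150.0, 50.0])
counts_b = counts_a.copy()

# Identical counts: inference should place the variants evenly, ~0.5.
estimates = [p_a_better(counts_a, counts_b)]

# Incrementally make B worse by shifting page views toward zero clicks.
for _ in range(4):
    counts_b += np.array([30.0, -10.0, -10.0, -10.0])
    estimates.append(p_a_better(counts_a, counts_b))

print([round(p, 3) for p in estimates])
```

The first estimate hovers around 0.5, and as B's clicks drain away, P(A > B) climbs toward 1, matching the convergence described above.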
This post dealt with how to do inference on data from an A/B test when the metric is not the standard bernoulli. In a future post, I will outline how we can use the estimate of P(A > B) in a real-world experimental setup.
- http://engineering.richrelevance.com/bayesian-ab-tests/
- http://developers.lyst.com/data/2014/05/10/bayesian-ab-testing/
- http://www.evanmiller.org/bayesian-ab-testing.html
Sharath Rao is a data scientist at Intent Media. His interests are in using machine learning to help people be better people and machines be better machines. You can find him tweeting and blogging about life and work.