Posted by Intent Media 25 Nov

Designing Experiments with Continuous Inputs


We are continually running experiments to optimize publisher revenue and user experience. Many of these are A/B tests that involve two or more discrete alternative treatments. Others are experiments that involve a continuous set of possible quantitative treatments. Here, we will consider the latter case.

Example Application

As an example application, consider bidding in a sequence of auctions for a set of similar items. The more that one bids in a given auction, the more likely one will be to win it. But the result of a given auction may depend on the bidders involved and on variations between the items in the auction. So our observation of a given auction result may be a noisy one.

Let’s say that we want to model the probability of winning the auction as a function of the bid: $P(Win \sim Bid)$. Running experiments in this situation will cost money, as we will need to place a number of possibly suboptimal bids to learn this function. So we would like to run as few experiments as possible.

One way we could approach this problem is to discretize the bid space. By doing so, the resulting experiment becomes similar to the A/B tests mentioned above: we could divide the bid space into a number of bins and run experiments until we know the value of $P(Win \sim Bid)$ for each bin with confidence.
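As a quick illustration, here is a sketch of the binned approach on simulated auction data. The "true" sigmoid win curve, the bid range, and all parameters below are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" auction model (an assumption for this sketch):
# win probability is a sigmoid in the bid, centered at bid = 5.
def true_p_win(bid):
    return 1.0 / (1.0 + np.exp(-(bid - 5.0)))

# Place bids uniformly at random across an assumed input space [0, 10].
bids = rng.uniform(0.0, 10.0, size=5000)
wins = rng.random(5000) < true_p_win(bids)

# Discretize into 10 equal-width bins and estimate P(Win) separately
# per bin, A/B-test style, with no shared structure between bins.
edges = np.linspace(0.0, 10.0, 11)
idx = np.digitize(bids, edges[1:-1])   # bin index 0..9 for each bid
p_hat = np.array([wins[idx == k].mean() for k in range(10)])

for k in range(10):
    print(f"bid in [{edges[k]:.0f}, {edges[k+1]:.0f}): P(Win) ~= {p_hat[k]:.2f}")
```

Note that each bin needs enough observations on its own, which is exactly why this approach can require many experiments.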

Alternatively, if we have good reason to believe that our chance of winning will follow a sigmoid curve, we may be able to specify a logistic regression fit for $P(Win \sim Bid)$:

$\quad P(Win) = \dfrac{1}{1+e^{-(\beta_0 + \beta_1 \cdot Bid)}}$

This will add structure to the problem and limit the number of experiments required. So the question becomes: what bid levels will let us efficiently fit a logistic regression model $P(Win \sim Bid)$? It may not be immediately obvious whether we should follow a strategy of spreading our bids evenly across the input space or focus on a few specific experimental points that will help us learn the model most quickly.
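For concreteness, here is a minimal sketch of fitting such a logistic regression by maximum likelihood on simulated auction data, using Newton's method in plain NumPy. The true parameter values, bid range, and sample size are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated auctions (assumed): true parameters beta0 = -5, beta1 = 1.
bids = rng.uniform(0.0, 10.0, size=2000)
p_true = 1.0 / (1.0 + np.exp(-(-5.0 + 1.0 * bids)))
wins = (rng.random(2000) < p_true).astype(float)

# Fit P(Win ~ Bid) by maximum likelihood via Newton's method (IRLS).
X = np.column_stack([np.ones_like(bids), bids])  # intercept + bid
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = p * (1.0 - p)                  # per-observation weights p_i (1 - p_i)
    grad = X.T @ (wins - p)            # gradient of the log-likelihood
    hess = X.T @ (X * W[:, None])      # observed information matrix
    beta += np.linalg.solve(hess, grad)

print("estimated (beta0, beta1):", beta)  # should land near (-5, 1)
```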

Design Criterion

D-Optimality is a widely used criterion for statistical experiment design and corresponds to minimizing the determinant of the model error covariance matrix $E$. By minimizing $\det E$, we minimize the volume of the confidence ellipsoids surrounding our parameter estimates.

For a linear regression model $y = \beta x + \epsilon$, with the $v_i$ being the experimental input vectors and $V$ the matrix formed by vertically stacking the $v_i$, the corresponding error covariance matrix is $E = (V^T V)^{-1}$.

As specified in Boyd & Vandenberghe, given a set of $p$ candidate measurement vectors $v_i \in \mathbb{R}^n$, the D-Optimal experiment design problem for linear regression can be written as:

$\quad \text{minimize} \quad \log \det \left( \sum_{i=1}^p \lambda_i v_i v_i^T \right)^{-1}$

$\quad \text{subject to} \quad \lambda_i \ge 0, \quad \mathbf{1}^T \lambda = 1$

where $\lambda_i$ corresponds to the weight placed on experimental vector $v_i$.

For the one-dimensional linear regression case, $v_i \in \mathbb{R}$, it turns out that the D-Optimal experiment design corresponds to placing the experiment points at the extreme ends of the input region, half at each end. Intuitively, if one is measuring the slope of a line with noisy measurements of $y$ at each measurement point $x$, the best way to do it is by taking measurements at the two endpoints of the line, which are affected most by the line’s slope and so have a higher signal-to-noise ratio than the interior points. O’Brien & Funk offer additional discussion of the intuition behind this.
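This endpoint result is easy to check numerically. The sketch below compares $\det E$ for an endpoint design against an evenly spread design, for a line with an intercept over an assumed input region $[-1, 1]$ and an assumed budget of 10 measurements:

```python
import numpy as np

n = 10  # experiment budget (an assumption for this sketch)

def det_error_cov(xs):
    # Error covariance for y = b0 + b1*x + noise is E = (V^T V)^{-1},
    # where V vertically stacks the row vectors v_i = (1, x_i).
    V = np.column_stack([np.ones_like(xs), xs])
    return np.linalg.det(np.linalg.inv(V.T @ V))

endpoints = np.array([-1.0] * (n // 2) + [1.0] * (n // 2))
uniform = np.linspace(-1.0, 1.0, n)

print("det E, endpoint design:", det_error_cov(endpoints))
print("det E, uniform design: ", det_error_cov(uniform))
# The endpoint design yields the smaller determinant, i.e. it is D-Optimal.
```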

For a logistic regression model $y = \dfrac{1}{1+e^{-(\beta x + \epsilon)}}$, the corresponding error covariance matrix is $E = (V^T W V)^{-1}$, where $W = \mathrm{diag}(\hat{p}_i (1-\hat{p}_i))$ and $\hat{p}_i$ is the estimate of $P(Win)$ for observation $i$.

The auction scenario described above corresponds to the one-dimensional logistic regression case, $v_i \in \mathbb{R}$. Here, the ends of the input region may have $P(Win) = 0$ or $P(Win) = 1$ and will not give us much useful information about the sigmoid shape in between. In other words, their usefulness as extreme values in the input region is drowned out by their high variance. In the linear regression case this was not an issue, as variance was assumed to be constant across the input region. It turns out that the D-Optimal experiment design corresponds to placing half of the experiment points where $P(Win) = 17.6\%$ and the other half where $P(Win) = 82.4\%$ (see e.g. Dror & Steinberg).
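Given point estimates of the parameters, these two bid levels fall out of inverting the sigmoid. A minimal sketch, where the parameter values are illustrative assumptions:

```python
import math

# Assumed prior point estimates of the logistic parameters.
beta0, beta1 = -5.0, 1.0

def bid_for_p(p):
    # Invert p = 1 / (1 + exp(-(beta0 + beta1 * bid))) for the bid:
    # bid = (logit(p) - beta0) / beta1
    return (math.log(p / (1.0 - p)) - beta0) / beta1

low, high = bid_for_p(0.176), bid_for_p(0.824)
print(f"place half the bids near {low:.2f}, half near {high:.2f}")
```

Since $0.824 = 1 - 0.176$, the two bid levels sit symmetrically around the bid where $P(Win) = 50\%$.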

Optimization Under Uncertainty

Of course, $P(Win)$ is the unknown that we are looking to model in the first place! So our experiment design depends on unknown values, an issue that we did not have in the linear regression case. As another way to see this, note that for the logistic regression model, the error covariance matrix $E$ depends on the estimated parameters, whereas for linear regression it does not.

There are a number of ways around this uncertainty issue in practice, including:
1. Specifying Bayesian priors for the unknown parameters, as in Chaloner & Larntz, though this can be computationally expensive.
2. Sweeping the parameters across their feasible regions and choosing the centroid value for each of the experiment support points. This may work.
3. Sweeping the parameters across their probable values, generating the optimal experimental points in each case, K-means clustering based on those points, and proceeding with the experiment, as in Dror & Steinberg. The optimal experimental points are generated by solving the convex optimization problem listed above. This approximates approach 1, with the advantage of being computationally simpler while remaining quite accurate. Notably, in a one-dimensional logistic regression example they find that $K \ge 7$ support points give reasonable coverage of the input space.
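Approach 3 can be sketched in a few lines of NumPy. The parameter ranges, the number of draws, and the basic Lloyd's K-means below are illustrative assumptions, not the exact procedure of Dror & Steinberg:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed plausible ranges for the unknown logistic parameters.
b0_draws = rng.uniform(-6.0, -4.0, size=200)
b1_draws = rng.uniform(0.8, 1.2, size=200)

def optimal_bids(b0, b1):
    # D-Optimal support points: the bids where P(Win) = 17.6% and 82.4%.
    c = np.log(0.824 / 0.176)  # logit(0.824)
    return np.array([(-c - b0) / b1, (c - b0) / b1])

# Pool the optimal bid levels across all parameter draws...
points = np.concatenate([optimal_bids(b0, b1)
                         for b0, b1 in zip(b0_draws, b1_draws)])

# ...then reduce them to K = 7 support points with basic Lloyd's K-means.
K = 7
centers = np.quantile(points, np.linspace(0.05, 0.95, K))  # spread-out init
for _ in range(50):
    labels = np.argmin(np.abs(points[:, None] - centers[None, :]), axis=1)
    centers = np.array([points[labels == k].mean() if np.any(labels == k)
                        else centers[k] for k in range(K)])

print("experiment support points:", np.sort(np.round(centers, 2)))
```

The resulting centers are the bid levels at which to actually run the experiment, giving coverage across the plausible parameter values rather than betting everything on a single point estimate.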

Depending on the cost of running a suboptimal experimental design versus the expense of setting it up, one may not want to implement all of the details required to calculate optimal experimental values as described above. However, the theory is useful as a guide and leads to some common-sense advice for setting up a relatively efficient experiment in the 1-D logistic regression case:
* Aim to have enough points in the support set of experimental vectors to give good coverage of the space of possible optimal values.
* If there is enough prior information, target experimental points where $P(Win) \approx 17.6\%$ or $P(Win) \approx 82.4\%$ across the plausible values of the slope and intercept parameters.
* If there is enough prior information, avoid points where $P(Win) \approx 0\%$, $50\%$, or $100\%$, as these will yield relatively little information.

Jon Sondag
Data Scientist
