• Posted by Intent Media 07 Oct

Random Assumptions will make an A– Part 2

Overview

In our previous post, Random Assumptions can make an A— Part 1, we discussed a solution we built to generate sticky random numbers. In this blog post we’ll examine in detail the bug we introduced, and how we worked our way to a final solution.

The Initial (not so good) solution

We wanted to generate several sticky random numbers for a user based off a user_id for our multivariate tests. We used a combination of the user_id, the multivariate test attribute id, and some salt to generate a seed for the Random generator. Our solution looked something like this:

    int seed = String.format("%s some salt", userId).hashCode();
    double diceRoll = new Random(seed).nextDouble();

Having implemented this we patted ourselves on the back and went for a beer.

A few weeks later we discovered that our solution was buggy. While analyzing some data, we found a high degree of covariance between our test attributes. What is covariance, you might ask? A formal definition is available on Wikipedia, of course, but we can demonstrate the problem with our original example.

We have a test with two designs, Design Awesome and Design Sweet, with a 50-50 split, and with two ad copies, “Look at me!” and “I am a sweet ad copy”. Given this test we would expect to see the following distribution of users:

| Design         | Ad Copy                | Expected % of Users |
|----------------|------------------------|---------------------|
| Design Awesome | "Look at me!"          | 25                  |
| Design Awesome | "I am a sweet ad copy" | 25                  |
| Design Sweet   | "Look at me!"          | 25                  |
| Design Sweet   | "I am a sweet ad copy" | 25                  |

We won’t get exactly 25% in each bucket, but it should be reasonably close. When we looked at the actual numbers, though, we were way off:

| Design  | Ad Copy   | Expected % of Users | Actual % of Users |
|---------|-----------|---------------------|-------------------|
| Awesome | "Look..." | 25                  | 36                |
| Awesome | "I am..." | 25                  | 12                |
| Sweet   | "Look..." | 25                  | 15                |
| Sweet   | "I am..." | 25                  | 37                |

And when we looked at tests with more attributes, the skew was even worse. We clearly had a new problem.

Random Number Basics

Remember that the Random class provided by Java is actually a pseudorandom number generator. It takes a seed number, and each successive call returns a “random” number that is really computed by applying a mathematical function to internal state derived from the seed and advanced on every call. There is a lot of math involved, but check out Wikipedia’s article on Random Number Generation if you are interested in learning more.
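To make the “pseudo” part concrete, here is a minimal sketch (our own illustration, not code from our system) showing that two generators built from the same seed walk through the exact same sequence:

```java
import java.util.Random;

public class SeededDemo {
    public static void main(String[] args) {
        // Two generators built from the same seed produce
        // identical "random" sequences, call for call.
        Random a = new Random(42L);
        Random b = new Random(42L);
        for (int i = 0; i < 5; i++) {
            double x = a.nextDouble();
            double y = b.nextDouble();
            if (x != y) {
                throw new AssertionError("sequences diverged");
            }
            System.out.println("call " + (i + 1) + ": " + x);
        }
    }
}
```

This determinism is exactly what makes seeded generators attractive for sticky assignments: replaying the same seed replays the same rolls.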

One of the most important features of Random is that its output has a uniform distribution, meaning the dice rolls are close to evenly divided across the possible values. For example, if you flipped a coin 100 times you would expect it to come up heads about 50 times (45 times would be perfectly acceptable). Similarly, if you called `Random.nextInt(2)` 100 times you would expect it to come up as `0` about 50 times. And it does.
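You can convince yourself with a toy experiment (added here for illustration; the seed is arbitrary):

```java
import java.util.Random;

public class CoinFlips {
    public static void main(String[] args) {
        Random rng = new Random(12345L); // arbitrary fixed seed for reproducibility
        int flips = 10_000;
        int heads = 0;
        for (int i = 0; i < flips; i++) {
            if (rng.nextInt(2) == 0) { // treat 0 as heads, 1 as tails
                heads++;
            }
        }
        // With a uniform generator we expect roughly 50% heads.
        System.out.println("heads: " + heads + " out of " + flips);
    }
}
```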

So of course our use should have a uniform distribution, right…?

The Assumption

So yeah, we were wrong, because we were using Random in an unintended way. Random does produce a uniform distribution across successive calls with a single seed, but the numbers it produces from seed to seed are NOT uniformly distributed. This may be a bit confusing at first, so it helps to visualize the problem.

Imagine all the random numbers generated by `Random` in one large table:

| Seed | first call | second call | third call | ... |
|------|------------|-------------|------------|-----|
| 1    | 0.5889789  | 0.78998897  | 0.0029884  | ... |
| 2    | 0.5889788  | 0.78998898  | 0.0029885  | ... |
| 3    | 0.5889787  | 0.78998899  | 0.0029886  | ... |
| 4    | 0.5889786  | 0.78998900  | 0.0029887  | ... |

(Note these values aren’t accurate; this is just for illustration purposes)

If you examine a row of values (left to right), the distribution appears uniform. However, if you examine a column of values (top to bottom), you find that the numbers are obviously not uniform.

This was the critical assumption that shot us in the foot. Because we were constantly switching the seed, we were effectively drawing from a column of values, not a row. Our seeds were also very close together, which threw our uniformity right out the window and caused the covariance we were seeing.
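The row-versus-column difference can be sketched in code (our own illustration; the exact values depend on Java’s underlying generator). The first draw for a given seed is a fixed, deterministic function of that seed, so by switching seeds on every roll we were sampling the “first call” column rather than walking along a row:

```java
import java.util.Random;

public class RowsVsColumns {
    public static void main(String[] args) {
        // A "row": successive draws from one seeded generator.
        // These are designed to look uniformly distributed.
        Random single = new Random(1L);
        for (int call = 1; call <= 4; call++) {
            System.out.println("seed 1, call " + call + ": " + single.nextDouble());
        }

        // A "column": the FIRST draw from each of several nearby seeds.
        // Each value is a fixed function of its seed; nothing guarantees
        // that this set of numbers looks uniformly distributed.
        for (long seed = 1L; seed <= 4L; seed++) {
            System.out.println("seed " + seed + ", call 1: " + new Random(seed).nextDouble());
        }
    }
}
```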

The solution

Given all of the above, we decided to use our own random function for these purposes:

    
    // END_INDEX, RADIX, and PARTITION_SIZE are constants defined elsewhere in our codebase
    long attributeIdSha1 = Long.parseLong(DigestUtils.shaHex(attributeId.toString()).substring(0, END_INDEX), RADIX);
    double diceRoll = (double) (Math.abs(randomizedUserIdHash ^ attributeIdSha1) % PARTITION_SIZE) / PARTITION_SIZE;
    

Essentially we SHA-1 the attribute id for each multivariate test attribute to get 4 bytes, and then XOR them with the randomizedUserIdHash that we previously assigned to get a new random number. Again, it’s pseudorandom, but in our testing we saw enough of a spread to feel confident in our solution.
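For anyone who wants to experiment with the idea, here is a self-contained sketch of the approach. A few hedges: it uses the JDK’s MessageDigest where our code used commons-codec’s DigestUtils, the constants END_INDEX, RADIX, and PARTITION_SIZE are stand-in values we picked for illustration, and the diceRoll/userIdHash names are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashDiceRoll {
    // Stand-in constants for illustration; not our production values.
    static final int END_INDEX = 8;          // 8 hex chars = 4 bytes of the digest
    static final int RADIX = 16;             // the digest substring is hexadecimal
    static final long PARTITION_SIZE = 10_000L;

    // SHA-1 the attribute id, take a 4-byte prefix of the hex digest,
    // XOR it with the user's hash, and scale the result into [0, 1).
    static double diceRoll(long userIdHash, String attributeId) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(attributeId.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            long attributeIdSha1 = Long.parseLong(hex.substring(0, END_INDEX), RADIX);
            return (double) (Math.abs(userIdHash ^ attributeIdSha1) % PARTITION_SIZE) / PARTITION_SIZE;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 unavailable", e); // never on a standard JVM
        }
    }

    public static void main(String[] args) {
        // Sticky: the same user and attribute always produce the same roll,
        // while each attribute's digest scrambles the bits differently.
        System.out.println("design roll:  " + diceRoll(123456789L, "design-attribute"));
        System.out.println("ad copy roll: " + diceRoll(123456789L, "ad-copy-attribute"));
    }
}
```

Because every attribute mixes a different digest into the user’s hash, nearby user ids no longer feed near-identical seeds into a shared generator, which is what produced the correlation.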

We replayed a large set of ad calls against this new algorithm and found that the covariance disappeared. We had our solution.

Lessons learned

Random number generation is a hard problem. And while it’s great that all these libraries are out there to help us with this stuff, it’s really important to understand how they work so your assumptions don’t catch you off guard.

Orion Delwaterman
