Sunday, January 30, 2011

Random Samples from a huge set - Intuitive reservoir sampling

Consider a set of cardinality N. We are given a problem of collecting 'n' random samples from this set. This can be easily done with standard random number generators.

Please review my another blog, if you are not comfortable with Fisher-Yates algorithm for Random shuffling. http://karthikpresumes.blogspot.com/2011/06/random-shuffling-intuitive-fisher-yates.html

Now a bit more complex sampling. What if the number 'N' is not known a priori. So, we couldn't generate random numbers, as we don't know the range.
Interestingly, what if the parent set is keep growing; N is time-varying.

Two Pass algorithm:

Let us consider a situation where 'N' is not varying, but not known a priori. In this case, N could be calculated with a single pass of the set. After this pass, problem boils down to initial one, we discussed. Cool, but number of passes needed is two.

Single Pass Reservoir Sampling algorithm:

Is there any way to make it one pass? Yes. Reservoir sampling.


In this post, I will try to explain this algorithm in an intuitive way.

Base case:

Let us assume the samples set as reservoir of 'n' numbers. If n is equal to N, problem is base case, which is trivial. Reservoir in this case would have all the samples, perhaps in a different order.

Simple case:

Now, assume N = n+1, we need n samples out of n + 1.

Initially fill the reservoir with first n elements. After processing element at index n+1, reservoir's elements should be having the probability n / (n + 1) of being placed in reservoir. It means, n + 1th element would be placed with probability n / (n+1) . This could be easily achieved by generating a random number between 1 .. n + 1. If generated one is above n, then place the element in reservoir.

Placing an element in filled reservoir, needs some element else gets removed. Please remember, after all, probability of elements being in Reservoir should be n /N.

Probability of an element, x in reservoir to get removed is selecting n + 1 th element for being in Reservoir and selecting the element x for replacing; which is n / (n + 1) * 1 / n => 1 / (n + 1). So, probability of staying of x is n / (n + 1), 1 - 1 /(n + 1), which is the required probability. This could be easily generalized for any index above n.

Generalization:

For general index i, probability of being placed in Reservoir is n / i. Probability of just removing a random element from Reservoir is 1 / n. So, combined probability of a random element to get replaced is 1 / i. Probability of the same element to stay in Reservoir is (i - 1) / i, which was there in Reservoir with Probability n / (i - 1). So, combined probability is n / i.

Luckily, this will work for time-varying 'n' also.

Thanks for reading.

No comments:

Post a Comment