The Coupon Collector Problem

Java Applet Simulation of the coupon collector experiment

Random Variables

We will now study the coupon collector experiment of distributing balls randomly into the n cells indefinitely. Thus, we have the random variables

X₁, X₂, X₃, ... where X_i is the number of the cell containing ball i.

Our basic modeling assumption is that these random variables are independent and each is uniformly distributed on {1, 2, ..., n}. Our random variable of interest is the number of balls required until j cells are occupied. This variable can be written simply in terms of the number of occupied cells when k balls are distributed into n cells:

W_n,j = min{k: V_n,k = j}

The possible values of this variable are j, j + 1, j + 2, ....

The Simulation

In the coupon collector experiment, the total number of cells n and the number of cells to be occupied j can be varied with scroll bars. When you run the experiment, the cell counts of the occupied cells are shown in the large table. The density and moments of the number of balls required W is displayed in a table and in a graph. The value of W on each update is recorded in a table.

1. In the coupon collector experiment, set n = 50 and vary j. Note the shape of and position of the density graph. With j = 20, run the experiment in single-step mode a few times and watch the outcomes. Finally, run the experiment 1000 times with an update frequency of 10 and note the apparent convergence of the relative frequency distribution to the true distribution.

The Density Function

Now let's find the distribution of the number of balls needed to occupy j cells. The results of the previous section will be very helpful

$Mathematical Exercise$ 2. Argue that

W_n,j = k if and only if V_n,k-1 = j - 1 and V_n,k = j.

$Mathematical Exercise$ 3. Use Exercise 2 and a conditional probability argument to show that

4. In the coupon collector experiment, set n = 100 and vary j. Note the shape of and position of the density graph. With j = 50, run the experiment in single-step mode a few times and watch the outcomes. Finally, run the experiment 1000 times with an update frequency of 10 and note the apparent convergence of the relative frequency distribution to the true distribution.

Moments

We will now show that the number of balls needed to occupy j cells can be decomposed as a sum of j independent variables. This will provide some additional insight into the nature of the distribution and will make the computation of the mean and variance easy.

For i = 1, 2, ... n, let Y_i denote the number of additional balls needed to go from i - 1 occupied cells to i occupied cells.

$Mathematical Exercise$ 5. Argue that

Y₁, Y₂, ..., Y_n are independent.
Y_i has the geometric distribution with parameter p_i = (n - i + 1) / n.
W_n,j = Y₁ + Y₂ + ··· + Y_n.

Exercise 5 shows in a clear way that, each time a new cell is occupied, it becomes harder for another new cell to become occupied.

6. In the coupon collector experiment, set n = 30 and vary j. Note the shape of and position of the density graph. With j = 25, run the experiment in single-step mode a few times and watch the outcomes. Finally, run the experiment 1000 times with an update frequency of 10 and note the apparent convergence of the sample statistics to the distribution parameters.

$Mathematical Exercise$ 7. Use the results of Exercise 5 to show that

$Mathematical Exercise$ 8. Use the results of Exercise 5 to show that

$Mathematical Exercise$ 9. Use Exercises 7 and 8 to explicitly compute the mean and variance of the number of balls needed to occupy all j = 10 cells when there are n = 10 cells.

10. In the coupon collector experiment, set n = 10 and vary j. Note the shape of and position of the density graph. With j = 10, run the experiment in single-step mode a few times and watch the outcomes. Finally, run the experiment 1000 times with an update frequency of 10 and note the apparent convergence of the sample statistics to the distribution parameters.

$Mathematical Exercise$ 11. Use the Result of Exercise 5 to find the moment generating function of the number of balls needed to occupy j cells.

Recurrence Relation

An alternate approach to the distribution of the number of balls needed to occupy j cells is via a recursion formula.

$Mathematical Exercise$ 12. Let c_n,j(k) = P(W_n,j = k) for k = j, j + 1, ...

Use a conditional probability argument to show that