The Number of Red Balls
In the ball and urn experiment, let Y denote the random variable that gives the number of red balls in the sample.
Suppose first that the sampling is with replacement.
1. Show that the colors of successive balls drawn from the urn form a sequence of Bernoulli trials.
2. Use the result of Exercise 1 to show that Y has the binomial distribution with parameters n and p = R / N:
P(Y = k) = C(n, k) (R / N)^k (1 - R / N)^(n - k) for k = 0, 1, ..., n
In particular, the mean and variance are
E(Y) = n(R / N), var(Y) = n(R / N) (1 - R / N)
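The mean and variance above can be checked numerically. The following is a minimal sketch, assuming the standard binomial density C(n, k) p^k (1 - p)^(n - k) with p = R / N; the function name `binomial_pmf` and the parameter values (borrowed from Exercise 3) are illustrative choices, not part of the original text.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(Y = k): probability of k red balls in n draws with replacement."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Illustrative parameters (the values used in Exercise 3).
N, R, n = 50, 30, 10
p = R / N

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed directly from the density.
mean = sum(k * q for k, q in enumerate(pmf))
variance = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

print(f"sum of density = {sum(pmf):.6f}")
print(f"mean     = {mean:.6f}  (n R/N = {n * p:.6f})")
print(f"variance = {variance:.6f}  (n R/N (1 - R/N) = {n * p * (1 - p):.6f})")
```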
3. In the urn experiment, select sampling with replacement and random variable Y. Vary the parameters and note the shape of the graph of the density function. Now let N = 50, R = 30, and n = 10 and run the experiment with an update frequency of 100. Watch the apparent convergence of the relative frequency function to the density function.
Now suppose that the sampling is without replacement. To derive the probability density function of Y, we can consider the sample as an unordered subset (combination) of size n from the population of size N. Recall that these combinations are equally likely.
4. Show that
P(Y = k) = C(R, k) C(N - R, n - k) / C(N, n) for k = 0, 1, ..., n
This is known as the hypergeometric distribution with parameters N, R, and n. If we adopt the convention that C(m, j) = 0 for j > m then the formula for the density function is correct for k = 0, 1, ..., n.
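The convention C(m, j) = 0 for j > m is easy to see in a small example. The sketch below assumes the hypergeometric density C(R, k) C(N - R, n - k) / C(N, n); the function name `hypergeometric_pmf` and the small urn are hypothetical choices made so that some terms vanish. Python's `math.comb(m, j)` already returns 0 when j > m, so no special cases are needed.

```python
from math import comb

def hypergeometric_pmf(k: int, N: int, R: int, n: int) -> float:
    """P(Y = k): probability of k red balls in n draws without replacement."""
    if k < 0 or k > n:
        return 0.0
    # math.comb(m, j) is 0 when j > m, matching the convention in the text.
    return comb(R, k) * comb(N - R, n - k) / comb(N, n)

# Small illustrative urn: only 3 red balls, so k = 4 and k = 5 are impossible.
N, R, n = 10, 3, 5
pmf = [hypergeometric_pmf(k, N, R, n) for k in range(n + 1)]

for k, q in enumerate(pmf):
    print(f"P(Y = {k}) = {q:.6f}")
print(f"sum = {sum(pmf):.6f}")
```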
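5. Show the following alternate form of the hypergeometric density function in two ways: combinatorially, by treating the outcome as a permutation of size n chosen from the population of N balls, and algebraically, starting from the result in Exercise 4:
P(Y = k) = C(n, k) (R)_k (N - R)_(n - k) / (N)_n for k = 0, 1, ..., n
where (m)_j = m(m - 1) ··· (m - j + 1) is the number of permutations of size j chosen from m objects.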
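A numerical check can complement the two proofs. The sketch below assumes that the permutation form intended here is C(n, k) (R)_k (N - R)_(n - k) / (N)_n, where (m)_j = m(m - 1) ··· (m - j + 1), and compares it with the combination form C(R, k) C(N - R, n - k) / C(N, n). The function names are hypothetical; `math.perm(m, j)` computes (m)_j.

```python
from math import comb, isclose, perm

def pmf_combinations(k: int, N: int, R: int, n: int) -> float:
    """Density from unordered samples (combinations)."""
    return comb(R, k) * comb(N - R, n - k) / comb(N, n)

def pmf_permutations(k: int, N: int, R: int, n: int) -> float:
    """Density from ordered samples (permutations).

    comb(n, k) counts the positions of the red balls in the ordered sample;
    perm(m, j) = m (m - 1) ... (m - j + 1), and is 0 when j > m.
    """
    return comb(n, k) * perm(R, k) * perm(N - R, n - k) / perm(N, n)

# Illustrative parameters, not prescribed by the exercise.
N, R, n = 50, 30, 10
agree = all(
    isclose(pmf_combinations(k, N, R, n), pmf_permutations(k, N, R, n))
    for k in range(n + 1)
)
print("The two forms agree for every k:", agree)
```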
6. In the ball and urn experiment, select sampling without replacement and random variable Y. Vary the parameters and note the shape of the graph of the density function. Now let N = 50, R = 30, and n = 10 and run the experiment with an update frequency of 100. Watch the apparent convergence of the relative frequency function to the density function.
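Without the applet, the same experiment can be imitated in a few lines. This is a rough sketch, not the applet's implementation: it assumes the hypergeometric density C(R, k) C(N - R, n - k) / C(N, n), and the run count and seed are arbitrary choices.

```python
import random
from collections import Counter
from math import comb

N, R, n = 50, 30, 10       # parameters from the exercise
runs = 10_000              # arbitrary number of runs (an assumption)
rng = random.Random(0)     # fixed seed so the output is reproducible

# 1 marks a red ball, 0 a green ball.
urn = [1] * R + [0] * (N - R)

# Each run draws n balls without replacement and counts the red ones.
counts = Counter(sum(rng.sample(urn, n)) for _ in range(runs))

print(" k  rel. freq.  density")
for k in range(n + 1):
    rel_freq = counts[k] / runs
    density = comb(R, k) * comb(N - R, n - k) / comb(N, n)
    print(f"{k:2d}  {rel_freq:10.4f}  {density:7.4f}")
```

Increasing `runs` should bring the relative frequencies closer to the density, which is the convergence the exercise asks you to watch.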
Computing the mean and variance of Y directly from the hypergeometric distribution is difficult, so instead we will decompose Y into a sum of indicator variables:
Y = I1 + I2 + ··· + In
where Ij = 1 if the j'th ball is red and Ij = 0 if the j'th ball is green.
In the following problems a key fact is that the joint distribution of any sequence of m indicator variables is the same as that of any other sequence of m indicator variables (the exchangeable property).
7. Show that for any j,
E(Ij) = R / N.
8. Use the result of Exercise 7 to show that
E(Y) = n(R / N)
9. Show that
var(Ij) = (R / N)(1 - R / N)
10. Use basic properties of covariance and Exercises 7 and 9 to show that for distinct j and k,
cov(Ij, Ik) = -(R / N)(1 - R / N) / (N - 1), cor(Ij, Ik) = -1 / (N - 1)
Note from Exercise 10 that the event of a red ball on draw j and the event of a red ball on draw k are negatively correlated, but the correlation depends only on the population size and not on the number of red balls. Note also that the correlation is perfect if N = 2. Think about these results intuitively.
11. In the urn experiment, set N = 50, R = 20, and n = 10. Now run the experiment 500 times, updating after each run. Compute the empirical correlation of the events of a red ball on draw 3 and a red ball on draw 7. Compare with the theoretical result in Exercise 10.
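The empirical correlation can also be computed outside the applet. The sketch below assumes that the theoretical correlation for distinct draws is -1 / (N - 1); the seed and the helper name `correlation` are hypothetical. `random.sample` returns the balls in selection order, so positions 2 and 6 (zero-based) correspond to draws 3 and 7.

```python
import random

N, R, n = 50, 20, 10       # parameters from the exercise
runs = 500
rng = random.Random(1)     # fixed seed (an arbitrary choice)

urn = [1] * R + [0] * (N - R)   # 1 = red, 0 = green

draw3, draw7 = [], []
for _ in range(runs):
    sample = rng.sample(urn, n)   # ordered sample without replacement
    draw3.append(sample[2])       # indicator of red on draw 3
    draw7.append(sample[6])       # indicator of red on draw 7

def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / m
    var_a = sum((u - mean_a) ** 2 for u in a) / m
    var_b = sum((v - mean_b) ** 2 for v in b) / m
    return cov / (var_a * var_b) ** 0.5

empirical = correlation(draw3, draw7)
theoretical = -1 / (N - 1)
print(f"empirical correlation   = {empirical:.4f}")
print(f"theoretical correlation = {theoretical:.4f}")
```

With only 500 runs the sampling error of the empirical correlation is of the same order as the theoretical value itself, so close agreement should not be expected from a single experiment.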
12. Use the results of Exercises 9 and 10 and basic properties of covariance to show that
var(Y) = n(R / N)(1 - R / N)(N - n) / (N - 1)
13. Compare the mean and variance of Y when the sampling is with replacement and when the sampling is without replacement. For which distribution is the variance smaller? Does the result seem reasonable?
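For a concrete comparison, the moments can be computed directly from the two density functions. This sketch assumes the binomial density C(n, k) p^k (1 - p)^(n - k) with p = R / N and the hypergeometric density C(R, k) C(N - R, n - k) / C(N, n); the parameter values and the helper name `moments` are illustrative.

```python
from math import comb

def moments(pmf):
    """Return (mean, variance) of a density given as a list indexed by k."""
    mean = sum(k * q for k, q in enumerate(pmf))
    variance = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
    return mean, variance

N, R, n = 50, 30, 10   # illustrative parameters
p = R / N

with_replacement = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
without_replacement = [
    comb(R, k) * comb(N - R, n - k) / comb(N, n) for k in range(n + 1)
]

mean_w, var_w = moments(with_replacement)
mean_wo, var_wo = moments(without_replacement)

print(f"with replacement:    mean = {mean_w:.4f}, variance = {var_w:.4f}")
print(f"without replacement: mean = {mean_wo:.4f}, variance = {var_wo:.4f}")
print(f"variance ratio = {var_wo / var_w:.4f}, (N - n)/(N - 1) = {(N - n) / (N - 1):.4f}")
```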
Suppose now that R depends on N and that
R / N → p as N → ∞
We will show that for fixed n, the hypergeometric distribution with parameters N, R, and n converges to the binomial distribution with parameters n and p. Intuitively, this means that if N is large compared to n, then sampling n items without replacement is not too much different than sampling n items with replacement, and hence the hypergeometric distribution can be approximated by the binomial.
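The convergence can be illustrated numerically before it is proved. The sketch below holds R / N fixed at p = 0.3 (one simple way to satisfy the limit condition, chosen for illustration) and reports the largest gap between the hypergeometric and binomial densities as N grows; the function names and the values of N are assumptions.

```python
from math import comb

def hypergeometric_pmf(k: int, N: int, R: int, n: int) -> float:
    return comb(R, k) * comb(N - R, n - k) / comb(N, n)

def binomial_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 10
p = 0.3

gaps = {}
for N in (20, 100, 1000, 10000):
    R = 3 * N // 10   # keeps R / N equal to p = 0.3 exactly
    gaps[N] = max(
        abs(hypergeometric_pmf(k, N, R, n) - binomial_pmf(k, n, p))
        for k in range(n + 1)
    )
    print(f"N = {N:5d}, R = {R:4d}: largest gap = {gaps[N]:.6f}")
```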
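14. Use the result of Exercise 5 to show that
P(Y = k) = C(n, k) [R / N] [(R - 1) / (N - 1)] ··· [(R - k + 1) / (N - k + 1)] [(N - R) / (N - k)] [(N - R - 1) / (N - k - 1)] ··· [(N - R - n + k + 1) / (N - n + 1)]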
15. Complete the proof by showing that for fixed n and k,
P(Y = k) → C(n, k) p^k (1 - p)^(n - k) as N → ∞
16. In the Ball and Urn simulation set N = 100, n = 10, and R = 30. Run the simulation 1000 times, updating every 100 runs. Compare the relative frequency function, the hypergeometric density function, and the approximating binomial density function.
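The three functions in this exercise can be tabulated side by side. This is a sketch only, not the simulation itself: the seed is an arbitrary choice, and the densities are assumed to be the hypergeometric form C(R, k) C(N - R, n - k) / C(N, n) and the binomial form with p = R / N.

```python
import random
from collections import Counter
from math import comb

N, R, n = 100, 30, 10      # parameters from the exercise
runs = 1000
rng = random.Random(2)     # fixed seed (an arbitrary choice)
p = R / N

urn = [1] * R + [0] * (N - R)   # 1 = red, 0 = green
counts = Counter(sum(rng.sample(urn, n)) for _ in range(runs))

rows = []
for k in range(n + 1):
    rel_freq = counts[k] / runs
    hyper = comb(R, k) * comb(N - R, n - k) / comb(N, n)
    binom = comb(n, k) * p**k * (1 - p) ** (n - k)
    rows.append((k, rel_freq, hyper, binom))

print(" k  rel. freq.  hypergeometric  binomial")
for k, rel_freq, hyper, binom in rows:
    print(f"{k:2d}  {rel_freq:10.4f}  {hyper:14.4f}  {binom:8.4f}")
```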