Estimating Probabilities and Proportions
Simulation of the mean estimation experiment
Suppose that the random variable X of interest in our basic experiment is an indicator variable. Thus, by definition, X takes only the values 0 and 1 with probabilities
P(X = 0) = 1 - p, P(X = 1) = p
where p in [0, 1] is a parameter. The distribution of X is known as the Bernoulli distribution with parameter p.
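The Bernoulli distribution is easy to simulate directly. The following is a minimal sketch in Python (the function name bernoulli_sample and the seed are our own choices, not part of the text): each trial returns 1 with probability p and 0 otherwise, so the relative frequency of 1s in a large sample should be close to p.

```python
import random

def bernoulli_sample(p, n, rng):
    """Simulate n independent Bernoulli(p) trials as a list of 0s and 1s."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

rng = random.Random(17)  # fixed seed for reproducibility
sample = bernoulli_sample(0.3, 10000, rng)

# The relative frequency of 1s should be close to p = 0.3.
print(sum(sample) / len(sample))
```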
1. Show that the mean and variance of X are given by
a. E(X) = p
b. var(X) = p(1 − p)
If p is unknown, then by Exercise 1.a, the problem of estimating p is a special case of the more general problem of estimating an unknown mean.
The problem of estimating the unknown parameter p in a Bernoulli distribution is important enough to warrant its own section. The problem usually arises in terms of one of the following models:
1. X is the indicator variable of an event of interest in a basic experiment, and p is the probability of the event.
2. We select an object at random from a population in which each object is one of two types; X indicates whether the object is of the type of interest, and p is the proportion of objects of that type in the population.
In some cases, the estimation problem can be interpreted in both ways. For example, estimating the probability that a coin lands heads fits model 1, estimating the proportion of voters who favor a particular candidate fits model 2, and estimating the proportion of defective items produced by a process can be viewed either way.
Recall that our main assumption is that we repeat the basic experiment n times to generate a random sample of size n from the distribution of X:
(X1, X2, ..., Xn)
By definition, these are independent variables, each with the same Bernoulli distribution as X.
In the case of model 2 above, the sample variables will be independent if we sample with replacement. In practice, of course, we usually sample without replacement, in which case the sample variables are dependent. It turns out that the independence assumption is satisfied reasonably well if the population size is very large compared to the sample size. For more on these points, see the discussion of the Ball and Urn Experiment.
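One standard way to quantify the effect of sampling without replacement is the finite population correction: the variance of the sample proportion is p(1 − p)/n when sampling with replacement, and that quantity multiplied by (N − n)/(N − 1) when sampling without replacement from a population of size N. The sketch below (the function name is our own) computes both and shows that the correction factor is essentially 1 when N is much larger than n.

```python
def proportion_variance(p, n, N=None):
    """Variance of the sample proportion: p(1 - p)/n for sampling with
    replacement; multiplied by the finite population correction
    (N - n)/(N - 1) for sampling without replacement from a
    population of size N."""
    v = p * (1 - p) / n
    if N is not None:
        v *= (N - n) / (N - 1)
    return v

with_repl = proportion_variance(0.4, 100)
without_repl = proportion_variance(0.4, 100, N=1_000_000)
# The correction factor is (1_000_000 - 100) / 999_999, which is
# nearly 1, so the two variances are nearly identical when N >> n.
print(with_repl, without_repl, without_repl / with_repl)
```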
When estimating the parameter p in the Bernoulli distribution, it is customary to denote the sample mean by
p̂ = (X1 + X2 + ··· + Xn) / n
instead of the usual bar notation.
instead of the usual bar notation. In the context of model 1, the sample mean can be interpreted as the relative frequency of the event of interest. In the context of model 2, the sample mean can be interpreted as the sample proportion of objects of the type of interest.
2. Show that the sum X1 + X2 + ··· + Xn in the numerator of the sample mean has the binomial distribution with parameters n and p.
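The claim in Exercise 2 can be checked empirically. Below is a hedged sketch (the helper binomial_pmf and the parameter choices are ours): we simulate the sum many times and compare the observed frequencies with the binomial probabilities computed from math.comb.

```python
import math
import random

def binomial_pmf(k, n, p):
    """P(Y = k) where Y has the binomial distribution with parameters n, p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

rng = random.Random(29)
n, p, runs = 10, 0.5, 20000

# Simulate the sum X1 + ... + Xn many times and tally the outcomes.
counts = [0] * (n + 1)
for _ in range(runs):
    y = sum(1 if rng.random() < p else 0 for _ in range(n))
    counts[y] += 1

# Empirical frequencies should track the binomial probabilities.
for k in range(n + 1):
    print(k, counts[k] / runs, round(binomial_pmf(k, n, p), 4))
```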
If n is large, the normal procedure of Section 2 usually gives good approximate confidence intervals for p. However, by Exercise 1.b, it is never realistic to assume that the distribution standard deviation is known, since it depends on the unknown p; thus we must use the sample standard deviation.
3. Show that, in our new notation, the confidence bounds for p have the form
p̂ ± z √(p̂(1 − p̂) / n)
(since S² = [n / (n − 1)] p̂(1 − p̂), we have S / √n ≈ √(p̂(1 − p̂) / n) when n is large)
where, as usual, z is the appropriate standard normal quantile:
1. z = z(1 − α/2) for the two-sided confidence interval
2. z = z(1 − α) for the lower confidence bound
3. z = z(1 − α) for the upper confidence bound
Here z(q) denotes the quantile of order q of the standard normal distribution, and 1 − α is the confidence level.
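The confidence bounds of Exercise 3 can be computed directly. Below is a minimal sketch for the two-sided case (the function name wald_interval and the 90% default are our own choices); the standard normal quantile comes from statistics.NormalDist in the Python standard library.

```python
import math
from statistics import NormalDist

def wald_interval(successes, n, confidence=0.90):
    """Two-sided approximate confidence interval for p:
    p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n),
    where z is the standard normal quantile of order 1 - alpha/2."""
    p_hat = successes / n
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

lo, hi = wald_interval(55, 100, confidence=0.90)
print(lo, hi)
```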
In the following exercises, we can explore the estimation procedure empirically. Make sure that Use S and Use z are selected in the list boxes. This ensures that the simulation will construct the confidence bounds given in Exercise 3.
4. In the mean estimation experiment, select the Bernoulli distribution with p = 0.5. Select two-sided interval and confidence level 0.90. For each of the following sample sizes, run the experiment 1000 times with an update frequency of 10. Note how well the proportion of successful intervals approximates the theoretical confidence level.
5. In the mean estimation experiment, select the Bernoulli distribution with p = 0.8. Select lower bound and confidence level 0.80. For each of the following sample sizes, run the experiment 1000 times with an update frequency of 10. Note how well the proportion of successful intervals approximates the theoretical confidence level.
6. In the mean estimation experiment, select the Bernoulli distribution with p = 0.1. Select upper bound and confidence level 0.60. For each of the following sample sizes, run the experiment 1000 times with an update frequency of 10. Note how well the proportion of successful intervals approximates the theoretical confidence level.
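Experiments like those in Exercises 4–6 can also be imitated in ordinary code, without the applet. The following sketch (the function name coverage and the seed are our own choices) repeatedly draws a Bernoulli sample, forms the two-sided interval of Exercise 3, and reports the proportion of intervals that capture the true p; this proportion should be close to the nominal confidence level.

```python
import math
import random
from statistics import NormalDist

def coverage(p, n, confidence, runs, seed=3):
    """Estimate the coverage probability of the two-sided approximate
    interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(runs):
        p_hat = sum(1 if rng.random() < p else 0 for _ in range(n)) / n
        half = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - half <= p <= p_hat + half:
            hits += 1
    return hits / runs

# With p = 0.5 and a moderately large n, the proportion of successful
# intervals should be close to the nominal confidence level 0.90.
print(coverage(0.5, 100, 0.90, 1000))
```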
Computer simulations allow us to explore procedures that are impossible in practice, such as estimating p under the assumption that the distribution standard deviation is known. For the following exercise, select Use Sigma and Use z quantiles. The simulation will generate confidence bounds of the form
p̂ ± z √(p(1 − p) / n)
where z is the appropriate quantile as defined in equations 1, 2, and 3 above.
7. In the mean estimation experiment, select the Bernoulli distribution with p = 0.5. Select two-sided interval and confidence level 0.90. For each of the following sample sizes, run the experiment 1000 times with an update frequency of 10. Note how well the proportion of successful intervals approximates the theoretical confidence level and compare with Exercise 4.
8. Show that the variance of the Bernoulli distribution is maximized when p = 1/2 and thus the maximum variance is 1/4.
9. Use the result of Exercise 8 to show that the following formula gives a conservative confidence bound for p:
p̂ ± z / (2√n)
where z is the appropriate quantile, as defined in equations 1, 2, and 3 above.
Thus, the confidence intervals using the conservative bounds in Exercise 9 will be at least as wide as the intervals using the standard bounds in Exercise 3 or the artificial confidence bounds in Exercise 7.
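The comparison can be made concrete by computing half-widths. The sketch below (function names are ours) evaluates the standard half-width of Exercise 3 and the conservative half-width of Exercise 9 for several values of p̂: the conservative half-width does not depend on p̂, is never smaller than the standard one, and the two agree when p̂ = 1/2.

```python
import math
from statistics import NormalDist

def standard_half_width(p_hat, n, z):
    """Half-width of the approximate interval of Exercise 3."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def conservative_half_width(n, z):
    """Half-width of the conservative interval of Exercise 9,
    using the bound p(1 - p) <= 1/4."""
    return z / (2 * math.sqrt(n))

z = NormalDist().inv_cdf(0.95)  # two-sided, confidence level 0.90
n = 30
for p_hat in (0.1, 0.3, 0.5):
    print(p_hat, standard_half_width(p_hat, n, z),
          conservative_half_width(n, z))
```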
10. In the mean estimation experiment, select Use S and select the Bernoulli distribution with p = 0.5. Select two-sided interval, confidence level 0.90, and sample size n = 30. Run the experiment 100 times, updating after each run. Now for each run, compute the conservative confidence interval as in Exercise 9 and compare with the confidence interval generated by the simulation.