Mean, Variance, and Mean Square Error

[Applet: interactive histogram with mean square error graph]


Frequency Distributions

Recall that in our general notation, we have a data set with n points arranged in a frequency distribution with k classes. The class mark of the i-th class is denoted $x_i$, the frequency of the i-th class is denoted $f_i$, and the relative frequency of the i-th class is denoted $p_i = f_i / n$.

Mean, Variance and Standard Deviation

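Recall from Section 2 that the mean $m$, variance $s^2$, and standard deviation $s$ of a distribution are given by

$$m = \sum_{i=1}^k x_i p_i, \qquad s^2 = \sum_{i=1}^k (x_i - m)^2 p_i, \qquad s = \sqrt{s^2}$$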

The mean is a very natural measure of center, but the variance and standard deviation might seem a bit strange at first. You may have wondered, for example, why the spread of the distribution about the mean is measured in terms of the squared distances from the values to the mean, instead of some other function. The purpose of this section is to show that mean and variance complement each other in an essential way.
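
As a concrete illustration, the three summary statistics can be computed directly from the class marks and frequencies. The following is a minimal sketch, assuming the variance is the relative-frequency-weighted average of squared deviations from the mean (the form that matches Exercise 2 below). The function name `summary_stats` and the data are hypothetical, chosen only for illustration.

```python
from math import sqrt

def summary_stats(marks, freqs):
    """Return (mean, variance, standard deviation) of a frequency distribution.

    marks: class marks x_i; freqs: class frequencies f_i.
    Assumes the variance is weighted by relative frequencies p_i = f_i / n.
    """
    n = sum(freqs)
    rel = [f / n for f in freqs]  # relative frequencies p_i
    mean = sum(x * p for x, p in zip(marks, rel))
    variance = sum(p * (x - mean) ** 2 for x, p in zip(marks, rel))
    return mean, variance, sqrt(variance)

# Hypothetical data: 4 classes, 10 values in total.
marks = [1.0, 2.0, 3.0, 4.0]
freqs = [1, 2, 4, 3]
mean, variance, sd = summary_stats(marks, freqs)
print(mean, variance, sd)
```

For this data the mean is 2.9 and the variance is 0.89, up to floating-point rounding.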

In the applet above, the mean, variance, and standard deviation are recorded numerically in the second table. The mean and standard deviation are shown in the first graph as the horizontal red bar below the x-axis. This bar is centered at the mean and extends one standard deviation on either side.

Mean Square Error

In a sense, any measure of the center of a distribution should be associated with some measure of error. If we say that the number t is a good measure of center, then presumably we are saying that t represents the entire distribution better, in some way, than other numbers.

In this context, suppose that we measure the quality of t, as a measure of the center of the distribution, in terms of the mean square error

$$\mathrm{MSE}(t) = \sum_{i=1}^k (x_i - t)^2 p_i$$

MSE(t) is a weighted average of the squares of the distances between t and the class marks with the relative frequencies as the weight factors. Thus, the best measure of the center, relative to this measure of error, is the value of t that minimizes MSE.
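
The weighted average described above translates almost word for word into code. This is a minimal sketch; the function name `mse` and the sample data are assumptions made for illustration, not part of the applet.

```python
def mse(t, marks, freqs):
    """Mean square error of t as a measure of center.

    Weighted average of (x_i - t)^2 with relative frequencies f_i / n as weights.
    """
    n = sum(freqs)
    return sum((f / n) * (x - t) ** 2 for x, f in zip(marks, freqs))

# Hypothetical data: class marks and frequencies.
marks = [1.0, 2.0, 3.0, 4.0]
freqs = [1, 2, 4, 3]

# Compare a few candidate centers; smaller MSE means a better center.
for t in (2.0, 2.9, 3.0):
    print(t, mse(t, marks, freqs))
```

Of the three candidates, t = 2.9 gives the smallest error for this data.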

Mathematical Exercise 1. Note that MSE is a quadratic function of t. Thus, argue that the graph of MSE is a parabola opening upward.

Mathematical Exercise 2. Use standard calculus to show that the variance is the minimum value of MSE and that this minimum value occurs only when t is the mean.

The root mean square error, RMSE(t), is the square root of MSE(t).

Mathematical Exercise 3. Using the result of Exercise 2, argue that the standard deviation is the minimum value of RMSE and that this minimum value occurs only when t is the mean.

Exercises 2 and 3 show that the mean is the natural measure of center precisely when variance and standard deviation are used as the measures of spread.
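
The conclusions of Exercises 2 and 3 can be checked numerically before (or after) proving them. The sketch below searches a fine grid of candidate values of t and compares the minimizer with the mean and the minimum with the variance. It is a sanity check on hypothetical data, not a substitute for the calculus argument; the grid spacing of 0.001 is an arbitrary choice.

```python
from math import sqrt

def mse(t, marks, freqs):
    """Relative-frequency-weighted average of squared distances from t."""
    n = sum(freqs)
    return sum((f / n) * (x - t) ** 2 for x, f in zip(marks, freqs))

# Hypothetical data: class marks and frequencies.
marks = [1.0, 2.0, 3.0, 4.0]
freqs = [1, 2, 4, 3]
n = sum(freqs)

mean = sum(f * x for x, f in zip(marks, freqs)) / n
variance = mse(mean, marks, freqs)  # MSE evaluated at the mean

# Grid search over t in [0, 5] with step 0.001.
grid = [i / 1000 for i in range(5001)]
best_t = min(grid, key=lambda t: mse(t, marks, freqs))
min_mse = mse(best_t, marks, freqs)

print("minimizer:", best_t, "mean:", mean)
print("minimum MSE:", min_mse, "variance:", variance)
print("minimum RMSE:", sqrt(min_mse), "standard deviation:", sqrt(variance))
```

Because the square root is increasing, the same t minimizes both MSE and RMSE, which is why a single search suffices for both.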

Recall also that we can think of the relative frequency distribution as the probability distribution of a random variable X that gives the mark of the class containing a randomly chosen value from the data set. With this interpretation, MSE(t) is the second moment of X about t:

$$\mathrm{MSE}(t) = E\left[(X - t)^2\right]$$

The results in Exercises 1, 2, and 3 hold for general random variables (see the section on the variance).
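
The random-variable interpretation can be illustrated by simulation: draw class marks at random with probabilities proportional to the frequencies, and average the squared distances from t. This is a hedged sketch; the function names, the number of draws, and the seed are all arbitrary choices for illustration.

```python
import random

def mse_exact(t, marks, freqs):
    """Exact MSE(t) from the relative frequency distribution."""
    n = sum(freqs)
    return sum((f / n) * (x - t) ** 2 for x, f in zip(marks, freqs))

def mse_simulated(t, marks, freqs, draws=100_000, seed=0):
    """Monte Carlo estimate of E[(X - t)^2].

    X is the class mark of a randomly chosen data value, so each mark is
    drawn with probability proportional to its frequency.
    """
    rng = random.Random(seed)
    sample = rng.choices(marks, weights=freqs, k=draws)
    return sum((x - t) ** 2 for x in sample) / draws

# Hypothetical data: class marks and frequencies.
marks = [1.0, 2.0, 3.0, 4.0]
freqs = [1, 2, 4, 3]

t = 2.0
print("exact:    ", mse_exact(t, marks, freqs))
print("simulated:", mse_simulated(t, marks, freqs))
```

The two values should agree closely, with the gap shrinking as the number of draws grows.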

The Applet

As before, you can construct a frequency distribution and histogram for a continuous variable x by clicking on the horizontal axis from 0.1 to 5.0. You can select class width 0.1 with 50 classes, or width 0.2 with 25 classes, or width 0.5 with 10 classes, or width 1.0 with 5 classes, or width 5.0 with 1 class. The graph of MSE is shown to the right of the histogram. A red vertical line is drawn from the x-axis to the minimum value of the MSE function. By Exercise 2, this line intersects the x-axis at the mean and has height equal to the variance. Thus, this vertical line in the MSE graph gives essentially the same information as the horizontal bar in the histogram.

Additional Exercises

Simulation Exercise 4. In the applet, construct a frequency distribution with at least 5 nonempty classes and at least 10 values total. Compute the min, max, mean, and standard deviation by hand, and verify that you get the same results as the applet. Also, explicitly compute a formula for the MSE function.
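
One way to check a hand-computed formula for the MSE function is to expand the squares and collect powers of t, which gives coefficients a, b, c with MSE(t) = a t^2 + b t + c. The sketch below does this for a hypothetical distribution with 5 classes and 10 values; the class marks are assumed to be the midpoints of classes of width 1.0, and the function name is illustrative.

```python
def mse_coefficients(marks, freqs):
    """Coefficients (a, b, c) such that MSE(t) = a*t**2 + b*t + c.

    Obtained by expanding sum p_i * (x_i - t)^2 with p_i = f_i / n.
    """
    n = sum(freqs)
    a = sum(freqs) / n                                   # sum of p_i, equals 1
    b = -2 * sum(f * x for x, f in zip(marks, freqs)) / n
    c = sum(f * x * x for x, f in zip(marks, freqs)) / n
    return a, b, c

# Hypothetical data: 5 nonempty classes, 10 values total.
marks = [0.5, 1.5, 2.5, 3.5, 4.5]
freqs = [1, 3, 3, 2, 1]

a, b, c = mse_coefficients(marks, freqs)
print(f"MSE(t) = {a:g} t^2 + ({b:g}) t + {c:g}")

# Cross-check the quadratic against the direct definition at one point.
t = 1.0
direct = sum((f / sum(freqs)) * (x - t) ** 2 for x, f in zip(marks, freqs))
print(direct, a * t * t + b * t + c)
```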

Simulation Exercise 5. In the applet, set the class width to 0.1 and construct a distribution with at least 30 values of each of the types indicated below. Then increase the class width to each of the other four values. As you perform these operations, note the position and size of the mean ± standard deviation bar and the shape of the MSE graph.

  1. A uniform distribution.
  2. A symmetric, unimodal distribution.
  3. A unimodal distribution that is skewed right.
  4. A unimodal distribution that is skewed left.
  5. A symmetric bimodal distribution.
  6. A U-distribution.
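
The effect of changing the class width in Exercise 5 can also be explored offline. The sketch below regroups a small data set at each of the five widths and reports the mean and standard deviation of the resulting frequency distribution. Several details are assumptions, since the applet's conventions are not stated here: values lie on a grid of tenths from 0.1 to 5.0, the classes are the intervals (0, w], (w, 2w], and so on, and each class mark is the midpoint of its interval. Values are stored as integer tenths to avoid floating-point boundary problems. The data set is hypothetical and smaller than the 30 values the exercise asks for.

```python
from math import sqrt

def regroup(values_tenths, width_tenths):
    """Group values (in integer tenths) into classes of the given width.

    Assumes classes (0, w], (w, 2w], ... with midpoints as class marks.
    Returns (marks, freqs) for the nonempty classes only.
    """
    counts = {}
    for v in values_tenths:
        j = (v - 1) // width_tenths  # index of the class containing v
        counts[j] = counts.get(j, 0) + 1
    indices = sorted(counts)
    marks = [(j + 0.5) * width_tenths / 10 for j in indices]
    freqs = [counts[j] for j in indices]
    return marks, freqs

def mean_sd(marks, freqs):
    """Mean and standard deviation using relative-frequency weights."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(marks, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(marks, freqs)) / n
    return mean, sqrt(variance)

# Hypothetical right-skewed data, in tenths (3 means 0.3, and so on).
values = [3, 5, 8, 8, 12, 15, 15, 15, 21, 30, 42]

for w in (1, 2, 5, 10, 50):  # widths 0.1, 0.2, 0.5, 1.0, 5.0
    marks, freqs = regroup(values, w)
    m, s = mean_sd(marks, freqs)
    print(f"width {w / 10:.1f}: {len(marks)} nonempty classes, "
          f"mean {m:.3f}, sd {s:.3f}")
```

With a single class of width 5.0, every value is represented by the same class mark, so the standard deviation collapses to 0 and the MSE parabola touches the axis at its vertex.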
