Mean, Variance, and Mean Square Error
Interactive histogram with mean square error graph
Recall that, in our general notation, we have a data set with $n$ points arranged in a frequency distribution with $k$ classes. The class mark of the $i$th class is denoted $x_i$, the frequency of the $i$th class is denoted $f_i$, and the relative frequency of the $i$th class is denoted $p_i = f_i / n$.
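Recall from Section 2 that the mean $m$, variance $s^2$, and standard deviation $s$ of a distribution are given by
$$m = \sum_{i=1}^{k} p_i x_i, \qquad s^2 = \sum_{i=1}^{k} p_i (x_i - m)^2, \qquad s = \sqrt{s^2}$$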
The mean is a very natural measure of center, but the variance and standard deviation might seem a bit strange at first. You may have wondered, for example, why the spread of the distribution about the mean is measured in terms of the squared distances from the values to the mean, instead of some other function. The purpose of this section is to show that mean and variance complement each other in an essential way.
In the applet above, the mean, variance, and standard deviation are recorded numerically in the second table. The mean and standard deviation are shown in the first graph as the horizontal red bar below the x-axis. This bar is centered at the mean and extends one standard deviation on either side.
In a sense, any measure of the center of a distribution should be associated with some measure of error. If we say that the number t is a good measure of center, then presumably we are saying that t represents the entire distribution better, in some way, than other numbers.
In this context, suppose that we measure the quality of $t$, as a measure of the center of the distribution, in terms of the mean square error
$$\mathrm{MSE}(t) = \sum_{i=1}^{k} p_i (x_i - t)^2$$
MSE(t) is a weighted average of the squares of the distances between t and the class marks with the relative frequencies as the weight factors. Thus, the best measure of the center, relative to this measure of error, is the value of t that minimizes MSE.
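As a small illustration of this weighted average, the sketch below evaluates MSE at a few candidate centers. The class marks and frequencies are made up for the example, and the function name `mse` is only a convenient label, not something taken from the applet.

```python
def mse(t, marks, freqs):
    """Mean square error of t: relative-frequency-weighted average of (x_i - t)^2."""
    n = sum(freqs)
    return sum((f / n) * (x - t) ** 2 for x, f in zip(marks, freqs))

# Hypothetical frequency distribution, chosen only for illustration.
marks = [1.0, 2.0, 3.0, 4.0]   # class marks x_i
freqs = [1, 2, 4, 1]           # frequencies f_i, so n = 8

# Compare a few candidate values of t; smaller MSE means a better center.
for t in (2.0, 2.5, 3.0):
    print(t, mse(t, marks, freqs))
```

Trying more values of t by hand gives a feel for where the error is smallest before working the exercises.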
1. Note that MSE is a quadratic function of t. Thus, argue that the graph of MSE is a parabola opening upward.
2. Use standard calculus to show that the variance is the minimum value of MSE and that this minimum value occurs only when t is the mean.
The root mean-square error, RMSE, is the square root of MSE.
3. Using the result of Exercise 2, argue that the standard deviation is the minimum value of RMSE and that this minimum value occurs only when t is the mean.
Exercises 2 and 3 show that the mean is the natural measure of center precisely when variance and standard deviation are used as the measures of spread.
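The conclusions of Exercises 2 and 3 can be checked numerically (this is a sanity check, not a substitute for the calculus argument). The sketch below uses a made-up distribution and a simple grid search over t with step 0.001. The data and the helper names are assumptions for the example.

```python
import math

def summarize(marks, freqs):
    """Return relative frequencies, mean, and variance of a frequency distribution."""
    n = sum(freqs)
    p = [f / n for f in freqs]
    mean = sum(pi * x for pi, x in zip(p, marks))
    var = sum(pi * (x - mean) ** 2 for pi, x in zip(p, marks))
    return p, mean, var

def mse(t, marks, p):
    """Mean square error of t with relative frequencies p as weights."""
    return sum(pi * (x - t) ** 2 for pi, x in zip(p, marks))

# Hypothetical distribution: 5 classes of width 1.0 on (0, 5), 20 values.
marks = [0.5, 1.5, 2.5, 3.5, 4.5]
freqs = [2, 5, 8, 3, 2]
p, mean, var = summarize(marks, freqs)

# Grid search for the minimizer of MSE on [0, 5].
grid = [i / 1000 for i in range(5001)]
t_best = min(grid, key=lambda t: mse(t, marks, p))

print("mean:", mean, " grid minimizer:", t_best)
print("variance:", var, " minimum MSE:", mse(t_best, marks, p))
print("std dev:", math.sqrt(var), " minimum RMSE:", math.sqrt(mse(t_best, marks, p)))
```

A grid search is used only because it is easy to read. Since the square root is increasing, minimizing RMSE and minimizing MSE pick out the same t.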
Recall also that we can think of the relative frequency distribution as the probability distribution of a random variable X that gives the mark of the class containing a randomly chosen value from the data set. With this interpretation, the MSE(t) is the second moment of X about t:
$$\mathrm{MSE}(t) = E[(X - t)^2]$$
The results in Exercises 1, 2, and 3 hold for general random variables (see the section on the variance).
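The random-variable interpretation can be illustrated by simulation: draw class marks with probabilities equal to the relative frequencies and average the squared distances to t. The sketch below assumes a made-up distribution and a fixed seed. The sample average should be close to MSE(t), though not exactly equal.

```python
import random

# Hypothetical frequency distribution, chosen only for illustration.
marks = [1.0, 2.0, 3.0, 4.0]
freqs = [1, 2, 4, 1]
n = sum(freqs)
p = [f / n for f in freqs]

def mse(t):
    """Exact MSE(t) computed from the relative frequencies."""
    return sum(pi * (x - t) ** 2 for pi, x in zip(p, marks))

def second_moment_about(t, sample):
    """Sample average of (X - t)^2, an estimate of E[(X - t)^2]."""
    return sum((x - t) ** 2 for x in sample) / len(sample)

# Simulate X: the class mark of a randomly chosen data value.
rng = random.Random(0)  # fixed seed so the run is repeatable
draws = rng.choices(marks, weights=p, k=100_000)

t = 2.0
print("exact MSE(t):      ", mse(t))
print("simulated estimate:", second_moment_about(t, draws))
```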
As before, you can construct a frequency distribution and histogram for a continuous variable x by clicking on the horizontal axis from 0.1 to 5.0. You can select class width 0.1 with 50 classes, or width 0.2 with 25 classes, or width 0.5 with 10 classes, or width 1.0 with 5 classes, or width 5.0 with 1 class. The graph of MSE is shown to the right of the histogram. A red vertical line is drawn from the x-axis to the minimum value of the MSE function. By Exercise 2, this line intersects the x-axis at the mean and has height equal to the variance. Thus, this vertical line in the MSE graph gives essentially the same information as the horizontal bar in the histogram.
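To see how the choice of class width changes the summary statistics, the sketch below regroups one set of raw values at each of the five widths. It is not the applet's code. It assumes classes of the form [0, w), [w, 2w), and so on, with class marks at the midpoints, and the raw values are made up (chosen to avoid class boundaries).

```python
def grouped_stats(values, width, lo=0.0, hi=5.0):
    """Group raw values into classes of the given width; return (k, mean, variance)
    of the resulting frequency distribution, using midpoints as class marks."""
    k = round((hi - lo) / width)                 # number of classes
    freqs = [0] * k
    for v in values:
        i = min(int((v - lo) // width), k - 1)   # index of the class containing v
        freqs[i] += 1
    marks = [lo + (i + 0.5) * width for i in range(k)]
    n = len(values)
    mean = sum(f * x for f, x in zip(freqs, marks)) / n
    var = sum(f * (x - mean) ** 2 for f, x in zip(freqs, marks)) / n
    return k, mean, var

# Hypothetical raw data in (0, 5).
values = [0.45, 1.15, 1.35, 2.05, 2.25, 2.25, 2.65, 3.15, 3.85, 4.55]

results = {w: grouped_stats(values, w) for w in (0.1, 0.2, 0.5, 1.0, 5.0)}
for w, (k, mean, var) in results.items():
    print(f"width {w}: {k} classes, mean {mean:.4f}, variance {var:.4f}")
```

With a single class of width 5.0, every value is represented by the mark 2.5, so the variance collapses to zero and the MSE graph touches the axis at its vertex.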
4. In the applet, construct a frequency distribution with at least 5 nonempty classes and at least 10 values total. Compute the min, max, mean, and standard deviation by hand, and verify that you get the same results as the applet. Also, explicitly compute a formula for the MSE function.
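For checking hand computations like those in Exercise 4, exact rational arithmetic is convenient. Expanding the square in the weighted average gives MSE(t) = t^2 - 2 (Σ p_i x_i) t + Σ p_i x_i^2, and the sketch below computes these three coefficients for a made-up distribution with 5 nonempty classes and 10 values. The function name `mse_coefficients` is hypothetical.

```python
from fractions import Fraction

def mse_coefficients(marks, freqs):
    """Return (a, b, c) with MSE(t) = a*t^2 + b*t + c, as exact fractions."""
    n = sum(freqs)
    p = [Fraction(f, n) for f in freqs]
    mean = sum(pi * x for pi, x in zip(p, marks))          # sum of p_i x_i
    second = sum(pi * x * x for pi, x in zip(p, marks))    # sum of p_i x_i^2
    return Fraction(1), -2 * mean, second

# Hypothetical distribution: class width 1.0, marks at 0.5, 1.5, ..., 4.5.
marks = [Fraction(1, 2), Fraction(3, 2), Fraction(5, 2), Fraction(7, 2), Fraction(9, 2)]
freqs = [1, 3, 4, 1, 1]

a, b, c = mse_coefficients(marks, freqs)
print(f"MSE(t) = {a} t^2 + ({b}) t + {c}")

# Vertex of the parabola: location -b/(2a) and height c - b^2/(4a).
t_min = -b / (2 * a)
min_value = c - b * b / (4 * a)
print("vertex at t =", t_min, "with value", min_value)
```

Comparing the vertex with the mean and variance computed by hand is a quick way to catch arithmetic slips.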
5. In the applet, set the class width to 0.1 and construct a distribution with at least 30 values of each of the types indicated below. Then increase the class width to each of the other four values. As you perform these operations, note the position and size of the mean ± standard deviation bar and the shape of the MSE graph.