Median, Quartiles, and Boxplots

Java Applet Interactive histogram with boxplot

Frequency Distributions

Recall also that in our general notation, we have a data set with n points arranged in a frequency distribution with k classes. The class mark of the i'th class is denoted x_i; the frequency of the i'th class is denoted f_i, and the relative frequency of the i'th class is denoted p_i = f_i / n.

Ranks

The rank of a value in a data set is the position of the value when the data set is ordered from smallest to largest. Recall again that we think of a frequency distribution as an approximation of the original data set in which all values in a given class are "rounded" to the class mark. Thus, for a frequency distribution, we define the ranks as follows:

The f₁ points in the first class occupy ranks 1 to f₁.
The f₂ points in the second class occupy ranks f₁ + 1 to f₁ + f₂.
···
The f_k points in the k'th class occupy ranks f₁ + ··· + f_{k-1} + 1 to n.
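The rank assignment above can be sketched in a few lines of Python. This is only an illustration: the function name rank_ranges and the sample frequencies are our own assumptions, and a class with frequency 0 is reported as None because it holds no ranks.

```python
from itertools import accumulate

def rank_ranges(freqs):
    """Return (first_rank, last_rank) for each class, given the class frequencies."""
    cumulative = list(accumulate(freqs))  # f_1, f_1 + f_2, ..., n
    ranges = []
    for f, c in zip(freqs, cumulative):
        # The class ends at rank c and contains f points, so it starts at c - f + 1.
        ranges.append((c - f + 1, c) if f > 0 else None)
    return ranges

# Hypothetical frequencies for k = 4 classes (n = 10).
print(rank_ranges([3, 0, 5, 2]))  # [(1, 3), None, (4, 8), (9, 10)]
```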

Median

Quantiles are values that are a given fraction of the way through the data set. The most important of these quantiles is the median, the value that is roughly in the middle of the data set. If n is odd, the median is the single value in the middle, namely the value with rank (n + 1)/2. If n is even, there is not a single value in the middle, so the median is defined to be the average of the two middle values, namely the values with ranks n/2 and n/2 + 1.

From the frequency distribution, we can find the median as follows: If n is odd, find the smallest j such that

f₁ + ··· + f_j ≥ (n + 1)/2

and then the median is x_j. If n is even, find the smallest j and l such that

f₁ + ··· + f_j ≥ n/2 and f₁ + ··· + f_l ≥ n/2 + 1

and then the median is (x_j + x_l)/2.
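As a minimal sketch of this procedure, the Python code below locates the class that contains a given rank by searching the cumulative frequencies, and then applies the odd and even cases. The names value_at_rank and grouped_median, and the sample class marks and frequencies, are assumptions made for illustration.

```python
from bisect import bisect_left
from itertools import accumulate

def value_at_rank(marks, freqs, r):
    """Class mark of the class that contains rank r (ranks start at 1)."""
    cumulative = list(accumulate(freqs))
    # bisect_left gives the smallest index j with cumulative[j] >= r.
    return marks[bisect_left(cumulative, r)]

def grouped_median(marks, freqs):
    """Median of a frequency distribution with class marks and frequencies."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("the distribution has no data points")
    if n % 2 == 1:
        return value_at_rank(marks, freqs, (n + 1) // 2)
    lower = value_at_rank(marks, freqs, n // 2)
    upper = value_at_rank(marks, freqs, n // 2 + 1)
    return (lower + upper) / 2

marks = [1.0, 2.0, 3.0, 4.0]                 # hypothetical class marks
print(grouped_median(marks, [2, 3, 4, 1]))   # n = 10, ranks 5 and 6: 2.5
print(grouped_median(marks, [2, 3, 4, 2]))   # n = 11, rank 6: 3.0
```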

Quartiles

The median is a measure of the center of a distribution, based on the fact that roughly half of the data set falls below the median and half falls above the median. However, a measure of the center of a distribution is much more useful if there is a corresponding measure of dispersion that tells us how the distribution is spread out with respect to the center. For the median, a natural measure of dispersion can be obtained from the lower and upper quartiles.

The lower quartile is the value that is roughly 1/4 of the way through the data set and the upper quartile is the value that is roughly 3/4 of the way through the data set. There are minor variations in the formal definitions of these statistics in the literature, but we will use the following simple versions: the lower quartile, denoted Q₁, is the median of the values in the data set that are less than or equal to the median of the entire set. The upper quartile, denoted Q₃, is the median of the values in the data set that are greater than or equal to the median of the entire set. The median of the entire data set is itself a quartile (the second quartile) and hence is often denoted Q₂.

Since the lower and upper quartiles are themselves medians of reduced data sets, they can be computed by the general algorithm given above for computing medians.
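The following Python sketch applies the quartile definitions used here literally: it expands a frequency distribution into its "rounded" values and takes medians of the two reduced data sets. The helper names expand and quartiles and the sample distribution are assumptions for illustration, and other definitions of the quartiles found in the literature may give slightly different results.

```python
from statistics import median

def expand(marks, freqs):
    """List each class mark as many times as its frequency, in increasing order."""
    return [m for m, f in zip(marks, freqs) for _ in range(f)]

def quartiles(values):
    """Return (Q1, Q2, Q3) using the definitions in this section."""
    q2 = median(values)
    q1 = median([v for v in values if v <= q2])  # median of the lower part
    q3 = median([v for v in values if v >= q2])  # median of the upper part
    return q1, q2, q3

# Hypothetical distribution: class marks 1..5 with frequencies 1, 2, 3, 2, 1.
values = expand([1.0, 2.0, 3.0, 4.0, 5.0], [1, 2, 3, 2, 1])
print(quartiles(values))  # (2.5, 3.0, 3.5)
```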

The interquartile range is defined to be

IQR = Q₃ - Q₁.

The IQR gives a single number that measures the spread of the distribution about the median, but of course this number gives less information than the interval [Q₁, Q₃].

The Boxplot

The five parameters

min, Q₁, Q₂, Q₃, max

are often referred to as the five-number summary. Together, these parameters give a great deal of information about the distribution in terms of the center, spread, and skewness. Graphically, the five numbers are often displayed as a boxplot, which consists of a line extending from the min to the max, with a rectangular box from Q₁ to Q₃, and tick marks at the min, median and max.
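To make the construction concrete, here is a rough Python sketch that computes the five-number summary and the IQR for a small data set and draws a one-line text version of the boxplot. The function names, the sample data, the plotting range 0 to 5, and the characters used for the whiskers and box are all assumptions chosen for illustration; the applet's own drawing may differ.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the quartile definitions in this section."""
    q2 = median(values)
    q1 = median([v for v in values if v <= q2])
    q3 = median([v for v in values if v >= q2])
    return min(values), q1, q2, q3, max(values)

def ascii_boxplot(summary, lo=0.0, hi=5.0, width=51):
    """One text line: '|' at min and max, '-' for the line, '=' for the box, 'M' at the median."""
    def col(v):
        # Map a value in [lo, hi] to a column index in [0, width - 1].
        return round((v - lo) / (hi - lo) * (width - 1))
    mn, q1, q2, q3, mx = summary
    line = [" "] * width
    for c in range(col(mn), col(mx) + 1):
        line[c] = "-"
    for c in range(col(q1), col(q3) + 1):
        line[c] = "="
    line[col(mn)] = "|"
    line[col(mx)] = "|"
    line[col(q2)] = "M"
    return "".join(line)

data = [0.5, 1.0, 1.5, 1.5, 2.0, 2.5, 3.0, 4.5]   # hypothetical data set
summary = five_number_summary(data)
print(summary)                   # (0.5, 1.25, 1.75, 2.75, 4.5)
print(summary[3] - summary[1])   # IQR = Q3 - Q1 = 1.5
print(ascii_boxplot(summary))
```

In this example the upper part of the line, from Q₃ to the max, is longer than the lower part, from the min to Q₁, which suggests a distribution that is skewed right.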

The Applet

As before, you can construct a frequency distribution and histogram for a continuous variable x by clicking on the horizontal axis from 0.1 to 5.0. You can select class width 0.1 with 50 classes, or width 0.2 with 25 classes, or width 0.5 with 10 classes, or width 1.0 with 5 classes, or width 5.0 with 1 class. The boxplot is shown graphically below the x-axis, and the five numbers are recorded in the second table.

Exercises

1. In the applet, construct a frequency distribution with at least 6 classes and at least 10 values. Compute the parameters in the five-number summary by hand and verify that you get the same results as the applet.

2. In the applet, set the class width to 0.1 and construct a distribution with at least 30 values of each of the types indicated below. Then increase the class width to each of the other four values. As you perform these operations, note the shape of the boxplot and the relative positions of the parameters in the five-number summary:

A uniform distribution.
A symmetric, unimodal distribution.
A unimodal distribution that is skewed right.
A unimodal distribution that is skewed left.
A symmetric bimodal distribution.
A U-distribution.

3. In each case below, start with a distribution in which the five parameters satisfy

0.1 < min < Q₁ < Q₂ < Q₃ < max < 5.0

Add one additional point as described and note the effect on the boxplot:

Add a point between 0.1 and min.
Add a point between min and Q₁.
Add a point between Q₁ and Q₂.
Add a point between Q₂ and Q₃.
Add a point between Q₃ and max.
Add a point between max and 5.0.

In problem 3, you may have noticed that when you add an additional point to the distribution, one or more of the five parameters do not change. In general, quantiles can be relatively insensitive to changes in the data.
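This insensitivity can also be checked numerically. The Python sketch below recomputes the five-number summary after adding one point and reports which parameters changed; the function names and the sample data are assumptions for illustration, and the quartiles follow the definitions given earlier.

```python
from statistics import median

NAMES = ("min", "Q1", "Q2", "Q3", "max")

def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the quartile definitions in this section."""
    q2 = median(values)
    q1 = median([v for v in values if v <= q2])
    q3 = median([v for v in values if v >= q2])
    return min(values), q1, q2, q3, max(values)

def changed_parameters(values, new_point):
    """Names of the five parameters that change when new_point is added."""
    before = five_number_summary(values)
    after = five_number_summary(values + [new_point])
    return [name for name, b, a in zip(NAMES, before, after) if a != b]

# Hypothetical data set with summary (1.0, 2.5, 3.0, 3.5, 5.0).
data = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0]
print(changed_parameters(data, 2.8))  # point between Q1 and Q2: ['Q1']
print(changed_parameters(data, 4.5))  # point between Q3 and max: ['Q3']
```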