Frequency Distributions


Variables

In statistics, a variable is an assignment of a number to each element of the population. Thus, mathematically, a variable is actually a function defined on the population. If the population is a group of people, for example, then typical variables of interest might be height, weight, number of cars owned, and so on.

A discrete variable is one whose set of possible values is finite or countably infinite. Discrete variables are frequently counting variables, like the number of cars owned, in the example above. By contrast, a continuous variable is one whose set of possible values is an interval of real numbers. Continuous variables represent quantities, such as height and weight in the example above, that can, in theory, be measured to any degree of accuracy. In practice, of course, measuring devices have limited accuracy so data collected from a continuous variable is necessarily discrete. That is, there is only a finite (but perhaps very large) set of possible values that can actually be measured.

Frequency Distributions and Histograms

A frequency distribution is a summary of the data set in which the interval of possible values is divided into subintervals, known as classes. For each class, the number of data values in that class is recorded; this is the frequency of the class. The relative frequency of the class is the frequency of the class divided by the number of values in the data set.

An essential requirement for a frequency distribution is that the classes be mutually exclusive and exhaustive. That is, each value in the data set must belong to one and only one class. A desirable, but not essential, requirement is that the classes have the same width.
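The definitions above can be sketched in a few lines of Python. This is only an illustration, not part of the applet: the function name `frequency_distribution`, the sample heights, and the class boundaries are all made up for the example. Each class is taken to be a half-open interval [lower, upper), which is one simple way to make the classes mutually exclusive and exhaustive.

```python
from bisect import bisect_right

def frequency_distribution(data, boundaries):
    """Return (frequencies, relative frequencies) for half-open classes.

    boundaries is an increasing list [a0, a1, ..., ak]; class i (0-based)
    is the interval [boundaries[i], boundaries[i + 1]).
    """
    k = len(boundaries) - 1
    freqs = [0] * k
    for x in data:
        # Index of the unique class whose interval contains x.
        i = bisect_right(boundaries, x) - 1
        if not 0 <= i < k:
            raise ValueError(f"{x} lies outside every class")
        freqs[i] += 1
    n = len(data)
    rel = [f / n for f in freqs]
    return freqs, rel

# Hypothetical data: heights in centimeters, classes of width 10.
heights = [152, 160, 161, 175, 178, 179, 183, 190]
bounds = [150, 160, 170, 180, 190, 200]
freqs, rel = frequency_distribution(heights, bounds)
print(freqs)
print(rel)
```

Note that a value sitting exactly on a boundary, such as 160, is counted in the class to its right, so no value is ever counted twice.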

A histogram is simply a bar chart of a frequency distribution. For each class, a rectangle is drawn whose base is the class (on the horizontal axis) and whose height is the frequency (or relative frequency).
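As a rough illustration, a histogram can even be drawn in plain text. The sketch below is an assumption-laden toy, not the applet's drawing code: the function name `text_histogram` and the labels are invented, and the bars run horizontally (one row per class) rather than vertically.

```python
def text_histogram(labels, freqs):
    """Return one text row per class; bar length equals the frequency."""
    return [f"{label:>12} | {'#' * f}" for label, f in zip(labels, freqs)]

# Hypothetical classes and frequencies.
labels = ["[150, 160)", "[160, 170)", "[170, 180)"]
freqs = [1, 2, 3]
for row in text_histogram(labels, freqs):
    print(row)
```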

The Applet

In the frequency distribution applet, the horizontal axis represents a continuous variable x. You can click on the axis at any point from 0.1 to 5.0 to generate a data set. We are assuming that our measuring device, the mouse, is accurate to one decimal, so the values that you generate are stored by the computer to this accuracy. The frequencies and relative frequencies are recorded in the table on the left. As you click on the axis, the computer also draws the histogram of the frequency distribution.

You can choose any of five types of distributions:

50 classes of width 0.1: [0.05, 0.15), [0.15, 0.25), ..., [4.95, 5.05).
25 classes of width 0.2: [0.05, 0.25), [0.25, 0.45), ..., [4.85, 5.05).
10 classes of width 0.5: [0.05, 0.55), [0.55, 1.05), ..., [4.55, 5.05).
5 classes of width 1.0: [0.05, 1.05), [1.05, 2.05), ..., [4.05, 5.05).
1 class of width 5.0: [0.05, 5.05).
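The five choices above can be checked with a short Python sketch. The function name `class_boundaries` is hypothetical, and working in integer hundredths (so that 0.05 becomes 5 and 5.05 becomes 505) is an assumption made here only to avoid floating-point drift while stepping along the axis.

```python
def class_boundaries(width_hundredths):
    """Boundaries from 0.05 to 5.05 for a class width given in hundredths."""
    start, end = 5, 505  # 0.05 and 5.05, expressed in hundredths
    return [b / 100 for b in range(start, end + 1, width_hundredths)]

for w in (10, 20, 50, 100, 500):
    b = class_boundaries(w)
    print(f"width {w / 100}: {len(b) - 1} classes, "
          f"first [{b[0]}, {b[1]}), last [{b[-2]}, {b[-1]})")
```

In every case the number of classes times the class width equals 5.0, the length of the interval [0.05, 5.05).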

1. Click on the x-axis at various points to generate a data set with 20 values. Vary the class width over the five values from 0.1 to 5.0 and then back again. For each choice of class width, switch between the frequency histogram and the relative frequency histogram. Note how the shape of the histogram changes as you perform these operations.

As you can see, there is a tradeoff between the number of classes and the width of the classes; these determine the resolution of the frequency distribution. At one extreme, when the class width is 0.1, each class contains exactly one possible value, because we are assuming that the original data is recorded to one-decimal accuracy. In this case, there is no loss of information and we can recover the original data set from the frequency distribution. On the other hand, it can be hard to see the shape of the data when we have many classes of small width.
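To see why no information is lost at class width 0.1, here is a minimal sketch that rebuilds a data set from its frequency distribution. The function name `recover_data` and the sample frequencies are invented for illustration. What is recovered is the collection of values with their multiplicities; the order in which the points were clicked is not stored in the frequency distribution.

```python
def recover_data(freqs_by_value):
    """Expand a {value: frequency} table into the sorted list of data values."""
    data = []
    for value in sorted(freqs_by_value):
        data.extend([value] * freqs_by_value[value])
    return data

# Hypothetical frequency table at class width 0.1.
table = {0.7: 2, 1.3: 1, 2.0: 3}
print(recover_data(table))
```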

2. Set the class width to 0.1. Click on the x-axis to generate a data set with 10 distinct values and 20 values total. From the frequency distribution, explicitly write down the 20 values in the data set.

At the other extreme, when the class width is 5.0, there is only one class that contains all of the possible values of the data set. In this case, all information is lost, except the number of values in the data set.

Between these two extreme cases, when the width is 0.2, 0.5, or 1.0, the frequency distribution gives us partial information, but not complete information. These intermediate cases can show the shape of the data in a useful way.

3. For the distribution in Exercise 2, increase the class width to 0.2, 0.5, 1.0, and 5.0. Note how the histogram loses resolution; that is, how the frequency distribution loses information about the original data set.
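The loss of resolution can be mimicked in code by merging adjacent classes. The sketch below is hypothetical: the function name `coarsen` and the sample data are made up, and the data values are stored as integer tenths (7 stands for 0.7) so that the 50 classes of width 0.1 can be indexed directly.

```python
def coarsen(freqs, factor):
    """Merge each run of `factor` adjacent classes into one wider class."""
    if factor <= 0 or len(freqs) % factor != 0:
        raise ValueError("factor must evenly divide the number of classes")
    return [sum(freqs[i:i + factor]) for i in range(0, len(freqs), factor)]

# 50 classes of width 0.1; values are given in tenths (7 means 0.7).
fine = [0] * 50
for tenths in (7, 7, 8, 13, 20, 20, 20, 31):
    fine[tenths - 1] += 1

# Factors 2, 5, 10, 50 correspond to class widths 0.2, 0.5, 1.0, 5.0.
for factor in (2, 5, 10, 50):
    print(factor / 10, coarsen(fine, factor))
```

Merging never changes the total count n, but once classes are merged the individual values inside a class can no longer be told apart.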

It is important to realize that frequency data is inevitable for a continuous variable. For example, suppose that our variable represents the weight of a person (in pounds) and that our measuring device (a scale) is accurate to 0.1 pound. If we measure a person's weight as 153.2, then we are really saying that the weight is in the interval [153.15, 153.25). Similarly, when two persons have the same measured weight, the apparent equality of the weights is really just an artifact of the imprecision of the measuring device; actually the two persons almost certainly do not have the exact same weight. Thus, two persons with the same measured weight really give us a frequency count of 2 for a certain interval. One of the main purposes of this module is to encourage you to always think of data in terms of distributions.
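The interval implied by a measurement can be computed directly. In this sketch the function name `measurement_interval` is hypothetical, and we assume the device rounds to the nearest multiple of its resolution, so that a reading stands for the half-open interval extending half a resolution step on either side. Python's `decimal` module is used so that the endpoints are exact.

```python
from decimal import Decimal

def measurement_interval(reading, resolution):
    """Return (lower, upper) of the interval [lower, upper) behind a reading."""
    r = Decimal(reading)
    half = Decimal(resolution) / 2
    return r - half, r + half

lower, upper = measurement_interval("153.2", "0.1")
print(f"[{lower}, {upper})")
```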

Definitions and Notation

In general, suppose that we have a frequency distribution for a continuous variable x with k classes. We will denote the class boundaries by

a₀, a₁, ..., a_k

so that the i'th class has lower boundary a_{i-1} and upper boundary a_i.

The frequency of the i'th class will be denoted f_i for i = 1, 2, ..., k. Because of the mutually exclusive and exhaustive property of the frequency distribution, we must have

f₁ + f₂ + ··· + f_k = n,

where n is the number of values in the data set. The relative frequency of the i'th class is

p_i = f_i / n.

Note that the relative frequencies must sum to 1:

p₁ + p₂ + ··· + p_k = 1.

If we know n, the number of values in the data set, then the frequency and the relative frequency of a class are equivalent, in the sense that if we know one of these, we can find the other.
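The equivalence between frequencies and relative frequencies is easy to check numerically. The following is a minimal sketch with invented function names (`relative_frequencies`, `frequencies_from_relative`) and made-up frequencies; exact fractions are used so that the relative frequencies sum to exactly 1.

```python
from fractions import Fraction

def relative_frequencies(freqs):
    """p_i = f_i / n, where n is the sum of the frequencies."""
    n = sum(freqs)
    return [Fraction(f, n) for f in freqs]

def frequencies_from_relative(rel, n):
    """Invert the formula: f_i = p_i * n."""
    return [int(p * n) for p in rel]

freqs = [3, 5, 2]            # hypothetical frequencies, n = 10
p = relative_frequencies(freqs)
print(p, sum(p))
print(frequencies_from_relative(p, 10))
```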

4. Click on the x-axis to generate a data set with at least 10 classes and at least 20 values total. Vary the class width over the five values and for each class width, switch between the frequency histogram and the relative frequency histogram. Note that the frequency histogram and the relative frequency histogram look the same, except for the scale on the vertical axis.

The width of the i'th class is

w_i = a_i - a_{i-1}.

When the class widths are all the same, we will use w to denote the common value.

The class mark of a class is the midpoint of the class. For the i'th class, we will denote this by

x_i = (a_{i-1} + a_i) / 2.

It is usually best to think of a frequency distribution as an approximation of the original data set in which all the values in a class have been "rounded" to the class mark.
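The class widths, the class marks, and the "rounded" data set can all be computed from the boundaries and the frequencies. The names `class_widths`, `class_marks`, and `rounded_data`, as well as the sample boundaries, are assumptions made for this sketch.

```python
def class_widths(boundaries):
    """w_i = a_i - a_{i-1} for consecutive boundaries."""
    return [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]

def class_marks(boundaries):
    """x_i = (a_{i-1} + a_i) / 2, the midpoint of each class."""
    return [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]

def rounded_data(boundaries, freqs):
    """The data set with every value replaced by the mark of its class."""
    data = []
    for mark, f in zip(class_marks(boundaries), freqs):
        data.extend([mark] * f)
    return data

bounds = [0, 10, 20, 40]     # hypothetical boundaries, unequal widths
print(class_widths(bounds))
print(class_marks(bounds))
print(rounded_data(bounds, [2, 0, 1]))
```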

Parameters and Statistics

In general, a frequency distribution can represent the variable x over the entire population or over a sample (subset) of the population. In either case, numerical characteristics that capture interesting features of the distribution are important. When the distribution represents the entire population, such characteristics are called parameters; when the distribution represents a sample of the population, such characteristics are called statistics.

The distinction is not important in this module, so we will assume that the frequency distribution represents the entire population.

Minimum and Maximum Values

We define the minimum value of a distribution to be the smallest class mark whose class has positive frequency and the maximum value to be the largest class mark whose class has positive frequency. These are parameters of the distribution.

In the applet, the number of points n and the minimum and maximum values are recorded in the second table.
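These two parameters are simple to compute from the class marks and the frequencies. The sketch below is not the applet's code; the function name `min_max` and the sample values are made up for illustration.

```python
def min_max(marks, freqs):
    """Smallest and largest class marks among classes with positive frequency."""
    occupied = [x for x, f in zip(marks, freqs) if f > 0]
    if not occupied:
        raise ValueError("the distribution has no data values")
    return min(occupied), max(occupied)

marks = [5, 15, 25, 35]      # hypothetical class marks
freqs = [0, 4, 1, 0]         # the first and last classes are empty
print(min_max(marks, freqs))
```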

Types of Distributions

A uniform distribution is one in which all the non-empty classes have the same frequency.

A modal class is any class with maximum frequency. A unimodal distribution is one whose histogram has a single peak, so that the frequencies at first increase and then decrease. A bimodal distribution is one whose histogram has two peaks, so that the frequencies at first increase, then decrease, then increase again, and finally decrease again. Similarly, there can be trimodal distributions and so on.

A distribution is said to be symmetric if the histogram is roughly symmetric with respect to one of the class marks x_j, so that classes that are the same distance to the right and to the left of x_j have the same frequency.

A unimodal distribution is said to be skewed right if the histogram has a long tail to the right of the modal class. The distribution is said to be skewed left if the histogram has a long tail to the left of the modal class. Thus, skewed distributions are not symmetric.

A U-distribution is one whose histogram has the shape of the letter U, with large frequencies near the minimum and maximum values and small frequencies in the middle.
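Some of these definitions can be turned into simple checks on the list of frequencies. The sketch below makes several assumptions: the function names are invented, the classes are taken to have equal width, and symmetry is tested exactly rather than "roughly", so it is stricter than the informal definition above.

```python
def modal_classes(freqs):
    """Indices (0-based) of all classes with maximum frequency."""
    top = max(freqs)
    return [i for i, f in enumerate(freqs) if f == top]

def is_uniform(freqs):
    """True if all non-empty classes have the same frequency."""
    return len({f for f in freqs if f > 0}) == 1

def is_symmetric_about(freqs, j):
    """True if classes equally far left and right of class j match exactly.

    Classes beyond either end of the list are treated as having frequency 0.
    """
    k = len(freqs)

    def freq(i):
        return freqs[i] if 0 <= i < k else 0

    return all(freq(j - d) == freq(j + d) for d in range(1, k))

freqs = [1, 3, 5, 3, 1]      # hypothetical symmetric unimodal distribution
print(modal_classes(freqs), is_uniform(freqs), is_symmetric_about(freqs, 2))
```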

5. In each case below, set the class width to 0.1 and click on the axis to generate a distribution of the given type with 30 points. Now increase the class width to each of the other four values and describe the type of distribution.

A uniform distribution
A symmetric unimodal distribution
A unimodal distribution that is skewed right
A unimodal distribution that is skewed left
A symmetric bimodal distribution
A U-distribution

Probability Distributions

A relative frequency distribution has the mathematical structure of a discrete probability distribution. Indeed, suppose we perform the random experiment of selecting a value at random from the data set and recording the mark X of the class containing the value. Then X is a discrete random variable with density function

P(X = x_i) = p_i for i = 1, 2, ..., k.
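The random experiment can be simulated in a few lines. This is a sketch under stated assumptions: the function names `density` and `sample_mark`, the class marks, and the frequencies are all hypothetical. Selecting a data value at random and recording its class mark is the same as choosing a class mark with probability proportional to the class frequency, which is what `random.choices` does when given weights.

```python
import random

def density(marks, freqs):
    """Return {x_i: p_i} with p_i = f_i / n."""
    n = sum(freqs)
    return {x: f / n for x, f in zip(marks, freqs)}

def sample_mark(marks, freqs, rng):
    """Draw one class mark X with P(X = x_i) proportional to f_i."""
    return rng.choices(marks, weights=freqs, k=1)[0]

marks = [5, 15, 25]          # hypothetical class marks
freqs = [2, 6, 2]            # hypothetical frequencies, n = 10
rng = random.Random(0)       # seeded only so that runs are repeatable
print(density(marks, freqs))
print([sample_mark(marks, freqs, rng) for _ in range(10)])
```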