Radial Basis Functions. A type of neural network employing a hidden layer of radial units and an output layer of linear units, and characterized by reasonably fast training and reasonably compact networks. Introduced by Broomhead and Lowe (1988) and Moody and Darkin (1989), they are described in most good neural network text books (e.g., Bishop, 1995; Haykin, 1994). See, Neural Networks".

Radial Sampling (in Neural Networks). Radial sampling is a simple technique to assign centers to radial units in the first hidden layer of a network by randomly sampling training cases and copying those to the centers. This is a reasonable approach if the training data are distributed in a representative manner for the problem (Lowe, 1989).

The number of training cases must at least equal the number of centers to be assigned.

Random Effects (in Mixed Model ANOVA). The term random effects in the context of analysis of variance is used to denote factors in an ANOVA design with levels that were not deliberately arranged by the experimenter (those factors are called fixed effects), but which were sampled from a population of possible samples instead. For example, if one were interested in the effect that the quality of different schools has on academic proficiency, one could select a sample of schools to estimate the amount of variance in academic proficiency (component of variance) that is attributable to differences between schools.

A simple criterion for deciding whether or not an effect in an experiment is random or fixed is to ask how one would select (or arrange) the levels for the respective factor in a replication of the study. For example, if one wanted to replicate the study described in this example, one would choose (take a sample of) different schools from the population of schools. Thus, the factor "school" in this study would be a random factor. In contrast, if one wanted to compare the academic performance of boys to girls in an experiment with a fixed factor Gender, one would always arrange two groups: boys and girls. Hence, in this case the same (and in this case only) levels of the factor Gender would be chosen when one wanted to replicate the study.

Random Sub-Sampling in Data Mining. When mining huge data sets with many millions of observations, it is neither practical nor desirable to process all cases (although efficient incremental learning algorithms exist to perform predictive data mining using all observations in the dataset). For example, by properly sampling only 100 observations (from millions of observations) you can compute a very reliable estimate of the mean. One of the rules of statistical sampling that is often not intuitively understood by untrained "observers" is the fact that the reliability and validity of results depend, among many other things, on the size of a random sample, and not on the size of the population from which it is taken. In other words, the mean estimated from 100 randomly sampled observations is as accurate (i.e., falls within the same confidence limits) regardless of whether the sample was taken from 1000 cases or 100 billion cases. Put another way, given a certain (reasonable) degree of accuracy required, there is absolutely no need to process and include all observations in the final computations (for estimating the mean, fitting models, etc.).

Range Plots - Boxes. In this style of range plot, the range is represented by a "box" (i.e., as a rectangular box where the top of the box is the upper range and the bottom of the box is the lower range). The midpoints are represented either as point markers or horizontal lines that "cut" the box.

Range Plots - Columns. In this style of range plot, a column represents the mid-point (i.e., the top of the column is at the mid- point value) and the range (represented by "whiskers") is overlaid in the column.

Range Plots - Whiskers. In this style of range plot (see example above), the range is represented by "whiskers" (i.e., as a line with a serif on both ends). The midpoints are represented by point markers.

Rank. A rank is a consecutive number assigned to a specific observation in a sample of observations sorted by their values, and thus reflecting the ordinal relation of the observation to others in the sample. Depending on the order of sorting (ascending or descending), the higher ranks represent the higher values (i.e., ascending ranks, the lowest value is assigned a rank of 1, and the highest value - the "last" (highest) rank) or higher ranks represent the lower values (i.e., descending ranks, the highest value is assigned a rank of 1). See ordinal scale and Coombs, 1950.

Rank Correlation. A rank correlation coefficient is a coefficient of correlation between two random variables that is based on the ranks of the measurements and not the actual values, for example, see Spearman R, Kendall tau, Gamma. Detailed discussions of rank correlations can be found in Hays (1981), Kendall (1948, 1975), Everitt (1977), and Siegel and Castellan (1988). See also Nonparametric Statistics.

Ratio Scale. This scale of measurement contains an absolute zero point, therefore it allows you to not only quantify and compare the sizes of differences between values, but also to interpret both values in terms of absolute measures of quantity or amount (e.g., time; 3 hours is not only 2 hours more than 1 hour, but it is also 3 times more than 1 hour).

See also Multiple Regression.

Regression (in Neural Networks). In regression problems the purpose is to predict the value of a continuous output variable. Regression problems can done in Neural Networks using multilayer perceptrons, radial basis function networks, (Bayesian) regression networks, and linear networks.

Output Scaling. Multilayer perceptrons include Minimax scaling of both input and output variables. When the network is trained, shift and scale coefficients are determined for each variable, based on the minimum and maximum values in the training set, and the data is transformed by multiplying by the scale factor and adding the shift factor.

The net effect is that a 0.0 output activation level in the network is translated into the minimum value encountered in the training data, and a 1.0 activation level is translated into the maximum training data value. Consequently, the network is able to interpolate between the values represented in the training data. However, extrapolation outside the range encountered in the training set is more circumscribed. Two approaches to encoding the output are available, each of which allows a certain amount of extrapolation.

A logistic activation function is used for the output, with scaling factors determined so that the range encountered in the training set is mapped to a restricted part of the logistic functions (0,1) range (e.g. to [0.05, 0.95]. This allows a small amount of extrapolation (significant extrapolation from data is usually unjustified anyway). Using the logistic function makes training stable.
Uses an identity activation function in the final layer of the network. This supports a substantial amount of extrapolation, although not unlimited (the hidden units will saturate eventually). As a bonus, the final layer can be "fine-tuned" after iterative training using the pseudo-inverse technique. However, iterative training tends to be less stable than with a non-linear activation function, and the learning rate must be carefully chosen to avoid weight divergence during training (i.e. less than 0.1), if using an algorithm such as back propagation.

Outliers. Regression networks can be particularly prone to problems with outlying data. The use of the sum-squared network error function means that points lying far from the others have a disproportionate influence on the position of the hyperplanes used in regression. If these points are actually anomalies (for example, spurious points generated by the failure of measuring devices) they can substantially degrade the network's performance.

One approach to this problem is to train the network, test it on the training cases, isolate those that have extremely high error values and remove them, then to retrain the network.

If you believe the outlier is caused by a suspicious value for one of the variables in that case, you can delete that particular value, at which point the case is treated as having a missing value (see Missing Values, below).

Another approach is to use the city-block error function. Rather than summing the squared-differences in each variable to work out an error measure, this simply sums the absolute differences. Removing the square function makes training far less sensitive to outliers.

Whereas the amount of "pull" a case has on a hyperplane is proportional to the distance of the point from the hyperplane in the sum-squared error function, with the city block error function the pull is the same for all points, and the direction of pull simply depends on the side of the hyperplane to which the point lies. Effectively, the sum-squared error function attempts to find the mean, but the city-block error function attempts to find the median.

Missing Values. It is not uncommon to come across situations where the data for some cases has some values missing; perhaps because data was unavailable, or corrupted, when gathered. In such cases, you may still need to execute a network (to get the best estimate possible given the information available) or (and this is more suspect) use the partially complete data in training because of an acute shortage of training data.

Where possible, it is usually good practice not to use variables containing a great many missing values. Cases with missing values can be excluded.

Regression Summary Statistics (in Neural Networks). In regression problems, the purpose of the neural network is to learn a mapping from the input variables to a continuous output variable, or variables.

A network is successful at regression if it makes predictions that are more accurate than a simple estimate.

The simplest way to construct an estimate, given training data, is to calculate the mean of the training data, and use that mean as the predicted value for all previously unseen cases.

The average expected error from this procedure is the standard deviation of the training data. The aim in using a regression network is therefore to produce an estimate that has a lower prediction error standard deviation than the training data standard deviation.

The regression statistics are:

Data Mean. Average value of the target output variable.

Data S.D. Standard deviation of the target output variable.

Error Mean. Average error (residual between target and actual output values) of the output variable.

Abs. E. Mean. Average absolute error (difference between target and actual output values) of the output variable.

Error S.D. Standard deviation of errors for the output variable.

S.D. Ratio. The error:data standard deviation ratio.

Correlation. The standard Pearson-R correlation coefficient between the predicted and observed output values.

The degree of predictive accuracy needed varies from application to application. However, generally an s.d. ratio of 0.1 or lower indicates very good regression performance.

Regular Histogram. This simple histogram will produce a column plot of the frequency distribution for the selected variable (if more than one variable is selected, then one graph will be produced for each variable in the list).

Regularization (in Neural Networks). A modification to training algorithms which attempts to prevent over- or under-fitting of training data by building in a penalty factor for network complexity (typically by penalizing large weights, which correspond to networks modeling functions of high curvature) (Bishop, 1995).

See also Neural Networks.

Relative Function Change Criterion. The relative function change criterion is used to stop iteration when the function value is no longer changing (see Structural Equation Modeling). Basically, it stops iteration when the function ceases to change. The criterion is necessary because, sometimes, it is not possible to reduce the discrepancy function even when the gradient is not close to zero. This occurs, in particular, when one of the parameter estimates is at a boundary value. The "true minimum," where the gradient actually is zero, includes parameter values that are not permitted (like negative variances, or correlations greater than one).

On the i'th iteration, this criterion is equal to

Reliability. There are two very different ways in which this term can be used:

Reliability and item analysis. In this context reliability is defined as the extent to which a measurement taken with multiple-item scale (e.g., questionnaire) reflects mostly the so-called true score of the dimension that is to be measured, relative to the error. A similar notion of scale reliability is sometimes used when assessing the accuracy (and reliability) of gages or scales used in quality control charting. For additional details refer to the Reliability and Item Analysis chapter, or the description of Gage Repeatability/Reproducibility Analysis in the Process Analysis chapter.

Weibull and reliability/failure time analysis. In this context reliability is defined as the function that describes the probability of failure (or death) of an item as a function of time. Thus, the reliability function (commonly denoted as R(t)) is the complement to the cumulative distribution function (i.e., R(t)=1-F(t)); the reliability function is also sometimes referred to as the survivorship or survival function (since it describes the probability of not failing or surviving until a certain time t; e.g., see Lee, 1992). For additional information, see Weibull and Reliability/Failure Time Analysis in the Process Analysis chapter.

Reliability and Item Analysis. In many areas of research, the precise measurement of hypothesized processes or variables (theoretical constructs) poses a challenge by itself. For example, in psychology, the precise measurement of personality variables or attitudes is usually a necessary first step before any theories of personality or attitudes can be considered. In general, in all social sciences, unreliable measurements of people's beliefs or intentions will obviously hamper efforts to predict their behavior. The issue of precision of measurement will also come up in applied research, whenever variables are difficult to observe. For example, reliable measurement of employee performance is usually a difficult task; yet, it is obviously a necessary precursor to any performance-based compensation system.

In all of these cases, Reliability & Item Analysis may be used to construct reliable measurement scales, to improve existing scales, and to evaluate the reliability of scales already in use. Specifically, Reliability & Item Analysis will aid in the design and evaluation of sum scales, that is, scales that are made up of multiple individual measurements (e.g., different items, repeated measurements, different measurement devices, etc.). Reliability & Item Analysis provides numerous statistics that allow the user to build and evaluate scales following the so-called classical testing theory model.

For more information, see the Reliability and Item Analysis chapter.

The term reliability used in industrial statistics denotes a function describing the probability of failure (as a function of time). For a discussion of the concept of reliability as applied to product quality (e.g., in industrial statistics), please refer to the section on Reliability/Failure Time Analysis in the Process Analysis chapter (see also the section Repeatability and Reproducibility in the same chapter and the chapter Survival/Failure Time Analysis). For a comparison between these two (very different) concepts of reliability, see Reliability.

Resampling (in Neural Networks). A major problem with neural networks is the generalization issue (the tendency to overfit the training data), accompanied by the difficulty in quantifying likely performance on new data.

This difficulty can be disturbing if you are accustomed to the relative security of linear modeling, where a given set of data generates a single "optimal" linear model. However, this security may be somewhat deceptive, and if the underlying function is not linear, the model may be very far from optimal.

In contrast, in nonlinear modeling some choice must be made about the complexity (curvature, eccentricity) of the model, and this can lead to a plethora of alternative models. Given this diversity, it is important to have ways to estimate the performance of the models on new data, and to be able to select among them.

Most work on assessing performance in neural modeling concentrates on approaches to resampling. A neural network is optimized using a training subset. Often, a separate subset (the selection subset) is used to halt training to mitigate over-learning, or to select from a number of models trained with different parameters. Then, a third subset (the test subset) is used to perform an unbiased estimation of the network's likely performance.

Although the use of a test set allows us to generate unbiased performance estimates, these estimates may exhibit high variance. Ideally, we would like to repeat the training procedure a number of different times, each time using new training, selection and test cases drawn from the population - then, we could average the performance prediction over the different test subsets, to get a more reliable indicator of generalization performance.

In reality, we seldom have enough data to perform a number of training runs with entirely separate training, selection and test subsets. However, intuitively we might think we can do better if we train multiple networks, as when a single network is trained, only part of the data is actually involved in training. Can we find a way to use all the data in training, selection and test?

Cross validation is the most simple resampling technique. Suppose that we decide to conduct ten experiments with a given data set. We divide the data set into ten equal parts. Then, for each experiment we select one part to act as the test set. The other nine tenths of the data set are used for training and selection. When the ten experiments are finished, we can average the test set performances of the individual networks.

Cross validation has some obvious advantages. If training a single network, we would probably reserve 25% of the data for test. By using cross validation, we can reduce the individual test set size. In the most extreme version, leave-one-out cross validation, we perform a number of experiments equal to the size of the data set. On each experiment a single case is placed in the test subset, and the rest of the data is used for training. Clearly this may require a substantial number of experiments if the data set is large, but it can give you a very accurate estimate of generalization performance.

What precisely does cross validation tell us? In cross validation, each of the set of experiments should be performed with the same process parameters (same training algorithms, number of epochs, learning rates, etc.). The averaged performance measure is then an estimate of the performance on new data (drawn from the same distribution as the training data) of a single network trained using the same procedure (including the networks actually generated in the cross validation procedure).

We could select one of the cross validated networks at random and deploy it, using the estimates generated in cross validation to characterize its expected performance. However, this seems intuitively wasteful - having generated a number of networks, why not use them all? We can form the networks into an ensemble, and make predictions by averaging or voting across the resampled member networks (ensembles can also usefully combine the predictions of networks trained using different parameters, or of different architectures).

If we form an ensemble from the cross validated networks, is the performance estimate formed by averaging the test set performance of the individual networks an unbiased estimate of generalization performance?

The answer is: no. The expected performance of an ensemble is not, in general, the same as the average performance of the members. Actually, the expected performance of the ensemble is at least the average performance of the members, but usually better. Thus you can use the estimate so-formed, knowing that it is conservatively biased.

Cross validation is one technique for resampling data. There are others:

Random (Monte Carlo) resampling - the subsets are randomly sampled from the available cases. Each available case is assigned to one of the three subsets.
Bootstrapping - this technique (Efron, 1979) samples a data set with replacement (i.e. a single case may be randomly sampled several times into the bootstrap set). The bootstrap can be applied any number of times, for increased accuracy. Compared with random sampling, the use of sampling with replacement can help to iron out generalization problems caused by the finite size of the data set. Breiman (1996) suggested using the bootstrap sampling technique to train multiple models for ensemble averaging (in his case the models were decision trees, but the conclusions carry over to other models), a technique he refers to as bagging.

Residual. Residuals are differences between the observed values and the corresponding values that are predicted by the model and thus they represent the variance that is not explained by the model. The better the fit of the model, the smaller the values of residuals. The ith residual (e_i) is equal to:

e_i = (y_i - y_i-hat)

where
y_i is the ith observed value
y_i-hat is the corresponding predicted value

See, Multiple Regression. See also, standard residual value, Mahalanobis distance, deleted residual and Cook�s distance.

Resolution. An experimental design of resolution R is one in which no l-way interactions are confounded with any other interaction of order less than R - l. For example, in a design of resolution R equal to 5, no l = 2-way interactions are confounded with any other interaction of order less than R - l = 3, so main effects are unconfounded with each other, main effects are unconfounded with 2-way interactions, and 2-way interactions are unconfounded with each other. For discussions of the role of resolution in experimental design see 2**(k-p) fractional factorial designs and 2**(k-p) Maximally Unconfounded and Minimum Aberration Designs.

Response Surface. A surface plotted in three dimensions, indicating the response of one or more variable (or a neural network) as two input variables are adjusted with the others held constant. See DOE, Neural Networks.

RMS (Root Mean Squared) Error. To calculate the RMS (root mean squared) error the individual errors are squared, added together, divided by the number of individual errors, and then square rooted. Gives a single number which summarizes the overall error. See Neural Networks.

Root Mean Square Standardized Effect (RMSSE). This standardized measure of effect size is used in the Analysis of Variance to characterize the overall level of population effects. It is the square root of the sum of squared standardized effects divided by the number of degrees of freedom for the effect. For example, in a 1-Way Anova, the RMSSE is calculated as

For more information see the chapter on Power Analysis.

Rosenbrock Pattern Search. This Nonlinear Estimation method will rotate the parameter space and align one axis with a ridge (this method is also called the method of rotating coordinates); all other axes will remain orthogonal to this axis. If the loss function is unimodal and has detectable ridges pointing towards the minimum of the function, then this method will proceed with a considerable accuracy towards the minimum of the function.

Runs Tests (in Quality Control). These tests are designed to detect patterns measurement (e.g., sample means) that may indicate that the process is out of control. In quality control charting, when a sample point (e.g., a mean in an X-bar chart) falls outside the control lines, one has reason to believe that the process may no longer be in control. In addition, one should look for systematic patterns of points (e.g., means) across samples, because such patterns may indicate that the process average has shifted. Most quality control software packages will (optionally) perform the standard set of tests for such patterns; these tests are also sometimes referred to as AT&T runs rules (see AT&T, 1959) or tests for special causes (e.g., see Nelson, 1984, 1985; Grant and Leavenworth, 1980; Shirland, 1993). The term special or assignable causes as opposed to chance or common causes was used by Shewhart to distinguish between a process that is in control, with variation due to random (chance) causes only, from a process that is out of control, with variation that is due to some non-chance or special (assignable) factors (cf. Montgomery, 1991, p. 102).

As the sigma control limits for quality control charts, the runs rules are based on "statistical" reasoning. For example, the probability of any sample mean in an X-bar control chart falling above the center line is equal to 0.5, provided (1) that the process is in control (i.e., that the center line value is equal to the population mean), (2) that consecutive sample means are independent (i.e., not auto-correlated), and (3) that the distribution of means follows the normal distribution. Simply stated, under those conditions there is a 50-50 chance that a mean will fall above or below the center line. Thus, the probability that two consecutive means will fall above the center line is equal to 0.5 times 0.5 = 0.25.

For additional information see Runs Tests; see also Assignable causes and actions.