Machine Learning. Machine learning, computational learning theory, and similar terms are often used in the context of Data Mining to denote the application of generic model-fitting or classification algorithms for predictive data mining. Unlike traditional statistical data analysis, which is usually concerned with the estimation of population parameters by statistical inference, the emphasis in data mining (and machine learning) is usually on the accuracy of prediction (predicted classification), regardless of whether or not the "models" or techniques that are used to generate the predictions are interpretable or open to simple explanation. Good examples of techniques often applied to predictive data mining are neural networks and meta-learning techniques such as boosting. These methods usually involve the fitting of very complex "generic" models that are not related to any reasoning or theoretical understanding of the underlying causal processes; instead, they can be shown to generate accurate predictions or classifications in cross-validation samples.

Mahalanobis distance. One can think of the independent variables (in a regression equation) as defining a multidimensional space in which each observation can be plotted. Also, one can plot a point representing the means for all independent variables. This "mean point" in the multidimensional space is also called the centroid. The Mahalanobis distance is the distance of a case from the centroid in the multidimensional space, defined by the correlated independent variables (if the independent variables are uncorrelated, it is the same as the simple Euclidean distance). Thus, this measure provides an indication of whether or not an observation is an outlier with respect to the independent variable values.

See also, standard residual value, deleted residual and Cook’s distance.
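
To make the computation concrete, here is a minimal numpy sketch (not any particular package's implementation; the data and function name are hypothetical) that measures each case's distance from the centroid, taking the correlations among the independent variables into account via the inverse covariance matrix:

```python
import numpy as np

def mahalanobis_from_centroid(X):
    """Mahalanobis distance of each row of X from the centroid of X.

    X is an (n cases x p variables) array of independent-variable values;
    returns an array of n distances.
    """
    centroid = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))  # assumes a nonsingular covariance matrix
    diffs = X - centroid
    # d_i = sqrt((x_i - centroid)' S^-1 (x_i - centroid))
    return np.sqrt(np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs))

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [2.0, 3.0], [10.0, 3.0]])
print(mahalanobis_from_centroid(X))  # the last case stands out as a potential outlier
```

If the variables were uncorrelated with unit variances, the covariance matrix would be the identity and these distances would reduce to simple Euclidean distances, as noted above.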

Mallow's CP. If p regressors are selected from a set of k, Cp is defined as:

Cp = Σ(y - ŷp)²/s² - n + 2p

where
ŷp    is the predicted value of y from the p regressors
s²     is the residual mean square after regression on the complete set of k
n      is the sample size

The model is then chosen to give a minimum value of the criterion, or a value that is acceptably small. It is essentially a special case of the Akaike Information Criterion. Mallow's CP is used in General Regression Models (GRM) as the criterion for choosing the best subset of predictor effects when a best subset regression analysis is being performed. This measure of the quality of fit for a model tends to be less dependent (than the R-square) on the number of effects in the model, and hence, it tends to find the best subset that includes only the important predictors of the respective dependent variable. See Best Subset Regression Options in GRM for further details.
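
As a hedged illustration of the formula above (an ordinary-least-squares sketch in Python; the function and data names are hypothetical, not GRM's implementation):

```python
import numpy as np

def mallows_cp(y, X_subset, X_full):
    """Cp = SS_res(subset)/s2 - n + 2p, where s2 is the residual mean
    square of the full model and p counts the fitted coefficients of
    the subset (including its intercept column, if present)."""
    n = len(y)
    def ss_res(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return resid @ resid
    k = X_full.shape[1]
    s2 = ss_res(X_full) / (n - k)       # residual mean square, full model
    p = X_subset.shape[1]
    return ss_res(X_subset) / s2 - n + 2 * p

rng = np.random.default_rng(0)
X_full = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y = X_full @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.normal(size=50)
# Subset of intercept + first regressor: Cp should be close to p = 2
print(mallows_cp(y, X_full[:, :2], X_full))
```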

Mann-Scheuer-Fertig Test. This test, proposed by Mann, Scheuer, and Fertig (1973), is described in detail in, for example, Dodson (1994) or Lawless (1982). The null hypothesis for this test is that the population follows the Weibull distribution with the estimated parameters. Nelson (1982) reports this test to have reasonably good power, and this test can be applied to Type II censored data. For computational details refer to Dodson (1994) or Lawless (1982); the critical values for the test statistic have been computed based on Monte Carlo studies, and have been tabulated for n (sample sizes) between 3 and 25; for n greater than 25, this test is not computed.

The Mann-Scheuer-Fertig test is used in Weibull and Reliability/Failure Time Analysis; see also, Hollander-Proschan Test and Anderson-Darling Test.

Marginal Frequencies. In a Multi-way table, the values in the margins of the table are simply one-way (frequency) tables for all values in the table. They are important in that they help us to evaluate the arrangement of frequencies in individual columns or rows. The differences between the distributions of frequencies in individual rows (or columns) and in the respective margins inform us about the relationship between the crosstabulated variables.

For more information on Marginal frequencies, see the Crosstabulations section of the Basic Statistics chapter.

Markov Chain Monte Carlo (MCMC). The term "Monte Carlo method" (suggested by John von Neumann and S. M. Ulam in the 1940s) refers to the simulation of processes using random numbers. The name Monte Carlo (after the city long known for its gambling casinos) derives from the fact that "numbers of chance" (i.e., random numbers) were used to solve some of the integrals of the complex equations involved in the design of the first nuclear bombs (integrals of quantum dynamics). By generating large samples of random numbers from, for example, mixtures of distributions, the integrals of these (complex) distributions can be approximated from the (generated) data.

Complex equations with difficult to solve integrals are often involved in Bayesian Statistics Analyses. For a simple example of the MCMC method for generating bivariate normal random variables, see the description of the Gibbs Sampler.

For a detailed discussion of MCMC methods, see Gilks, Richardson, and Spiegelhalter (1996). See also the description of the Gibbs Sampler, and Bayesian Statistics (Analysis).
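
For instance, a minimal Gibbs sampler for a standard bivariate normal distribution with correlation rho can be sketched as follows (illustrative only; the full conditionals used here are the standard ones described under the Gibbs Sampler entry):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8              # target correlation of the bivariate normal
n_draws = 10_000
x = y = 0.0
samples = np.empty((n_draws, 2))
for i in range(n_draws):
    # Full conditionals of a standard bivariate normal:
    # x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y | x
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples[i] = (x, y)
print(np.corrcoef(samples, rowvar=False)[0, 1])  # approaches rho
```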

Mass. The term mass in correspondence analysis is used to denote the entries in the two-way table of relative frequencies (i.e., each entry is divided by the sum of all entries in the table). Note that the results from correspondence analysis are still valid if the entries in the table are not frequencies, but some other measure of correspondence, association, similarity, confusion, etc. Since the sum of all entries in the table of relative frequencies is equal to 1.0, one could say that the table of relative frequencies shows how one unit of mass is distributed across the cells of the table. In the terminology of correspondence analysis, the row and column totals of the table of relative frequencies are called the row mass and column mass, respectively.

Manifest Variable. A manifest variable is a variable that is directly observable or measurable. In path analysis diagrams used in structural modeling (see Path Diagram), manifest variables are usually represented by enclosing the variable name within a square or a rectangle.

Matching Moments Method. This method can be employed to determine parameter estimates for a distribution (see Quantile-Quantile Plots, Probability-Probability Plots, and Process Analysis). The method of matching moments sets the distribution moments equal to the data moments and solves to obtain estimates for the distribution parameters. For example, for a distribution with two parameters, the first two moments of the distribution (the mean and variance of the distribution, μ and σ², respectively) would be set equal to the first two moments of the data (the sample mean and variance, i.e., the estimators x̄ and s², respectively) and solved for the parameter estimates. Alternatively, you could use the Maximum Likelihood Method to estimate the parameters. For more information, see Hahn and Shapiro, 1994.
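
For example, for the two-parameter gamma distribution with shape a and scale b, the first two moments are E[X] = ab and Var[X] = ab². A brief Python sketch (hypothetical names, simulated data) solves these two equations for the parameter estimates:

```python
import numpy as np

def gamma_moment_estimates(x):
    """Match E[X] = a*b and Var[X] = a*b**2 for a gamma distribution
    with shape a and scale b, then solve for (a, b)."""
    mean, var = x.mean(), x.var(ddof=1)   # sample mean and unbiased sample variance
    scale = var / mean                    # b = s^2 / x-bar
    shape = mean / scale                  # a = x-bar^2 / s^2
    return shape, scale

rng = np.random.default_rng(1)
x = rng.gamma(shape=3.0, scale=2.0, size=5_000)
print(gamma_moment_estimates(x))  # close to the true values (3.0, 2.0)
```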

Matrix Collinearity, Multicollinearity. This term is used in the context of correlation matrices or covariance matrices to describe the condition in which one or more of the variables from which the respective matrix was computed are linear functions of other variables; as a consequence, such matrices cannot be inverted (only the generalized inverse can be computed). See also Matrix Singularity for additional details.

Matrix Ill-conditioning. Matrix ill-conditioning is a general term used to describe a rectangular matrix of values which is unsuitable for use in a particular analysis.

This occurs perhaps most frequently in applications of linear multiple regression, when the matrix of correlations for the predictors is singular and thus the regular matrix inverse cannot be computed. In some modules (e.g., in Factor Analysis) this problem is dealt with by issuing a warning and then artificially lowering all correlations in the correlation matrix: a small constant is added to the diagonal elements of the matrix, which is then restandardized. This procedure will usually yield a matrix for which the regular matrix inverse can be computed.

Note that in many applications of the general linear model and the generalized linear model, matrix singularity is not abnormal (e.g., when the overparameterized model is used to represent effects for categorical predictor variables) and is dealt with by computing a generalized inverse rather than the regular matrix inverse.

Another example of matrix ill-conditioning is intransitivity of the correlations in a correlation matrix. If in a correlation matrix variable A is highly positively correlated with B, B is highly positively correlated with C, and A is highly negatively correlated with C, this "impossible" pattern of correlations signals an error in the elements of the matrix.

See also matrix singularity, matrix inverse, generalized inverse.
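
A minimal sketch of the diagonal-boosting remedy described above (the constant 0.01 is an arbitrary assumption for illustration, not the value any particular module uses):

```python
import numpy as np

def ridge_correlations(R, epsilon=0.01):
    """Lower all off-diagonal correlations slightly by adding a small
    constant to the diagonal and restandardizing back to unit diagonal."""
    R_boosted = R + epsilon * np.eye(R.shape[0])
    d = np.sqrt(np.diag(R_boosted))
    return R_boosted / np.outer(d, d)   # unit diagonal again; invertible in practice

R = np.array([[1.0, 1.0],
              [1.0, 1.0]])              # singular: the two variables are identical
print(np.linalg.inv(ridge_correlations(R)))  # the regular inverse now exists
```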

Matrix Inverse. The regular inverse of a square matrix of values is an extension of the concept of a numeric reciprocal. For a nonsingular square matrix A, its inverse (denoted by a superscript of -1) is the unique matrix that satisfies

A⁻¹A = AA⁻¹ = I

where I is the identity matrix.

No such regular inverse exists for singular matrices, but generalized inverses (an infinite number of them) can be computed for any singular matrix.

See also matrix singularity, generalized inverse.
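
In numpy terms (an illustration, not part of any statistical module discussed here), the regular inverse of a nonsingular matrix and the Moore-Penrose generalized inverse of a singular one, which is one particular choice among the infinitely many generalized inverses, can be contrasted as follows:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 1.0]])    # nonsingular
A_inv = np.linalg.inv(A)
print(np.allclose(A_inv @ A, np.eye(2)))  # True: A^-1 A = I

B = np.array([[1.0, 2.0], [-1.0, -2.0]])  # singular (second column = 2 * first)
B_pinv = np.linalg.pinv(B)                # Moore-Penrose generalized inverse
print(np.allclose(B @ B_pinv @ B, B))     # True: a defining property of a g-inverse
```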

Matrix Plots. Matrix graphs summarize the relationships between several variables in a matrix of true X-Y plots. The most common type of matrix plot is the scatterplot matrix, which can be considered to be the graphical equivalent of the correlation matrix.

Matrix Plots - Columns. In this type of Matrix Plot, columns represent projections of individual data points onto the X-axis (showing the distribution of the values), arranged in a matrix format. Histograms representing the distribution of each variable are displayed along the diagonal of the matrix (in square matrices) or along the edges (in rectangular matrices).

Matrix Plots - Lines. In this type of Matrix Plot, a matrix of X-Y (i.e., nonsequential) line plots (similar to a scatterplot matrix) is produced, in which individual points are connected by a line in the order of their appearance in the data file. Histograms representing the distribution of each variable are displayed along the diagonal of the matrix (in square matrices) or along the edges (in rectangular matrices).

Matrix Plots - Scatterplot. In this type of Matrix Plot, 2D Scatterplots are arranged in a matrix format (values of the column variable are used as X coordinates, values of the row variable represent the Y coordinates). Histograms representing the distribution of each variable are displayed along the diagonal of the matrix (in square matrices) or along the edges (in rectangular matrices).

See also, Data Reduction.

Matrix Singularity. A rectangular matrix of values (e.g., a sums of squares and cross-products matrix) is singular if the elements in a column (or row) of the matrix are linearly dependent on the elements in one or more other columns (or rows) of the matrix. For example, if the elements in one column of a matrix are 1, -1, 0, and the elements in another column of the matrix are 2, -2, 0, then the matrix is singular because 2 times each of the elements in the first column is equal to each of the respective elements in the second column. Such matrices are also said to suffer from multicollinearity problems, since one or more columns are linearly related to each other.

A unique, regular matrix inverse cannot be computed for singular matrices, but generalized inverses (an infinite number of them) can be computed for any singular matrix.

See also matrix inverse.

Matrix Rank. The column (or row) rank of a rectangular matrix of values (e.g., a sums of squares and cross-products matrix) is equal to the number of linearly independent columns (or rows) of elements in the matrix. If there are no columns that are linearly dependent on other columns, then the rank of the matrix is equal to the number of its columns and the matrix is said to have full (column) rank. If the rank is less than the number of columns, the matrix is said to have reduced (column) rank and is singular.

See also matrix singularity.
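
The example matrix from the Matrix Singularity entry has reduced column rank, which numpy can confirm directly:

```python
import numpy as np

# The second column is 2 times the first, so only one column is
# linearly independent: the matrix has reduced (column) rank.
M = np.array([[ 1.0,  2.0],
              [-1.0, -2.0],
              [ 0.0,  0.0]])
print(np.linalg.matrix_rank(M))  # 1, rather than the full rank of 2
```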

Maximum Likelihood Loss Function. A common alternative to the least squares loss function is to maximize the likelihood or log-likelihood function (or to minimize the negative log-likelihood function; the term maximum likelihood was first used by Fisher, 1922a). These functions are typically used when fitting nonlinear models. In the most general terms, the likelihood function is defined as:

L = F(Y, Model) = Π(i=1..n) p[yi, Model Parameters(xi)]

Maximum Likelihood Method. The method of maximum likelihood (the term first used by Fisher, 1922a) is a general method of estimating parameters of a population by values that maximize the likelihood (L) of a sample. The likelihood L of a sample of n observations x1, x2, ..., xn, is the joint probability function p(x1, x2, ..., xn) when x1, x2, ..., xn are discrete random variables. If x1, x2, ..., xn are continuous random variables, then the likelihood L of a sample of n observations, x1, x2, ..., xn, is the joint density function f(x1, x2, ..., xn).

Let L be the likelihood of a sample, where L is a function of the parameters θ1, θ2, ..., θk. Then the maximum likelihood estimators of θ1, θ2, ..., θk are the values of θ1, θ2, ..., θk that maximize L.

Let θ be an element of Ω. If Ω is an open interval, and if L(θ) is differentiable and assumes a maximum on Ω, then the MLE will be a solution of the following equation: dL(θ)/dθ = 0. For more information, see Bain and Engelhardt (1989) and Neter, Wasserman, and Kutner (1989).

See also, Nonlinear Estimation or Variance Components and Mixed Model ANOVA/ANCOVA.
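
As a small illustration of the method (not tied to any particular module; the exponential model and simulated sample are hypothetical), the MLE can be found numerically by minimizing the negative log-likelihood; for the exponential distribution it agrees with the closed-form solution λ̂ = 1/x̄:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=1_000)   # true rate lambda = 0.5

def neg_log_likelihood(lam):
    # exponential density: f(x) = lam * exp(-lam * x)
    return -(len(x) * np.log(lam) - lam * x.sum())

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
print(result.x, 1.0 / x.mean())  # numeric MLE matches the closed form
```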

Maximum Unconfounding. Maximum unconfounding is an experimental design criterion that is subsidiary to the criterion of design resolution. The maximum unconfounding criterion specifies that design generators should be chosen such that the maximum number of interactions of less than or equal to the crucial order, given the resolution, are unconfounded with all other interactions of the crucial order. It is an alternative to the minimum aberration criterion for finding the "best" design of maximum resolution. For discussions of the role of design criteria in experimental design, see 2**(k-p) fractional factorial designs and 2**(k-p) Maximally Unconfounded and Minimum Aberration Designs.

MD (Missing data). Same as Missing values.

Mean. The mean is a particularly informative measure of the "central tendency" of the variable if it is reported along with its confidence intervals. Usually we are interested in statistics (such as the mean) from our sample only to the extent to which they are informative about the population. The larger the sample size, the more reliable its mean. The larger the variation of data values, the less reliable the mean (see also Elementary Concepts).

Mean = (Σxi)/n

where
n      is the sample size.

See also, Descriptive Statistics

Mean/S.D. An algorithm (used in neural networks) to assign linear scaling coefficients for a set of numbers. The mean and standard deviation of the set are found, and scaling factors selected so that these are mapped to desired mean and standard deviation values.

See also Neural Networks.

Mean Substitution of Missing Data. When you select Mean Substitution, the missing data will be replaced by the means for the respective variables during an analysis.

See also, Casewise vs. pairwise deletion of missing data

Median. A measure of central tendency, the median (the term first used by Galton, 1882) of a sample is the value for which one-half (50%) of the observations (when ranked) will lie above that value and one-half will lie below that value. When the number of values in the sample is even, the median is computed as the average of the two middle values.

See also, Descriptive Statistics.

Meta-Learning. The concept of meta-learning applies to the area of predictive data mining, where it denotes methods for combining the predictions from multiple models. It is particularly useful when the types of models included in the project are very different. In this context, this procedure is also referred to as Stacking (Stacked Generalization).

Suppose your data mining project includes tree classifiers, such as C&RT and CHAID, linear discriminant analysis (e.g., GDA), and Neural Networks. Each computes predicted classifications for a cross-validation sample, from which overall goodness-of-fit statistics (e.g., misclassification rates) can be computed. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method (e.g., see Witten and Frank, 2000). The predictions from different classifiers can be used as input into a meta-learner, which will attempt to combine the predictions to create a final best predicted classification. So, for example, the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s) can be used as input variables into a neural network meta-classifier, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification accuracy.

One can apply meta-learners to the results from different meta-learners to create "meta-meta"-learners, and so on; in practice, however, the exponential increase in the amount of data processing required to derive an accurate prediction yields less and less marginal utility.
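
A hedged sketch of stacking with scikit-learn (the library choice is an assumption made for illustration; the glossary describes the idea independently of any library): tree, discriminant, and neural-network base learners feed a neural-network meta-classifier.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
        ("lda", LinearDiscriminantAnalysis()),
        ("net", MLPClassifier(max_iter=2000, random_state=0)),
    ],
    final_estimator=MLPClassifier(max_iter=2000, random_state=0),
    cv=5,  # the meta-learner is trained on 5-fold cross-validated predictions
)
print(cross_val_score(stack, X, y, cv=3).mean())
```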

Minimax. An algorithm to assign linear scaling coefficients for a set of numbers. The minimum and maximum of the set are found, and scaling factors selected so that these are mapped to desired minimum and maximum values.

See also, Neural Networks.
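
Both this entry and the Mean/S.D. entry above describe affine scalings of a set of numbers; a minimal sketch of the two maps (the function names are hypothetical):

```python
import numpy as np

def minimax_scale(x, new_lo, new_hi):
    """Minimax: map [min(x), max(x)] onto [new_lo, new_hi]."""
    return new_lo + (x - x.min()) * (new_hi - new_lo) / (x.max() - x.min())

def meansd_scale(x, new_mean, new_sd):
    """Mean/S.D.: map the mean and (population) standard deviation of x
    onto the desired values (new_mean, new_sd)."""
    return new_mean + (x - x.mean()) * new_sd / x.std()

x = np.array([3.0, 7.0, 1.0, 9.0])
print(minimax_scale(x, 0.0, 1.0))  # [0.25, 0.75, 0.0, 1.0]
print(meansd_scale(x, 0.0, 1.0))   # zero mean, unit standard deviation
```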

Minimum Aberration. Minimum aberration is an experimental design criterion that is subsidiary to the criterion of design resolution. The minimum aberration design is defined as the design of maximum resolution "which minimizes the number of words in the defining relation that are of minimum length" (Fries & Hunter, 1980). Less technically, the criterion apparently operates by choosing design generators that produce the smallest number of pairs of confounded interactions of the crucial order. For example, the minimum aberration resolution IV design would have the minimum number of pairs of confounded 2-factor interactions. For discussions of the role of design criteria in experimental design see 2**(k-p) fractional factorial designs and 2**(k-p) Maximally Unconfounded and Minimum Aberration Designs.

Missing values. Values of variables within data sets that are not known. Although cases that contain missing data are incomplete, they can still be used in data analysis. Various methods exist to substitute missing data (e.g., mean substitution, various types of interpolation and extrapolation); alternatively, pairwise deletion of missing data can be used. See also, Pairwise deletion of missing data, Casewise (Listwise) deletion of missing data, Pairwise deletion of missing data vs. mean substitution, and Casewise vs. pairwise deletion of missing data.

Mode. A measure of central tendency, the mode (the term first used by Pearson, 1895) of a sample is the value that occurs most frequently in the sample.

See also, Descriptive Statistics.

Model Profiles (in Neural Networks). Model profiles are concise text strings indicating the architecture of networks and ensembles. A profile consists of a type code followed by a code giving the number of input and output variables and number of layers and units (networks) or members (ensembles). For time series networks, the number of steps and the lookahead factor are also given. The individual parts of the profile are:

Model Type. The codes are:

MLP       Multilayer Perceptron Network
RBF       Radial Basis Function Network
SOFM      Kohonen Self-Organizing Feature Map
Linear    Linear Network
PNN       Probabilistic Neural Network
GRNN      Generalized Regression Neural Network
PCA       Principal Components Network
Cluster   Cluster Network
Output    Output Ensemble
Conf      Confidence Ensemble

Network architecture. This is of the form I:N-N-N:O, where I is the number of input variables, O the number of output variables, and N the number of units in each layer.

Example. 2:4-6-3:1 indicates a network with 2 input variables, 1 output variable, 4 input neurons, 6 hidden neurons, and 3 output neurons.

For a time series network, the steps factor is prepended to the profile, and signified by an "s."

Example. s10 1:10-2-1:1 indicates a time series network with steps factor (lagged input) 10.

Ensemble architecture. This is of the form I:[N]:O, where I is the number of input variables, O the number of output variables, and N the number of members in the ensemble.
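
A small, hypothetical Python parser for network profiles of the form described above may clarify the notation (ensemble profiles with bracketed member counts are not handled by this sketch):

```python
import re

def parse_network_profile(profile):
    """Parse a profile such as 'MLP 2:4-6-3:1' or 's10 1:10-2-1:1'."""
    tokens = profile.strip().split()
    arch = tokens.pop()                  # the I:N-...-N:O part is always last
    type_code = steps = None
    for tok in tokens:
        if re.fullmatch(r"s\d+", tok):
            steps = int(tok[1:])         # time-series steps factor, e.g. "s10"
        else:
            type_code = tok              # model type code, e.g. "MLP", "RBF"
    n_inputs, layers, n_outputs = arch.split(":")
    return {
        "type": type_code,
        "steps": steps,
        "inputs": int(n_inputs),
        "units_per_layer": [int(u) for u in layers.split("-")],
        "outputs": int(n_outputs),
    }

print(parse_network_profile("MLP 2:4-6-3:1"))
print(parse_network_profile("s10 1:10-2-1:1"))
```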

Monte Carlo. A computer-intensive technique for assessing how a statistic will perform under repeated sampling. In Monte Carlo methods, the computer uses random number simulation techniques to mimic a statistical population. In the STATISTICA Monte Carlo procedure, the computer constructs the population according to the user's prescription; then, for each Monte Carlo replication, the computer:

  1. Simulates a random sample from the population,
  2. Analyzes the sample,
  3. Stores the results.
After many replications, the stored results will mimic the sampling distribution of the statistic. Monte Carlo techniques can provide information about sampling distributions when exact theory for the sampling distribution is not available.
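
A minimal sketch of this three-step loop (here estimating the sampling distribution of the median for a skewed, hypothetical exponential population):

```python
import numpy as np

rng = np.random.default_rng(3)
n_replications, sample_size = 10_000, 25

medians = np.empty(n_replications)
for r in range(n_replications):
    sample = rng.exponential(scale=1.0, size=sample_size)  # 1. simulate a sample
    medians[r] = np.median(sample)                         # 2. analyze it, 3. store the result
# The stored results mimic the sampling distribution of the median:
print(medians.mean(), medians.std())
```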

MPatt Bar. Multi-pattern bar plots may be used to represent individual data values of the X variable (the same type of data as in pie charts); however, consecutive data values of the X variable are represented by the heights of sequential vertical bars, each of a different color and pattern (rather than as pie wedges of different widths).

Multidimensional Scaling. Multidimensional scaling (MDS) can be considered to be an alternative to factor analysis (see Factor Analysis), and it is typically used as an exploratory method. In general, the goal of the analysis is to detect meaningful underlying dimensions that allow the researcher to explain observed similarities or dissimilarities (distances) between the investigated objects. In factor analysis, the similarities between objects (e.g., variables) are expressed in the correlation matrix. With MDS one may analyze not only correlation matrices but also any kind of similarity or dissimilarity matrix (including sets of measures that are not internally consistent, e.g., do not follow the rule of transitivity).

For more information, see the Multidimensional Scaling chapter.
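
As a hedged illustration (using scikit-learn's MDS, an assumption made for the example, with a small hypothetical dissimilarity matrix):

```python
import numpy as np
from sklearn.manifold import MDS

# A hypothetical 4x4 dissimilarity matrix (e.g., judged dissimilarities
# between four products); MDS recovers coordinates in 2 dimensions such
# that the inter-point distances approximate the dissimilarities.
D = np.array([[0.0, 1.0, 4.0, 5.0],
              [1.0, 0.0, 3.0, 4.0],
              [4.0, 3.0, 0.0, 1.5],
              [5.0, 4.0, 1.5, 0.0]])
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
print(coords)   # objects 1-2 and 3-4 form two nearby pairs
```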

Multilayer Perceptrons. Feedforward neural networks having linear PSP functions and (usually) non-linear activation functions.

Multimodal Distribution. A distribution that has multiple modes (thus two or more "peaks").

Multimodality of the distribution in a sample is often a strong indication that the distribution of the variable in the population is not normal. Multimodality of the distribution may provide important information about the nature of the investigated variable (i.e., the measured quantity). For example, if the variable represents a reported preference or attitude, then multimodality may indicate that there are several pronounced views or patterns of response in the questionnaire. Often, however, multimodality may indicate that the sample is not homogeneous and the observations in fact come from two or more "overlapping" distributions. Sometimes, multimodality of the distribution may indicate problems with the measurement instrument (e.g., "gage calibration problems" in the natural sciences, or "response biases" in the social sciences).

See also unimodal distribution, bimodal distribution.

Multinomial Distribution. The multinomial distribution arises when a response variable is categorical in nature, i.e., consists of data describing the membership of the respective cases in a particular category. For example, if a researcher recorded the outcome for the driver in accidents as "uninjured", "injury not requiring hospitalization", "injury requiring hospitalization", or "fatality", then the distribution of the counts in these categories would be multinomial (see Agresti, 1996). The multinomial distribution is a generalization of the binomial distribution to more than two categories.

If the categories for the response variable can be ordered, then the distribution of that variable is referred to as ordinal multinomial. For example, if in a survey the responses to a question are recorded such that respondents have to choose from the pre-arranged categories "Strongly agree", "Agree", "Neither agree nor disagree", "Disagree", and "Strongly disagree", then the counts (number of respondents) that endorsed the different categories would follow an ordinal multinomial distribution (since the response categories are ordered with respect to increasing degrees of disagreement).

Specialized methods for analyzing multinomial and ordinal multinomial response variables can be found in the Generalized Linear Models chapter.
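
For illustration, counts for the four accident-outcome categories in the example above can be simulated directly from a multinomial distribution (the category probabilities here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
p = [0.60, 0.25, 0.10, 0.05]                 # assumed probabilities of the four outcomes
counts = rng.multinomial(n=1_000, pvals=p)   # outcomes recorded for 1,000 accidents
print(counts, counts.sum())                  # the counts always total the fixed n
```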

Multinomial Logit and Probit Regression. The multinomial logit and probit regression models are extensions of the standard logit and probit regression models to the case where the dependent variable has more than two categories (e.g., not just Pass - Fail, but Pass, Fail, Withdrawn), i.e., when the dependent or response variable of interest follows a multinomial rather than a binomial distribution. When multinomial responses contain rank-order information, they are also called ordinal multinomial responses (see ordinal multinomial distribution).

For additional details, see also the discussion of Link Functions, Probit Transformation and Regression, Logit Transformation and Regression, or the Generalized Linear Models chapter.

Multiple Axes in Graphs. An arrangement of axes (coordinate scales) in graphs, where two or more axes are placed parallel to each other, in order to either:

  1. represent different units in which the variable(s) depicted in the graph can be measured (e.g., Celsius and Fahrenheit scales of temperature), or
  2. allow for a comparison of trends or shapes between several plots placed in one graph (e.g., one axis for each plot) that would otherwise be obscured by incompatible measurement units or ranges of values for each variable (this is an extension of the common "double-Y" type of graph).

The latter arrangement requires the appropriate plot legends to be attached to each axis.

Multiple Dichotomies. One possible coding scheme that can be used when more than one response is possible for a given question is to code the responses as multiple dichotomies. For example, as part of a larger market survey, suppose you asked a sample of consumers to name their three favorite soft drinks. The specific item on the questionnaire may look like this:

Write down your three favorite soft drinks:
1:__________    2:__________    3:__________

Suppose in the above example we were only interested in Coke, Pepsi, and Sprite. One way to code the data in that case would be as follows:

          COKE   PEPSI   SPRITE   . . .
case 1             1
case 2     1       1
case 3                      1
 . . .

In other words, one variable was created for each soft drink, then a value of 1 was entered into the respective variable whenever the respective drink was mentioned by the respective respondent. Note that each variable represents a dichotomy; that is, only "1"s and "not 1"s are allowed (we could have entered 1's and 0's, but to save typing we can also simply leave the 0's as blanks or as missing values). When tabulating these variables, we would like to compute the number and percent of respondents (and responses) for each soft drink. In a sense, we "compact" the three variables Coke, Pepsi, and Sprite into a single variable (Soft Drink) consisting of multiple dichotomies.

For more information on Multiple dichotomies, see the Multiple Response Tables section of the Basic Statistics chapter.

Multiple Histogram. Multiple histograms present frequency distributions of more than one variable in one 2D graph. Unlike the Double-Y Histograms, the frequencies for all variables are plotted against the same left-Y axis.

Also, the values of all examined variables are plotted against a single X-axis, which facilitates comparisons between analyzed variables.

Multiple R. The coefficient of multiple correlation (Multiple R) is the positive square root of R-square (the coefficient of multiple determination; see Residual Variance and R-Square). This statistic is useful in multiple regression (i.e., regression with multiple independent variables) when you want to describe the relationship between the variables.

Multiple Regression. The general purpose of multiple regression (the term was first used by Pearson, 1908) is to analyze the relationship between several independent or predictor variables and a dependent or criterion variable.

The computational problem that needs to be solved in multiple regression analysis is to fit a straight line (or, more generally, a hyperplane in an n-dimensional space, where n is the number of independent variables) to a number of points. In the simplest case -- one dependent and one independent variable -- one can visualize this in a scatterplot (a two-dimensional plot of the scores on a pair of variables). Multiple regression is used as either a hypothesis-testing or an exploratory method.

For more information, see the Multiple Regression chapter.

Multiple Response Variables. Coding the responses to Multiple response variables is necessary when more than one response is possible from a given question. For example, as part of a larger market survey, suppose you asked a sample of consumers to name their three favorite soft drinks. The specific item on the questionnaire may look like this:

Write down your three favorite soft drinks:
1:__________    2:__________    3:__________

Thus, the questionnaires returned to you will contain somewhere between 0 and 3 answers to this item. Also, a wide variety of soft drinks will most likely be named. One way to record the various responses would be to use three multiple response variables and a coding scheme for the many soft drinks. Then we could enter the respective codes (or alphanumeric labels) into the three variables, in the same way that respondents wrote them down in the questionnaire.

          Resp. 1   Resp. 2    Resp. 3
case 1    COKE      PEPSI      JOLT
case 2    SPRITE    SNAPPLE    DR. PEPPER
case 3    PERRIER   GATORADE   MOUNTAIN DEW
 . . .

For more information, see the Multiple Response Tables section of the Basic Statistics chapter.

Multiple-response Tables. Multiple-response tables are Crosstabulation tables used when the categories of interest are not mutually exclusive. Such tables can accommodate Multiple response variables as well as Multiple dichotomies.

For more information, see the Multiple Response Tables section of the Basic Statistics chapter.

Multiplicative Season, Damped Trend. In this Time Series model, the simple exponential smoothing forecasts are "enhanced" both by a damped trend component (independently smoothed with the single parameter φ; this model is an extension of Brown's one-parameter linear model, see Gardner, 1985, pp. 12-13) and a multiplicative seasonal component (smoothed with parameter δ). For example, suppose we wanted to forecast from month to month the number of households that purchase a particular consumer electronics device (e.g., VCR). Every year, the number of households that purchase a VCR will increase; however, this trend will be damped (i.e., the upward trend will slowly disappear) over time as the market becomes saturated. In addition, there will be a seasonal component, reflecting the seasonal changes in consumer demand for VCRs from month to month (demand will likely be smaller in the summer and greater during the December holidays). This seasonal component may be multiplicative; for example, sales during the December holidays may increase by a factor of 1.4 (or 40%) over the average annual sales. To compute the smoothed values for the first season, initial values for the seasonal components are necessary. Also, to compute the smoothed value (forecast) for the first observation in the series, both estimates of S0 and T0 (initial trend) are necessary. These values are computed as:

T0 = (1/φ)*(Mk - M1)/[(k-1)*p]

where
φ      is the trend (damping) smoothing parameter
k      is the number of complete seasonal cycles
Mk    is the mean for the last seasonal cycle
M1    is the mean for the first seasonal cycle
p      is the length of the seasonal cycle
and S0 = M1 - p*T0/2

Multiplicative Season, Exponential Trend. In this Time Series model, the simple exponential smoothing forecasts are "enhanced" both by an exponential trend component (independently smoothed with parameter γ) and a multiplicative seasonal component (smoothed with parameter δ). For example, suppose we wanted to forecast the monthly revenue for a resort area. Every year, revenue may increase by a certain percentage or factor, resulting in an exponential trend in overall revenue. In addition, there could be a multiplicative seasonal component; for example, each year 20% of the annual revenue may be produced during the month of December. That is, relative to the underlying trend, the revenue during Decembers grows by a particular (multiplicative) factor.

To compute the smoothed values for the first season, initial values for the seasonal components are necessary. Also, to compute the smoothed value (forecast) for the first observation in the series, both estimates of S0 and T0 (initial trend) are necessary. By default, these values are computed as:

T0 = exp{[log(M2)-log(M1)]/p}

where
M2    is the mean for the second seasonal cycle
M1    is the mean for the first seasonal cycle
p       is the length of the seasonal cycle
and S0 = exp{log(M1)-p*log(T0)/2}

Multiplicative Season, Linear Trend. In this Time Series model, the simple exponential smoothing forecasts are "enhanced" both by a linear trend component (independently smoothed with parameter γ) and a multiplicative seasonal component (smoothed with parameter δ). For example, suppose we were to predict the monthly budget for snow removal in a community. There may be a trend component (as the community grows, there is an upward trend for the cost of snow removal from year to year). At the same time, there is obviously a seasonal component, reflecting the differential likelihood of snow during different months of the year. This seasonal component could be multiplicative, meaning that given a respective budget figure, it may increase by a factor of, for example, 1.4 during particular winter months; or it may be additive (see above), that is, a particular fixed additional amount of money is necessary during the winter months.

To compute the smoothed values for the first season, initial values for the seasonal components are necessary. Also, to compute the smoothed value (forecast) for the first observation in the series, both estimates of S0 and T0 (initial trend) are necessary. By default, these values are computed as:

T0 = (Mk-M1)/((k-1)*p)

where
k       is the number of complete seasonal cycles
Mk    is the mean for the last seasonal cycle
M1    is the mean for the first seasonal cycle
p       is the length of the seasonal cycle
and S0 = M1 - T0/2

Multiplicative Season, No Trend. This Time Series model is partially equivalent to the simple exponential smoothing model; however, in addition, each forecast is "enhanced" by a multiplicative seasonal component that is smoothed independently (see the seasonal smoothing parameter δ). This model would, for example, be adequate when computing forecasts for monthly expected sales for a particular toy. The level of sales may be stable from year to year, or change only slowly; at the same time, there will be seasonal changes (e.g., greater sales during the December holidays), which again may change slowly from year to year. The seasonal changes may affect the sales in a multiplicative fashion; for example, depending on the respective overall level of sales, December sales may always be greater by a factor of 1.4.
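
A simplified sketch of this model's recursions (illustrative only, with an arbitrary initialization; STATISTICA's exact initialization options differ): the level S and the seasonal indices are smoothed separately, and each one-step-ahead forecast is the product of the two.

```python
import numpy as np

def multiplicative_season_no_trend(y, p, alpha=0.3, delta=0.3):
    """One-step-ahead forecasts for a 'multiplicative season, no trend'
    model. Initial seasonal indices are simply the first cycle divided
    by its mean (a simplifying assumption)."""
    season = list(y[:p] / np.mean(y[:p]))   # initial seasonal indices
    level = np.mean(y[:p])                  # initial level S0
    forecasts = []
    for t in range(p, len(y)):
        index = season[t - p]               # index from one full cycle ago
        forecasts.append(level * index)     # forecast made before seeing y[t]
        level = alpha * (y[t] / index) + (1 - alpha) * level
        season.append(delta * (y[t] / level) + (1 - delta) * index)
    return np.array(forecasts)

# Three years of monthly data with a December spike (factor ~1.4):
y = np.tile([100.0] * 11 + [140.0], 3)
print(multiplicative_season_no_trend(y, p=12)[-12:])
```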





