The multiple imputation FAQ page

This page provides basic information about multiple imputation (MI) in the form of answers to Frequently Asked Questions (FAQs). For more extensive, non-technical overviews of MI, see the articles by Schafer and Olsen (1998) and Schafer (1999). Answers to FAQs regarding MI in large public-use data sets (e.g. from surveys and censuses) are given by Rubin (1996).

What is multiple imputation?

Imputation, the practice of 'filling in' missing data with plausible values, is an attractive approach to analyzing incomplete data. It apparently solves the missing-data problem at the beginning of the analysis. However, a naive or unprincipled imputation method may create more problems than it solves, distorting estimates, standard errors and hypothesis tests, as documented by Little and Rubin (1987) and others.

The question of how to obtain valid inferences from imputed data was addressed by Rubin's (1987) book on multiple imputation (MI). MI is a Monte Carlo technique in which the missing values are replaced by m>1 simulated versions, where m is typically small (e.g. 3-10). In Rubin's method for 'repeated imputation' inference, each of the simulated complete datasets is analyzed by standard methods, and the results are combined to produce estimates and confidence intervals that incorporate missing-data uncertainty. Rubin (1987) addresses potential uses of MI primarily for large public-use data files from sample surveys and censuses. With the advent of new computational methods and software for creating MI's, however, the technique has become increasingly attractive for researchers in the biomedical, behavioral, and social sciences whose investigations are hindered by missing data. These methods are documented in a recent book by Schafer (1997) on incomplete multivariate data.

Is MI the only principled way to handle missing data?

MI is not the only principled method for handling missing values, nor is it necessarily the best for any given problem. In some cases, good estimates can be obtained through weighted estimation procedures. In fully parametric models, maximum-likelihood estimates can often be calculated directly from the incomplete data by specialized numerical methods, such as the EM algorithm. Those procedures may be somewhat more efficient than MI because they involve no simulation. Given sufficient time and resources, one could perhaps derive a better statistical procedure than MI for any particular problem. In real-life applications, however, where missing data are a nuisance rather than the primary focus, an easy, approximate solution with good properties can be preferable to one that is more efficient but problem-specific and complicated to implement.

Why are only a few imputations needed?

Many are surprised by the claim that only 3-10 imputations may be needed. Rubin (1987, p. 114) shows that the efficiency of an estimate based on m imputations is approximately

(1 + γ/m)^(-1),

where γ is the rate of missing information for the quantity being estimated. The efficiencies (in percent) achieved for various values of m and rates of missing information are shown below.

          γ = 0.1    0.3    0.5    0.7    0.9
  m = 3     96.8    90.9   85.7   81.1   76.9
  m = 5     98.0    94.3   90.9   87.7   84.7
  m = 10    99.0    97.1   95.2   93.5   91.7
  m = 20    99.5    98.5   97.6   96.6   95.7

Unless the rate of missing information is very high, there is simply little advantage to producing and analyzing more than a few imputed datasets.
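Rubin's approximation for the efficiency of an estimate based on m imputations is (1 + γ/m)^(-1), where γ is the rate of missing information. The following minimal sketch (the function name is ours) tabulates these efficiencies:

```python
# Relative efficiency (1 + gamma/m)^(-1) of an MI point estimate based on
# m imputations, where gamma is the rate of missing information
# (Rubin, 1987, p. 114).
def mi_efficiency(m, gamma):
    return 1.0 / (1.0 + gamma / m)

# Tabulate efficiencies (in percent) for typical m and rates of
# missing information.
for m in (3, 5, 10, 20):
    row = ["%5.1f" % (100 * mi_efficiency(m, g)) for g in (0.1, 0.3, 0.5, 0.7, 0.9)]
    print("m = %2d:" % m, " ".join(row))
```

Even with 90% missing information, five imputations already achieve about 85% efficiency.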

See the question 'What is the rate of missing information?' below for how this rate is estimated.

How does one create multiple imputations?

Except in trivial settings, the probability distributions that one must draw from to produce proper MI's tend to be complicated and intractable. Recently, however, a variety of new simulation methods have appeared in the statistical literature. These methods, known as Markov chain Monte Carlo (MCMC), have spawned a small revolution in Bayesian analysis and applied parametric modeling (Gilks, Richardson & Spiegelhalter, 1996). Schafer (1997) has adapted and implemented MCMC methods for the purpose of multiple imputation. In particular, he has written general-purpose MI software for incomplete multivariate data. These programs may be downloaded free of charge from our website.
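To illustrate how MCMC produces imputations, the following sketch runs a simple data-augmentation chain for a univariate normal sample with values missing completely at random. This is a toy model for exposition, not the NORM software itself; the data and variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: a normal sample with six entries set to missing (NaN).
y = rng.normal(5.0, 1.5, size=30)
y[rng.choice(30, size=6, replace=False)] = np.nan
mis = np.isnan(y)
n = len(y)

# Data augmentation: alternate an imputation step (I-step) and a posterior
# step (P-step); the chain's stationary distribution is the joint posterior
# of the parameters and the missing data.
y_comp = y.copy()
y_comp[mis] = np.nanmean(y)          # crude starting values
imputations = []
for t in range(1, 501):
    # P-step: draw (sigma^2, mu) from their posterior given the completed data
    # (standard noninformative prior).
    sigma2 = (n - 1) * y_comp.var(ddof=1) / rng.chisquare(n - 1)
    mu = rng.normal(y_comp.mean(), np.sqrt(sigma2 / n))
    # I-step: redraw the missing values from their predictive distribution.
    y_comp[mis] = rng.normal(mu, np.sqrt(sigma2), size=mis.sum())
    # Keep every 100th completed dataset as one imputation; the spacing
    # makes the retained draws nearly independent.
    if t % 100 == 0:
        imputations.append(y_comp.copy())
```

The five retained datasets can then be analyzed and combined by the rules described later on this page.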

The imputation model

In order to generate imputations for the missing values, one must impose a probability model on the complete data (observed and missing values). Each of our software packages applies a different class of multivariate complete-data models. NORM uses the multivariate normal distribution. CAT is based on loglinear models, which have been traditionally used by social scientists to describe associations among variables in cross-classified data. The MIX program relies on the general location model, which combines a loglinear model for the categorical variables with a multivariate normal regression for the continuous ones. Details of these models are given by Schafer (1997). The newer package PAN uses a multivariate extension of a popular two-level linear regression model commonly applied to multilevel data (e.g. Bryk & Raudenbush, 1992). The PAN model is appropriate for describing multiple variables collected on a sample of individuals over time, or multiple variables collected on individuals who are grouped together into larger units (e.g. students within classrooms).

What if the imputation model is wrong?

Experienced analysts know that real data rarely conform to convenient models such as the multivariate normal. In most applications of MI, the model used to generate the imputations will at best be only approximately true. Fortunately, experience has repeatedly shown that MI tends to be quite forgiving of departures from the imputation model. For example, when working with binary or ordered categorical variables, it is often acceptable to impute under a normality assumption and then round off the continuous imputed values to the nearest category. Variables whose distributions are heavily skewed may be transformed (e.g. by taking logarithms) to approximate normality and then transformed back to their original scale after imputation.
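For instance, a skewed variable can be imputed on the log scale and transformed back. The sketch below uses made-up data; for brevity it conditions on point estimates of the parameters, whereas a proper MI procedure would also draw them from their posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed variable with two missing values (NaN).
y = np.array([1.2, 3.5, np.nan, 0.8, 7.1, np.nan, 2.4, 5.9])
mis = np.isnan(y)

# Impute on the log scale, where approximate normality is more plausible.
log_obs = np.log(y[~mis])
mu, sigma = log_obs.mean(), log_obs.std(ddof=1)

# Draw imputations on the log scale and transform back to the original scale,
# which guarantees positive imputed values.
y_completed = y.copy()
y_completed[mis] = np.exp(rng.normal(mu, sigma, size=mis.sum()))
```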

What is the relationship between the model used for imputation and the model used for analysis?

An imputation model should be chosen to be (at least approximately) compatible with the analyses to be performed on the imputed datasets. The imputation model should be rich enough to preserve the associations or relationships among variables that will be the focus of later investigation. For example, suppose that a variable Y is imputed under a normal model that includes the variable X1. After imputation, the analyst then uses linear regression to predict Y from X1 and another variable X2 which was not in the imputation model. The estimated coefficient for X2 from this regression would tend to be biased toward zero, because Y has been imputed without regard for its possible relationship with X2. In general, any association that may prove important in subsequent analyses should be present in the imputation model.

The converse of this rule, however, is not necessary. If Y has been imputed under a model that includes X2, there is no need to include X2 in future analyses involving Y unless its relationship to Y is of substantive interest. Results pertaining to Y are not biased by the inclusion of extra variables in the imputation phase. Therefore, a rich imputation model that preserves a large number of associations is desirable because it may be used for a variety of post-imputation analyses.

Detailed discussion on the interrelationships between the model used for imputation and the model used for analysis is given by Meng (1995) and Rubin (1996).

Above all, the processes of imputation and analysis should be guided by common sense. For example, suppose that variables with skewed, truncated, or heavy-tailed distributions are, for the sake of convenience, imputed under an assumption of joint normality. Analyses that depend primarily on means, variances, and covariances, such as regression or principal-component methods, should perform reasonably well even though the imputer's model is too simplistic. On the other hand, common sense would suggest that the same imputations ought not be used for estimation of 5th or 95th percentiles, or other analyses sensitive to non-normal shape.

How do I combine the results across the multiply imputed sets of data?

Rubin (1987) presented the following method for combining results from a data analysis performed m times, once for each of m imputed data sets, to obtain a single set of results.

From each analysis, one must first calculate and save the estimates and standard errors. Suppose that Q̂_j is an estimate of a scalar quantity of interest (e.g. a regression coefficient) obtained from data set j (j=1,2,...,m) and U_j is the variance (squared standard error) associated with Q̂_j. The overall estimate is the average of the individual estimates,

Q̄ = (1/m) Σ_{j=1}^{m} Q̂_j.

For the overall standard error, one must first calculate the within-imputation variance,

Ū = (1/m) Σ_{j=1}^{m} U_j,

and the between-imputation variance,

B = (1/(m−1)) Σ_{j=1}^{m} (Q̂_j − Q̄)².

The total variance is

T = Ū + (1 + 1/m) B.

The overall standard error is the square root of T. Confidence intervals are obtained by taking the overall estimate plus or minus a number of standard errors, where that number is a quantile of Student's t-distribution with degrees of freedom

ν = (m−1) [1 + Ū / ((1 + 1/m) B)]².

A significance test of the null hypothesis Q=0 is performed by comparing the ratio Q̄ / √T to the same t-distribution. Additional methods for combining the results from multiply imputed data are reviewed by Schafer (1997, Ch. 4).
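Rubin's combining rules can be sketched in a few lines of Python. The function name is ours, not any package's API; it takes the m estimates and their standard errors and returns the overall estimate, its standard error, and the degrees of freedom:

```python
import numpy as np

def pool_mi(estimates, std_errors):
    """Combine m complete-data results by Rubin's rules.

    Returns the overall estimate, its standard error, and the degrees of
    freedom of the reference t-distribution.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(std_errors, dtype=float) ** 2   # within-imputation variances U_j
    m = len(q)
    q_bar = q.mean()                               # overall estimate
    u_bar = u.mean()                               # within-imputation variance
    b = q.var(ddof=1)                              # between-imputation variance
    t_var = u_bar + (1 + 1 / m) * b                # total variance T
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, float(np.sqrt(t_var)), df

# Example: three imputed-data analyses of one regression coefficient.
est, se, df = pool_mi([1.1, 0.9, 1.0], [0.20, 0.25, 0.22])
# A 95% interval is est +/- t_{df, 0.975} * se; the test of Q = 0
# compares est/se to the same t-distribution.
```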

What is the rate of missing information?

When performing a multiply-imputed analysis, the variation in results across the imputed data sets reflects statistical uncertainty due to missing data. Rubin's (1987) rules for MI inference provide some diagnostic measures that indicate how strongly the quantity being estimated is influenced by missing data. The estimated rate of missing information is approximately

γ̂ = (r + 2/(ν + 3)) / (r + 1),

where

r = (1 + 1/m) B / Ū

is the relative increase in variance due to nonresponse, and ν is the degrees of freedom defined in the combining rules above.

The rate of missing information, together with the number of imputations m, determines the relative efficiency of the MI inference; see Why are only a few imputations needed?
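Using the same within- and between-imputation variances as in the combining rules, the estimated rate of missing information γ̂ = (r + 2/(ν+3))/(r+1), with r = (1 + 1/m)B/Ū, can be computed as follows (a sketch; the function name is ours):

```python
import numpy as np

def missing_info_rate(estimates, std_errors):
    """Estimated rate of missing information from m sets of MI results."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(std_errors, dtype=float) ** 2   # within-imputation variances
    m = len(q)
    u_bar = u.mean()
    b = q.var(ddof=1)                              # between-imputation variance
    r = (1 + 1 / m) * b / u_bar                    # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2                # MI degrees of freedom
    gamma = (r + 2 / (df + 3)) / (r + 1)           # estimated rate gamma-hat
    return gamma, r

gamma, r = missing_info_rate([1.1, 0.9, 1.0], [0.20, 0.25, 0.22])
```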

Is multiple imputation a Bayesian procedure?

Partly yes and partly no. When imputations are created under Bayesian arguments (and they usually are), MI has a natural interpretation as an approximate Bayesian inference for the quantities of interest based on the observed data. The validity of MI, however, does not require one to fully subscribe to the Bayesian paradigm. Rubin (1987) provides technical conditions under which MI leads to frequency-valid answers. An imputation method which satisfies these conditions is said to be 'proper.'

Rubin's definition of 'proper', like many frequentist criteria, is useful for evaluating the properties of a given method but provides little guidance for one seeking to create such a method in practice. For this reason, Rubin recommends that imputations be created through a Bayesian process: specify a parametric model for the complete data (and, if necessary, a model for the mechanism by which data become missing), apply a prior distribution to the unknown model parameters, and simulate m independent draws from the conditional distribution of the missing data given the observed data by Bayes' Theorem. In simple problems, the computations necessary for creating MI's can be performed explicitly through formulas. In non-trivial applications, special computational techniques such as Markov chain Monte Carlo (MCMC) must be applied.
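In the simplest case, a normal sample with values missing completely at random, the Bayesian draws can be written explicitly. The sketch below uses a standard noninformative prior; the data and names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample: 20 observed values, 4 missing (MCAR).
y_obs = rng.normal(10.0, 2.0, size=20)
n = len(y_obs)
n_mis = 4

m = 5
imputations = []
for _ in range(m):
    # Draw sigma^2 and then mu from their joint posterior given the observed
    # data (noninformative prior p(mu, sigma^2) proportional to 1/sigma^2).
    sigma2 = (n - 1) * y_obs.var(ddof=1) / rng.chisquare(n - 1)
    mu = rng.normal(y_obs.mean(), np.sqrt(sigma2 / n))
    # Draw the missing values from their predictive distribution.
    imputations.append(rng.normal(mu, np.sqrt(sigma2), size=n_mis))
```

Because the parameters are redrawn for each imputation, the variability across the m completed datasets reflects both parameter uncertainty and sampling noise, which is what makes the imputations 'proper'.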

Removing incomplete cases is so much easier than multiple imputation; why can't I just do that?

The shortcomings of various case-deletion strategies have been well documented (e.g. Little & Rubin, 1987). If the discarded cases form a representative and relatively small portion of the entire dataset, then case deletion may indeed be a reasonable approach. However, case deletion leads to valid inferences in general only when missing data are missing completely at random in the sense that the probabilities of response do not depend on any data values observed or missing. In other words, case deletion implicitly assumes that the discarded cases are like a random subsample. When the discarded cases differ systematically from the rest, estimates may be seriously biased. Moreover, in multivariate problems, case deletion often results in a large portion of the data being discarded and an unacceptable loss of power.

Why can't I just impute once?

If the proportion of missing values is small, then single imputation may be quite reasonable. Without special corrective measures, single-imputation inference tends to overstate precision because it omits the between-imputation component of variability. When the fraction of missing information is small (say, less than 5%) then single-imputation inferences for a scalar estimand may be fairly accurate. For joint inferences about multiple parameters, however, even small rates of missing information may seriously impair a single-imputation procedure. In modern computing environments, the effort required to produce and analyze a multiply-imputed dataset is often not substantially greater than what is required for good single imputation.

Is multiple imputation like EM?

MI bears a close resemblance to the EM algorithm and other computational methods for calculating maximum-likelihood estimates based on the observed data alone. These methods summarize a likelihood function which has been averaged over a predictive distribution for the missing values. MI performs this same type of averaging by Monte Carlo rather than by numerical methods. In large samples, when relevant aspects of the imputer's and analyst's models agree, inferences obtained by MI with sufficiently many imputations will be nearly the same as those obtained by direct maximization of the likelihood.

Is multiple imputation related to MCMC?

Markov chain Monte Carlo (MCMC) is a collection of methods for simulating random draws from nonstandard distributions via Markov chains. MCMC is one of the primary methods for generating MI's in nontrivial problems. In much of the existing literature on MCMC (e.g. Gilks, Richardson & Spiegelhalter, 1996, and their references) MCMC is used for parameter simulation, for creating a large number of (typically dependent) random draws of parameters from Bayesian posterior distributions under complicated parametric models. In MI-related applications, however, MCMC is used to create a small number of independent draws of the missing data from a predictive distribution, and these draws are then used for multiple-imputation inference. In many cases it is possible to conduct an analysis either by parameter simulation or by multiple imputation. Parameter simulation tends to work well when interest is confined to a small number of well-defined parameters, whereas multiple imputation is more attractive for exploratory or multi-purpose analyses involving a large number of estimands. Generating and storing 10 versions of the missing data is often more efficient than generating and storing the hundreds or thousands of dependent draws that would be required to achieve a comparable degree of precision through parameter simulation.

What if the missing data are not 'missing at random'?

Most of the techniques presently available for creating MI's assume that the missing values are 'missing at random' (MAR) in the sense defined by Rubin (1976) and Little and Rubin (1987). That is, they assume that missing data values carry no information about probabilities of missingness. This assumption is mathematically convenient because it allows one to eschew an explicit probability model for nonresponse. In some applications, however, ignorability may seem artificial or implausible. With attrition in a longitudinal study, for example, it is possible that subjects drop out for reasons related to current data values. It is important to note that the MI paradigm does not require or assume that nonresponse is ignorable. Imputations may in principle be created under any kind of assumptions or model for the missing-data mechanism, and the resulting inferences will be valid under that mechanism.

Isn't multiple imputation just making up data?

When MI is presented to a new audience, some may view it as a kind of statistical alchemy in which information is somehow invented or created out of nothing. This objection is quite valid for single-imputation methods, which treat imputed values no differently from observed ones. MI, however, is nothing more than a device for representing missing-data uncertainty. Information is not being invented with MI any more than with EM or other well accepted likelihood-based methods, which average over a predictive distribution for the missing data by numerical techniques rather than by simulation.

References

Bryk, A.S. and Raudenbush, S.W. (1992) Hierarchical Linear Models. Sage, Newbury Park.

Gilks, W.R., Richardson, S. & Spiegelhalter, D.J. (Eds.) (1996) Markov Chain Monte Carlo in Practice. Chapman & Hall, London.

Little, R.J.A. and Rubin, D.B. (1987) Statistical Analysis with Missing Data. J. Wiley & Sons, New York.

Meng, X.L. (1995) Multiple-imputation inferences with uncongenial sources of input (with discussion). Statistical Science, 10, 538-573.

Rubin, D.B. (1976) Inference and missing data. Biometrika, 63, 581-592.

Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys. J. Wiley & Sons, New York.

Rubin, D.B. (1996) Multiple imputation after 18+ years (with discussion). Journal of the American Statistical Association, 91, 473-489.

Schafer, J.L. (1997) Analysis of Incomplete Multivariate Data. Chapman & Hall, London.

Schafer, J.L. (1999) Multiple imputation: a primer. Statistical Methods in Medical Research, in press.

Schafer, J.L. and Olsen, M.K. (1998) Multiple imputation for multivariate missing-data problems: a data analyst's perspective. Multivariate Behavioral Research, 33, 545-571.