- What is multiple imputation?
- Is MI the only principled way to handle missing data?
- Why are only a few imputations needed?
- How does one create multiple imputations?
- The imputation model
- What if the imputation model is wrong?
- What is the relationship between the model used for imputation and the model used for analysis?
- How do I combine the results across the multiply imputed sets of data?
- What is the rate of missing information?
- Is multiple imputation a Bayesian procedure?
- Removing incomplete cases is so much easier than multiple imputation; why can't I just do that?
- Why can't I just impute once?
- Is multiple imputation like EM?
- Is multiple imputation related to MCMC?
- What if the missing data are not 'missing at random'?
- Isn't multiple imputation just making up data?
- References

The question of how to obtain valid inferences from imputed data was
addressed by Rubin's (1987) book on multiple
imputation (MI). MI is a Monte Carlo technique in which the missing
values are replaced by *m*>1 simulated versions, where
*m* is typically small (e.g. 3-10). In Rubin's method for
'repeated imputation' inference, each of the simulated complete
datasets is analyzed by standard methods, and the results are combined
to produce estimates and confidence intervals that incorporate
missing-data uncertainty. Rubin (1987) addresses
potential uses of MI primarily for large public-use data files from
sample surveys and censuses. With the advent of new computational
methods and software for creating MI's, however, the technique has
become increasingly attractive for researchers in the biomedical,
behavioral, and social sciences whose investigations are hindered by
missing data. These methods are documented in a recent book by Schafer (1997) on incomplete multivariate data.

The efficiency of an estimate based on *m* imputations is approximately

$$\left(1 + \frac{\gamma}{m}\right)^{-1},$$

where $\gamma$ is the rate of missing information for the quantity being estimated. The percent efficiencies achieved for various values of *m* and rates of missing information are shown below.

| *m* | $\gamma=0.1$ | $\gamma=0.3$ | $\gamma=0.5$ | $\gamma=0.7$ | $\gamma=0.9$ |
|-----|------|------|------|------|------|
| 3   | 97   | 91   | 86   | 81   | 77   |
| 5   | 98   | 94   | 91   | 88   | 85   |
| 10  | 99   | 97   | 95   | 93   | 92   |
| 20  | 100  | 99   | 98   | 97   | 96   |
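As a quick check, the relative efficiency $(1+\gamma/m)^{-1}$ can be tabulated directly; a minimal Python sketch (the function name is my own):

```python
# Relative efficiency of an MI estimate based on m imputations,
# (1 + gamma/m)^(-1), where gamma is the rate of missing information.

def relative_efficiency(m, gamma):
    """Efficiency relative to an estimate based on infinitely many imputations."""
    return 1.0 / (1.0 + gamma / m)

for m in (3, 5, 10, 20):
    row = "  ".join(f"gamma={g}: {relative_efficiency(m, g):.3f}"
                    for g in (0.1, 0.3, 0.5, 0.7, 0.9))
    print(f"m={m:2d}  {row}")
```

Even with half the information missing ($\gamma=0.5$), five imputations already achieve about 91% efficiency.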

Unless the rate of missing information is very high, there is simply little advantage to producing and analyzing more than a few imputed datasets.

See 'What is the rate of missing information?' below for how this rate is estimated.

The converse of this rule, however, does not hold. If *Y*
has been imputed under a model that includes *X2*, there is no
need to include *X2* in future analyses involving *Y*
unless its relationship to *Y* is of substantive
interest. Results pertaining to *Y* are not biased by the
inclusion of extra variables in the imputation phase. Therefore, a
rich imputation model that preserves a large number of associations is
desirable because it may be used for a variety of post-imputation
analyses.

Detailed discussion on the interrelationships between the model used for imputation and the model used for analysis is given by Meng (1995) and Rubin (1996).

Above all, the processes of imputation and analysis should be guided by common sense. For example, suppose that variables with skewed, truncated, or heavy-tailed distributions are, for the sake of convenience, imputed under an assumption of joint normality. Analyses that depend primarily on means, variances, and covariances, such as regression or principal-component methods, should perform reasonably well even though the imputer's model is too simplistic. On the other hand, common sense would suggest that the same imputations ought not be used for estimation of 5th or 95th percentiles, or other analyses sensitive to non-normal shape.
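A small simulation (my own construction, not from the source) illustrates the point: skewed data imputed under a normal model preserve the mean well but distort the upper tail.

```python
import random

random.seed(42)

# Skewed population: exponential with mean 1 (95th percentile ~ 3.0).
complete = [random.expovariate(1.0) for _ in range(10000)]

# Delete roughly 30% of the values completely at random.
observed = [x for x in complete if random.random() > 0.3]

# Impute the deleted values from a fitted normal model.
n = len(observed)
mean = sum(observed) / n
sd = (sum((x - mean) ** 2 for x in observed) / (n - 1)) ** 0.5
n_miss = len(complete) - n
imputed = observed + [random.gauss(mean, sd) for _ in range(n_miss)]

def q95(xs):
    """Simple empirical 95th percentile."""
    return sorted(xs)[int(0.95 * len(xs))]

# The mean survives, but the normal imputations tend to pull the upper
# tail toward the centre and produce impossible negative values.
print(f"mean:     complete {sum(complete)/len(complete):.3f}, "
      f"imputed {sum(imputed)/len(imputed):.3f}")
print(f"95th pct: complete {q95(complete):.3f}, imputed {q95(imputed):.3f}")
```

Mean-based analyses of the imputed data behave reasonably; tail quantiles do not.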

From each analysis, one must first calculate and save the estimates
and standard errors. Suppose that $\hat{Q}_j$
is an estimate of a scalar quantity of interest (e.g. a regression
coefficient) obtained from data set *j*
(*j*=1,2,...,*m*) and $\sqrt{U_j}$ is the
standard error associated with $\hat{Q}_j$. The
overall estimate is the average of the individual estimates,

$$\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j.$$

For the overall standard error, one must first calculate the within-imputation variance,

$$\bar{U} = \frac{1}{m}\sum_{j=1}^{m} U_j,$$

and the between-imputation variance,

$$B = \frac{1}{m-1}\sum_{j=1}^{m}\left(\hat{Q}_j - \bar{Q}\right)^2.$$

The total variance is

$$T = \bar{U} + \left(1 + \frac{1}{m}\right)B.$$

The overall standard error is the square root of *T*. Confidence
intervals are obtained by taking the overall estimate plus or minus a
number of standard errors, where that number is a quantile of
Student's t-distribution with degrees of freedom

$$\nu = (m-1)\left[1 + \frac{\bar{U}}{\left(1+m^{-1}\right)B}\right]^2.$$

A significance test of the null hypothesis *Q*=0 is performed
by comparing the ratio $\bar{Q}\big/\sqrt{T}$ to the same
t-distribution.
Additional methods for combining the results from multiply imputed
data are reviewed by Schafer (1997, Ch. 4).
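The combining rules above can be written as a short function; the sketch below is illustrative (the function name and interface are my own) and assumes the between-imputation variance is nonzero.

```python
import math

def pool(estimates, std_errors):
    """Combine estimates from m imputed datasets by Rubin's (1987) rules.

    estimates  : per-dataset point estimates Q_hat_j
    std_errors : per-dataset standard errors sqrt(U_j)
    Returns (overall estimate, overall standard error, degrees of freedom).
    """
    m = len(estimates)
    qbar = sum(estimates) / m                              # overall estimate
    ubar = sum(se ** 2 for se in std_errors) / m           # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    t = ubar + (1 + 1 / m) * b                             # total variance
    nu = (m - 1) * (1 + ubar / ((1 + 1 / m) * b)) ** 2     # degrees of freedom
    return qbar, math.sqrt(t), nu

# Example: a regression coefficient estimated from m = 3 imputed datasets.
qbar, se, nu = pool([1.1, 0.9, 1.0], [0.20, 0.25, 0.22])
print(f"estimate {qbar:.3f}, SE {se:.3f}, df {nu:.1f}")
```

Note that the pooled standard error exceeds the average per-dataset standard error whenever the estimates vary across imputations; that excess is the missing-data uncertainty.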

The rate of missing information for a quantity estimated by MI is approximately

$$\hat{\gamma} = \frac{r + 2/(\nu+3)}{r+1},$$

where $\nu$ is the degrees of freedom of the t-reference distribution and

$$r = \frac{\left(1+m^{-1}\right)B}{\bar{U}}$$

is the relative increase in variance due to nonresponse.

The rate of missing information, together with the number of
imputations *m*, determines the relative efficiency of the MI
inference; see 'Why are only a few imputations needed?' above.
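As an illustrative sketch (names are my own), the estimated rate of missing information follows directly from the within- and between-imputation variances:

```python
def missing_info_rate(ubar, b, m):
    """Estimated rate of missing information from Rubin's (1987) quantities.

    ubar : within-imputation variance
    b    : between-imputation variance (assumed > 0)
    m    : number of imputations
    """
    r = (1 + 1 / m) * b / ubar        # relative increase in variance due to nonresponse
    nu = (m - 1) * (1 + 1 / r) ** 2   # degrees of freedom of the t reference
    return (r + 2 / (nu + 3)) / (r + 1)

# Little variation across imputations implies little missing information.
print(missing_info_rate(1.0, 0.01, 5))
# Between-imputation variance comparable to ubar implies a high rate.
print(missing_info_rate(1.0, 1.0, 5))
```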

Rubin's definition of 'proper', like many frequentist criteria, is
useful for evaluating the properties of a given method but provides
little guidance for one seeking to create such a method in practice.
For this reason, Rubin recommends that imputations be created
through a Bayesian process: specify a parametric model for the
complete data (and, if necessary, a model for the mechanism by which
data become missing), apply a prior distribution to the unknown model
parameters, and simulate *m* independent draws from the conditional
distribution of the missing data given the observed data by Bayes'
Theorem. In simple problems, the computations necessary for creating
MI's can be performed explicitly through formulas. In non-trivial
applications, special computational techniques such as Markov
chain Monte Carlo (MCMC) must be applied.
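As an illustration only (not from the source), the Bayesian recipe can be carried out explicitly in the simplest case: a single normal variable with ignorable nonresponse and the standard noninformative prior $p(\mu,\sigma^2)\propto 1/\sigma^2$. The sketch below draws the parameters from their posterior before drawing the missing values; reflecting parameter uncertainty in this way is what makes the imputations proper rather than fixed at point estimates.

```python
import random

def proper_impute(observed, n_missing, m):
    """Draw m sets of proper imputations for a normal variable.

    Assumes ignorable nonresponse and the noninformative prior
    p(mu, sigma^2) proportional to 1/sigma^2; standard conjugate results
    give the posterior draws in closed form.
    """
    n = len(observed)
    xbar = sum(observed) / n
    s2 = sum((x - xbar) ** 2 for x in observed) / (n - 1)
    imputations = []
    for _ in range(m):
        # sigma^2 | data  ~  (n-1) s^2 / chi^2_{n-1}
        chi2 = sum(random.gauss(0, 1) ** 2 for _ in range(n - 1))
        sigma2 = (n - 1) * s2 / chi2
        # mu | sigma^2, data  ~  N(xbar, sigma^2 / n)
        mu = random.gauss(xbar, (sigma2 / n) ** 0.5)
        # missing values | mu, sigma^2  ~  N(mu, sigma^2)
        imputations.append([random.gauss(mu, sigma2 ** 0.5)
                            for _ in range(n_missing)])
    return imputations
```

Each of the *m* lists would be appended to the observed data and analyzed separately, with the results combined by the rules described above. Realistic multivariate problems replace these closed-form draws with MCMC.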

Bryk, A.S. and Raudenbush, S.W. (1992) *Hierarchical
Linear Models*. Sage, Newbury Park.

Gilks, W.R., Richardson, S. & Spiegelhalter, D.J. (Eds.)
(1996) *Markov Chain Monte Carlo in Practice*. Chapman & Hall,
London.

Little, R.J.A. and Rubin, D.B. (1987) *Statistical
Analysis with Missing Data*. J. Wiley & Sons, New York.

Meng, X.L.(1995) Multiple-imputation inferences with uncongenial
sources of input (with discussion). *Statistical Science*,
10, 538-573.

Rubin, D.B. (1976) Inference and missing data. *Biometrika*,
63, 581-592.

Rubin, D.B. (1987) *Multiple Imputation for Nonresponse in
Surveys*. J. Wiley & Sons, New York.

Rubin, D.B. (1996) Multiple imputation after 18+ years
(with discussion). *Journal of the American Statistical
Association*, 91, 473-489.

Schafer, J.L. (1997) *Analysis of Incomplete Multivariate
Data*. Chapman & Hall, London.

Schafer, J.L. (1999) Multiple imputation: a primer. *Statistical
Methods in Medical Research*, in press.

Schafer, J.L. and Olsen, M.K. (1998) Multiple imputation for
multivariate missing-data problems: a data analyst's perspective.
*Multivariate Behavioral Research*, 33, 545-571.
