The multiple imputation FAQ page
This page provides basic information about multiple imputation (MI) in
the form of answers to Frequently Asked Questions (FAQs). For more
extensive, non-technical overviews of MI, see the articles by Schafer and Olsen (1998) and Schafer
(1999). Answers to FAQs regarding MI in large public-use data sets
(e.g. from surveys and censuses) are given by Rubin
(1996).
What is multiple imputation?
Imputation, the practice of 'filling in' missing data with plausible
values, is an attractive approach to analyzing incomplete data. It
apparently solves the missing-data problem at the beginning of the
analysis. However, a naive or unprincipled imputation method may
create more problems than it solves, distorting estimates, standard
errors and hypothesis tests, as documented by Little
and Rubin (1987) and others.
The question of how to obtain valid inferences from imputed data was
addressed by Rubin's (1987) book on multiple
imputation (MI). MI is a Monte Carlo technique in which the missing
values are replaced by m>1 simulated versions, where
m is typically small (e.g. 3-10). In Rubin's method for
'repeated imputation' inference, each of the simulated complete
datasets is analyzed by standard methods, and the results are combined
to produce estimates and confidence intervals that incorporate
missing-data uncertainty. Rubin (1987) addresses
potential uses of MI primarily for large public-use data files from
sample surveys and censuses. With the advent of new computational
methods and software for creating MI's, however, the technique has
become increasingly attractive for researchers in the biomedical,
behavioral, and social sciences whose investigations are hindered by
missing data. These methods are documented in a recent book by Schafer (1997) on incomplete multivariate data.
Is MI the only principled way to handle missing data?
MI is not the only principled method for
handling missing values, nor is it necessarily the best for any given
problem. In some cases, good estimates can be obtained through
weighted estimation procedures. In fully parametric models,
maximum-likelihood estimates can often be calculated directly from the
incomplete data by specialized numerical methods, such as the EM
algorithm. Those procedures may be somewhat more efficient than MI
because they involve no simulation. Given sufficient time and
resources, one could perhaps derive a better statistical procedure
than MI for any particular problem. In real-life applications,
however, where missing data are a nuisance rather than the primary
focus, an easy, approximate solution with good properties can be
preferable to one that is more efficient but problem-specific and
complicated to implement.
Why are only a few imputations needed?
Many are surprised by the claim that only 3-10 imputations may be
needed. Rubin (1987, p. 114) shows that the efficiency of an estimate
based on m imputations is approximately
$$\left(1 + \frac{\gamma}{m}\right)^{-1},$$
where $\gamma$ is the rate of missing
information for the quantity being estimated.
With $\gamma = 0.3$, for example, m = 5 imputations give an efficiency
of $1/(1 + 0.3/5) \approx 94\%$, and m = 10 give about 97%.
Unless the rate of missing information is very high, there is simply
little advantage to producing and analyzing
more than a few imputed datasets.
See how the rate of missing information is estimated.
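Rubin's approximation, efficiency = 1/(1 + gamma/m), is easy to tabulate. The Python sketch below evaluates it on an illustrative grid; the function names `mi_efficiency` and `efficiency_table` and the grid values are assumptions made for this example and do not come from any MI software.

```python
def mi_efficiency(gamma, m):
    """Approximate efficiency of an estimate based on m imputations,
    relative to infinitely many imputations: 1 / (1 + gamma / m)."""
    if m < 1 or not 0.0 <= gamma <= 1.0:
        raise ValueError("need m >= 1 and 0 <= gamma <= 1")
    return 1.0 / (1.0 + gamma / m)


def efficiency_table(gammas=(0.1, 0.3, 0.5, 0.7, 0.9), ms=(3, 5, 10, 20)):
    """Return rows of (m, [efficiency in percent for each gamma])."""
    return [(m, [round(100 * mi_efficiency(g, m)) for g in gammas]) for m in ms]


if __name__ == "__main__":
    gammas = (0.1, 0.3, 0.5, 0.7, 0.9)
    # Header row: rates of missing information; one row per m.
    print("m    " + "  ".join(f"{g:>4.1f}" for g in gammas))
    for m, row in efficiency_table(gammas):
        print(f"{m:<4d} " + "  ".join(f"{e:>4d}" for e in row))
```

With half of the information missing (gamma = 0.5), three imputations already give roughly 86% efficiency and ten give roughly 95%.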
How does one create multiple imputations?
Except in trivial settings, the probability
distributions that one must draw from to produce proper MI's tend to
be complicated and intractable. Recently, however, a variety of new
simulation methods have appeared in the statistical literature. These
methods, known as Markov chain Monte Carlo (MCMC), have spawned a
small revolution in Bayesian analysis and applied parametric modeling
(Gilks, Richardson & Spiegelhalter, 1996).
Schafer (1997) has adapted
and implemented MCMC methods for the purpose of multiple
imputation. In particular, he has written general-purpose MI software
for incomplete multivariate data. These programs may be downloaded
free of charge at our website.
The imputation model
In order to generate imputations for the missing values, one must
impose a probability model on the complete data (observed and missing
values). Each of our software packages applies a different class of
multivariate complete-data models. NORM uses the multivariate normal
distribution. CAT is based on loglinear models, which have been
traditionally used by social scientists to describe associations among
variables in cross-classified data. The MIX program relies on the
general location model, which combines a loglinear model for
the categorical variables with a multivariate normal regression for
the continuous ones. Details of these models are given by Schafer (1997).
The newer package PAN uses a multivariate extension of a
popular two-level linear regression model commonly applied to
multilevel data (e.g. Bryk & Raudenbush, 1992).
The PAN model is
appropriate for describing multiple variables collected on a sample of
individuals over time, or multiple variables collected on individuals
who are grouped together into larger units (e.g. students within
classrooms).
What if the imputation model is wrong?
Experienced analysts know that real data rarely conform to convenient
models such as the multivariate normal. In most applications of MI,
the model used to generate the imputations will at best be only
approximately true. Fortunately, experience has repeatedly shown that
MI tends to be quite forgiving of departures from the imputation
model. For example, when working with binary or ordered categorical
variables, it is often acceptable to impute under a normality
assumption and then round off the continuous imputed values to the
nearest category. Variables whose distributions are heavily skewed
may be transformed (e.g. by taking logarithms) to approximate
normality and then transformed back to their original scale after
imputation.
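The two devices just described, rounding continuous imputations to the nearest category and imputing a skewed variable on a transformed scale, can be sketched in Python with NumPy. This is a minimal illustration, not the behavior of any particular package: `round_to_categories` and `impute_on_log_scale` are hypothetical helper names, and `impute_normal` stands for whatever normal-model imputation routine is actually in use.

```python
import numpy as np


def round_to_categories(imputed, categories):
    """Map each continuous imputed value to the nearest allowed category."""
    values = np.asarray(imputed, dtype=float)
    cats = np.asarray(categories, dtype=float)
    # Distance from every value to every category; pick the closest one.
    nearest = np.abs(values[:, None] - cats[None, :]).argmin(axis=1)
    return cats[nearest]


def impute_on_log_scale(y, impute_normal):
    """Impute a positive, skewed variable on the log scale, then back-transform.

    `impute_normal` is a hypothetical callable: it takes an array with NaN
    for missing entries and returns a completed array under a normal model.
    """
    y = np.asarray(y, dtype=float)
    completed_log = impute_normal(np.log(y))  # NaN entries stay NaN under log
    return np.exp(completed_log)              # return to the original scale
```

For a binary variable coded 0/1, `round_to_categories(values, [0, 1])` sends every imputed value below 0.5 to 0 and every value above 0.5 to 1.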
What is the relationship between the model used for imputation and the model used for analysis?
An imputation model should be chosen to be (at least approximately)
compatible with the analyses to be performed on the imputed datasets.
The imputation model should be rich enough to preserve the
associations or relationships among variables that will be the focus
of later investigation. For example, suppose that a variable
Y is imputed under a normal model that includes the variable
X1. After imputation, the analyst then uses linear
regression to predict Y from X1 and another variable
X2 which was not in the imputation model. The estimated
coefficient for X2 from this regression would tend to be
biased toward zero, because Y has been imputed without regard
for its possible relationship with X2. In general, any
association that may prove important in subsequent analyses should be
present in the imputation model.
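A small simulation can make the attenuation visible. The sketch below is an assumption-laden illustration rather than a proper MI procedure: the data-generating model, the sample size, and the function names are invented for the example, and Y is imputed only once from a regression on X1 with parameters held at their estimates, which is enough to show the bias in the point estimate.

```python
import numpy as np


def ols(y, *predictors):
    """Least-squares coefficients for y on an intercept plus predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def simulate_omitted_variable(n=20000, frac_missing=0.5, seed=0):
    """Return the estimated X2 coefficient after imputing Y without X2."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 1.0 + x1 + x2 + rng.normal(size=n)   # true coefficient of X2 is 1
    missing = rng.random(n) < frac_missing    # Y missing completely at random

    # Imputation model uses X1 only: regress observed Y on X1 ...
    a, b = ols(y[~missing], x1[~missing])
    resid_sd = np.std(y[~missing] - (a + b * x1[~missing]), ddof=2)
    # ... and draw imputations that ignore X2 entirely.
    y_imp = y.copy()
    y_imp[missing] = a + b * x1[missing] + rng.normal(
        scale=resid_sd, size=missing.sum()
    )

    # Analysis model includes X2, which the imputation model left out.
    return ols(y_imp, x1, x2)[2]
```

With half of the Y values imputed this way, the estimated coefficient for X2 lands near 0.5 rather than the true value of 1, because the imputed half of the data carries no association between Y and X2.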
The converse of this rule, however, is not necessary. If Y
has been imputed under a model that includes X2, there is no
need to include X2 in future analyses involving Y
unless its relationship to Y is of substantive
interest. Results pertaining to Y are not biased by the
inclusion of extra variables in the imputation phase. Therefore, a
rich imputation model that preserves a large number of associations is
desirable because it may be used for a variety of post-imputation
analyses.
A detailed discussion of the interrelationships between the model used
for imputation and the model used for analysis is given by Meng (1995) and Rubin (1996).
Above all, the processes of imputation and analysis should be guided
by common sense. For example, suppose that variables with skewed,
truncated, or heavy-tailed distributions are, for the sake of
convenience, imputed under an assumption of joint normality. Analyses
that depend primarily on means, variances, and covariances, such as
regression or principal-component methods, should perform reasonably
well even though the imputer's model is too simplistic. On the other
hand, common sense would suggest that the same imputations ought not
be used for estimation of 5th or 95th percentiles, or other analyses
sensitive to non-normal shape.
How do I combine the results across the multiply imputed sets of data?
Rubin (1987) presented this method for combining
results from a data analysis performed m times, once for each
of m imputed data sets, to obtain a single set of results.
From each analysis, one must first calculate and save the estimates
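and standard errors. Suppose that $\hat{Q}_j$
is an estimate of a scalar quantity of interest (e.g. a regression
coefficient) obtained from data set j
(j=1,2,...,m) and $\sqrt{U_j}$ is the
standard error associated with $\hat{Q}_j$, so that $U_j$ is its
estimated variance. The
overall estimate is the average of the individual estimates,
$$\bar{Q} = \frac{1}{m} \sum_{j=1}^{m} \hat{Q}_j.$$
For the overall standard error, one must first calculate the
within-imputation variance,
$$\bar{U} = \frac{1}{m} \sum_{j=1}^{m} U_j,$$
and the between-imputation variance,
$$B = \frac{1}{m-1} \sum_{j=1}^{m} \left(\hat{Q}_j - \bar{Q}\right)^2.$$
The total variance is
$$T = \bar{U} + \left(1 + \frac{1}{m}\right) B.$$
The overall standard error is the square root of T. Confidence
intervals are obtained by taking the overall estimate plus or minus a
number of standard errors, where that number is a quantile of
Student's t-distribution with degrees of freedom
$$df = (m-1) \left(1 + \frac{m \bar{U}}{(m+1) B}\right)^2.$$
A significance test of the null hypothesis Q=0 is performed
by comparing the ratio
$$t = \frac{\bar{Q}}{\sqrt{T}}$$
to the same
t-distribution.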
Additional methods for combining the results from multiply imputed
data are reviewed by Schafer (1997, Ch. 4).
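Rubin's rules for a scalar estimand amount to a few lines of arithmetic: average the m estimates, average the m squared standard errors (within-imputation variance), take the sample variance of the m estimates (between-imputation variance), and add the within variance to (1 + 1/m) times the between variance. A minimal Python sketch follows; the function name `pool_scalar` and the dictionary keys are assumptions made for illustration.

```python
import math


def pool_scalar(estimates, std_errors):
    """Combine m point estimates and standard errors by Rubin's (1987) rules."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need m >= 2 estimates with matching standard errors")
    q_bar = sum(estimates) / m                               # overall estimate
    u_bar = sum(se ** 2 for se in std_errors) / m            # within-imputation
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation
    t = u_bar + (1 + 1 / m) * b                              # total variance
    # Degrees of freedom for the t reference distribution; infinite when
    # the estimates do not vary across imputations.
    df = (m - 1) * (1 + m * u_bar / ((m + 1) * b)) ** 2 if b > 0 else math.inf
    return {
        "estimate": q_bar,
        "within": u_bar,
        "between": b,
        "total": t,
        "se": math.sqrt(t),
        "df": df,
    }
```

For example, `pool_scalar([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])` gives an overall estimate of 2, a total variance of 7/3, and 6.125 degrees of freedom. A confidence interval would then use a t quantile with `df` degrees of freedom, taken from a statistics library of the reader's choice.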
What is the rate of missing information?
When performing a multiply-imputed analysis, the variation in results
across the imputed data sets reflects statistical uncertainty due to
missing data. Rubin's (1987) rules for MI
inference provide some diagnostic measures that indicate how
strongly the quantity being estimated is influenced by missing
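data. The estimated rate of missing information is
$$\gamma = \frac{r + 2/(df+3)}{r+1},$$
where
$$r = \frac{(1 + 1/m)\, B}{\bar{U}}$$
is the relative increase in variance due to nonresponse. Here B and
$\bar{U}$ are the between- and within-imputation variances, and df is
the degrees of freedom of the t-distribution used for MI inference.
The rate of missing information, together with the number of
imputations m, determines the relative efficiency of the MI
inference; see 'Why are only a few imputations needed?'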
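These diagnostics can be computed from the same m estimates and standard errors that feed the combining rules. The Python sketch below assumes the standard expressions r = (1 + 1/m)B/U-bar, df = (m - 1)(1 + 1/r)^2 and gamma = (r + 2/(df + 3))/(r + 1); the function name `missing_information` is an illustrative choice.

```python
def missing_information(estimates, std_errors):
    """Return (r, df, gamma): relative increase in variance, degrees of
    freedom, and estimated rate of missing information for a scalar estimand."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need m >= 2 estimates with matching standard errors")
    q_bar = sum(estimates) / m
    u_bar = sum(se ** 2 for se in std_errors) / m            # within-imputation
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation
    if u_bar <= 0 or b <= 0:
        raise ValueError("both variance components must be positive")
    r = (1 + 1 / m) * b / u_bar          # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2      # same df as in the combining rules
    gamma = (r + 2 / (df + 3)) / (r + 1)
    return r, df, gamma
```

With estimates 1, 2, 3 and unit standard errors, r is 4/3, df is 6.125, and the estimated rate of missing information is about 0.67, a case where a handful of imputations would be too few.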
Is multiple imputation a Bayesian procedure?
Partly yes and partly no. When imputations are created under Bayesian
arguments (and they usually are), MI has a natural interpretation as
an approximate Bayesian inference for the quantities of interest based
on the observed data. The validity of MI, however, does not require
one to fully subscribe to the Bayesian paradigm. Rubin
(1987) provides technical conditions under which MI leads to
frequency-valid answers. An imputation method which satisfies these
conditions is said to be 'proper.'
Rubin's definition of 'proper', like many frequentist criteria, is
useful for evaluating the properties of a given method but provides
little guidance for one seeking to create such a method in practice.
For this reason, Rubin recommends that imputations be created
through a Bayesian process: specify a parametric model for the
complete data (and, if necessary, a model for the mechanism by which
data become missing), apply a prior distribution to the unknown model
parameters, and simulate m independent draws from the conditional
distribution of the missing data given the observed data by Bayes'
Theorem. In simple problems, the computations necessary for creating
MI's can be performed explicitly through formulas. In non-trivial
applications, special computational techniques such as Markov
chain Monte Carlo (MCMC) must be applied.
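One of the simple problems in which the Bayesian recipe can be carried out explicitly is a single normally distributed variable with some values missing. The sketch below is an illustration under stated assumptions, not the algorithm used by any of the packages mentioned on this page: it assumes ignorable missingness, the standard noninformative prior p(mu, sigma^2) proportional to 1/sigma^2, and a hypothetical function name `impute_univariate_normal`.

```python
import numpy as np


def impute_univariate_normal(y, m=5, seed=0):
    """Return m completed copies of y (NaN marks a missing value) under a
    normal model with prior p(mu, sigma^2) proportional to 1 / sigma^2."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    obs = y[~missing]
    n = obs.size
    if n < 2:
        raise ValueError("need at least two observed values")
    y_bar, s2 = obs.mean(), obs.var(ddof=1)

    completed = []
    for _ in range(m):
        # Step 1: draw the parameters from their posterior given observed data.
        sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)
        mu = rng.normal(y_bar, np.sqrt(sigma2 / n))
        # Step 2: draw the missing values given the drawn parameters.
        y_star = y.copy()
        y_star[missing] = rng.normal(mu, np.sqrt(sigma2), size=missing.sum())
        completed.append(y_star)
    return completed
```

Drawing fresh parameters for each imputation, instead of reusing the estimates, is what carries parameter uncertainty into the imputed values. Each of the m completed vectors keeps the observed values unchanged and differs only in the imputed entries.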
Removing incomplete cases is so much easier than multiple imputation; why can't I just do that?
The shortcomings of various case-deletion strategies have been well
documented (e.g. Little & Rubin, 1987). If the
discarded cases form a representative and relatively small portion of
the entire dataset, then case deletion may indeed be a reasonable
approach. However, case deletion leads to valid inferences in general
only when missing data are missing completely at random in the sense
that the probabilities of response do not depend on any data values
observed or missing. In other words, case deletion implicitly assumes
that the discarded cases are like a random subsample. When the
discarded cases differ systematically from the rest, estimates may be
seriously biased. Moreover, in multivariate problems, case deletion
often results in a large portion of the data being discarded and an
unacceptable loss of power.
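The bias that arises when discarded cases differ systematically from the rest can be seen in a short simulation. The setup below is invented for illustration (the model, the response probabilities, and the function name `complete_case_bias` are all assumptions): Y has true mean 0, and Y is much less likely to be observed when a correlated covariate X is positive.

```python
import numpy as np


def complete_case_bias(n=20000, seed=0):
    """Return (full-data mean of Y, complete-case mean of Y, fraction observed)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = x + rng.normal(size=n)               # true mean of Y is 0
    # Response depends on X, so the data are not missing completely at random:
    # Y is observed with probability 0.2 when X > 0 and always otherwise.
    p_observed = np.where(x > 0, 0.2, 1.0)
    observed = rng.random(n) < p_observed
    return y.mean(), y[observed].mean(), observed.mean()
```

In this setup about 60% of cases are retained, yet the complete-case mean of Y settles near -0.5 instead of 0, and collecting more data does not shrink that gap.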
Why can't I just impute once?
If the proportion of missing values is small, then single
imputation may be quite reasonable. Without special corrective
measures, single-imputation inference tends to overstate precision
because it omits the between-imputation component of variability. When
the fraction of missing information is small (say, less than 5%) then
single-imputation inferences for a scalar estimand may be fairly
accurate. For joint inferences about multiple parameters, however,
even small rates of missing information may seriously impair a
single-imputation procedure. In modern computing environments, the
effort required to produce and analyze a multiply-imputed dataset is
often not substantially greater than what is required for good single
imputation.
Is multiple imputation like EM?
MI bears a close resemblance to the EM algorithm and other
computational methods for calculating maximum-likelihood estimates
based on the observed data alone. These methods summarize a likelihood
function which has been averaged over a predictive distribution for
the missing values. MI performs this same type of averaging by Monte
Carlo rather than by numerical methods. In large samples, when
relevant aspects of the imputer's and analyst's models agree,
inferences obtained by MI with sufficiently many imputations will be
nearly the same as those obtained by direct maximization of the
likelihood.
Is multiple imputation related to MCMC?
Markov chain Monte Carlo (MCMC) is a collection of methods for
simulating random draws from nonstandard distributions via Markov
chains. MCMC is one of the primary methods for generating MI's in
nontrivial problems. In much of the existing literature on MCMC
(e.g. Gilks, Richardson & Spiegelhalter, 1996, and
their references) MCMC is used for parameter simulation, for creating
a large number of (typically dependent) random draws of parameters
from Bayesian posterior distributions under complicated parametric
models. In MI-related applications, however, MCMC is used to create a
small number of independent draws of the missing data from a
predictive distribution, and these draws are then used for
multiple-imputation inference. In many cases it is possible to conduct
an analysis either by parameter simulation or by multiple imputation.
Parameter simulation tends to work well when interest is confined to
a small number of well-defined parameters, whereas multiple imputation
is more attractive for exploratory or multi-purpose analyses involving
a large number of estimands. Generating and storing 10 versions of the
missing data is often more efficient than generating and storing the
hundreds or thousands of dependent draws that would be required to
achieve a comparable degree of precision through parameter simulation.
What if the missing data are not 'missing at random'?
Most of the techniques presently available for creating MI's assume
that the missing values are 'missing at random' (MAR) in the sense
defined by Rubin (1976) and Little
and Rubin (1987). That is, they assume that, given the observed data,
the missing values carry no further information about the
probabilities of missingness. This
assumption, commonly referred to as ignorability, is mathematically
convenient because it allows one to
eschew an explicit probability model for nonresponse. In some
applications, however, ignorability may seem artificial or
implausible. With attrition in a longitudinal study, for example, it
is possible that subjects drop out for reasons related to current data
values. It is important to note that the MI paradigm does not require
or assume that nonresponse is ignorable. Imputations may in principle
be created under any kind of assumptions or model for the missing-data
mechanism, and the resulting inferences will be valid under that
mechanism.
Isn't multiple imputation just making up data?
When MI is presented to a new audience, some may view it as a kind of
statistical alchemy in which information is somehow invented or
created out of nothing. This objection is quite valid for
single-imputation methods, which treat imputed values no differently
from observed ones. MI, however, is nothing more than a device for
representing missing-data uncertainty. Information is not being
invented with MI any more than with EM or other well accepted
likelihood-based methods, which average over a predictive distribution
for the missing data by numerical techniques rather than by simulation.
References
Bryk, A.S. and Raudenbush, S.W. (1992) Hierarchical
Linear Models. Sage, Newbury Park.
Gilks, W.R., Richardson, S. and Spiegelhalter, D.J. (Eds.)
(1996) Markov Chain Monte Carlo in Practice. Chapman & Hall,
London.
Little, R.J.A. and Rubin, D.B. (1987) Statistical
Analysis with Missing Data. J. Wiley & Sons, New York.
Meng, X.L. (1995) Multiple-imputation inferences with uncongenial
sources of input (with discussion). Statistical Science,
10, 538-573.
Rubin, D.B. (1976) Inference and missing data. Biometrika,
63, 581-592.
Rubin, D.B. (1987) Multiple Imputation for Nonresponse in
Surveys. J. Wiley & Sons, New York.
Rubin, D.B. (1996) Multiple imputation after 18+ years
(with discussion). Journal of the American Statistical
Association, 91, 473-489.
Schafer, J.L. (1997) Analysis of Incomplete Multivariate
Data. Chapman & Hall, London.
Schafer, J.L. (1999) Multiple imputation: a primer. Statistical
Methods in Medical Research, in press.
Schafer, J.L. and Olsen, M.K. (1998)