Bootstrapping
and Monte Carlo methods
A Monte Carlo method generally refers to a method that relies on simulated
random numbers in some way. For instance, bootstrapping may be considered to be
a particular case of a Monte Carlo method, since it relies on random resampling.
Monte
Carlo integration and importance sampling
Most of this module will focus on bootstrapping, but we begin with a toy example
illustrating Monte Carlo methods in general.
Quite often a quantity of interest in statistics may be expressed as an integral
that we wish to evaluate. For instance, in Bayesian analysis, one is often interested
in the posterior mean of a particular parameter, which may be expressed as an integral.
For purely illustrative purposes, suppose that we wish to integrate the following
function over the integral (0,1):
y=function(x) sin(x*(1-x))/(1+x+sqrt(x))
There does not appear to be any closed-form solution, so we turn to numerical methods.
R does have a function,
integrate,
that performs adaptive quadrature of functions of one variable:
integrate(y,0,1)
To approximate this integral using Monte Carlo methods, we note that it may be
written as E(f(U)), the expectation of f(U), where U is a uniform random variable
on the integral (0,1). As is typical in statistics, we will use a sample mean
to approximate the true mean E(f(U)):
u=runif(10000)
mean(y(u))
To get a sense of how precise our estimate is (pretending for the purpose of illustration
that we do not know the correct value of the integral), we can estimate the standard deviation
of this sample mean as follows:
sqrt(var(y(u))/10000)
By the central limit theorem, we know that our sample mean will be roughly normally
distributed, so we can give a 95% confidence interval for the correct value by
adding and subtracting 1.96 times the above standard deviation.
Next, we consider a method for reducing the variance of our
estimator known as importance sampling. Suppose we have a random variable Z
with density function f(z). Then the integral of y(x) that we wish to find may
be rewritten as the expectation of y(Z)/f(Z). (To see why, recall that to find
the expectation of a function of Z, we multiply by the density f(Z) and then
integrate.) In our toy example, suppose that Z has a beta density with parameters
(2, 2.5):
z=rbeta(10000,2,2.5)
mean(y(z)/dbeta(z,2,2.5))
Now for the standard deviation calculation:
sqrt(var(y(z)/dbeta(z,2,2.5))/10000)
Note that this standard deviation is roughly one third of the original standard deviation.
The reason for this is that the beta(2,2.5) density happens to be somewhat close in shape to
a constant times the function y(x) that we are trying to integrate, as seen in the following plot:
x=seq(0,1,len=200)
plot(x,y(x),type="l",
main="Solid line: y(x)")
k = y(.5)/dbeta(.5,2,2.5)
lines(x,k*dbeta(x,2,2.5),lty=2,col=2)
The intuition behind importance sampling, therefore, is that y(z)/dbeta(z,2,2.5) is much closer
to a constant (and therefore has smaller variance) than y(z) itself. In this toy example,
importance sampling reduces the standard deviation to about one third of its original size.
Stated differently, we need a sample roughly nine times as large without importance sampling
as we do with importance sampling in this example
in order to obtain a result of comparable precision.
Nonparametric bootstrapping
of regression standard errors
Here, we revisit the
Hipparcos
data set and the 92 stars identified as Hyades stars in the
exploratory data analysis and regression
module:
hip = read.table("http://astrostatistics.psu.edu/datasets/HIP_star.dat",
header=T,fill=T)
attach(hip)
x1=(RA>50 & RA<100)
x2=(pmRA>90 & pmRA<130)
x3=(pmDE>-60 & pmDE< -10) # Space in '< -' is necessary!
x4=x1&x2&x3
x5=x4 & (DE>0) & (e_Plx<5)
Earlier, we considered a linear model for the relationship
between DE and pmDE among these 92 stars:
plot(DE[x5],pmDE[x5],pch=20)
m=lm(pmDE[x5] ~ DE[x5])
abline(m,lwd=2,col=2)
The red line on the plot is the usual least-squares line, for which estimation is
easy and asymptotic theory gives easy-to-calculate standard errors for the
coefficients:
summary(m)$coef
However, suppose we wish to use a resistant regression method such as
lqs.
library(MASS)
m2=lqs(pmDE[x5] ~ DE[x5])
abline(m2,lwd=2,col=3)
m2
In this case, it is not so easy to obtain standard errors for the coefficients.
Thus, we will turn to bootstrapping. In a standard, or nonparametric, bootstrap,
we repeatedly draw samples of size 92 from the empirical distribution of the data, which
in this case consist of the (DE, pmDE) pairs. We use
lqs
to fit a line to each sample, then compute the sample covariance of the resulting
coefficient vectors. The procedure works like this:
x=DE[x5]
y=pmDE[x5]
m2B=NULL
for (i in 1:200) {
s=sample(92,replace=T)
m2B=rbind(m2B, lqs(y[s]~x[s])$coef)
}
We may now find the sample covariance matrix for m2B. The (marginal) standard errors
of the coefficients are obtained as the square roots of the diagonal entries of this matrix:
cov(m2B)
se=sqrt(diag(cov(m2B)))
se
The logic of the bootstrap procedure is that we are estimating an approximation
of the true standard errors. The approximation involves replacing the true distribution
of the data (unknown) with the empirical distribution of the data. This approximation
may be estimated with arbitrary accuracy by a Monte Carlo approach, since the empirical
distribution of the data is known and in principle
we may sample from it as many times as we wish. In other words, as the bootstrap sample
size increases, we get a better estimate of the true value of the approximation.
On the other hand, the quality of this approximation depends on the original sample size (92, in
our example) and
there is nothing we can do to change it.
An alternative way to generate a bootstrap sample
in this example is by generating a new value of each response variable (y) by adding the predicted
value from the original lqs model to a randomly selected residual from the original set of residuals.
Thus, we resample not the entire bivariate structure but merely the residuals. As an exercise, you
might try implementing this approach in R. Note that this approach is not a good idea if you have reason
to believe that the distribution of residuals is not the same for all points. For instance, if there is
heteroscedasticity or if the residuals contain structure not explained by the model, this residual resampling
approach is not warranted.
Using the
boot
package in R
There is a
boot
package in R that contains many functions relevant to bootstrapping.
As a quick example, we will show here how to obtain the same kind of bootstrap example
obtained above (for the lqs model of pmDE regressed on DE for the Hyades stars.)
library(boot)
mystat=function(a,b)
lqs(a[b,2]~a[b,1])$coef
m2B2 = boot(cbind(DE[x5],pmDE[x5]),
mystat, 200)
names(m2B2)
As explained in the help file, the
boot
function requires as input a function that accepts as arguments the whole dataset
and an index that references an observation from that dataset. This is why we defined
the mystat function above.
To see the output that is similar to that obtained earlier for the m2B object, look in
m2B2$t:
cov(m2B2$t)
sqrt(diag(cov(m2B2$t)))
Compare with the output provided by
print.boot and the plot produced by
plot.boot:
m2B2
plot(m2B2)
Another related function, for producing bootstrap confidence intervals, is
boot.ci.
Parametric bootstrapping
of regression standard errors
We now return to the regression problem studied earlier.
Sometimes, resampling is done from a theoretical distribution rather than
from the original sample. For instance, if simple linear regression is applied to the
regression of pmDE on DE, we obtain a parametric estimate of the distribution of the residuals,
namely, normal with mean zero and standard deviation estimated from the regression:
summary(m)
Remember that m was defined above as lm(pmDE[x5]~DE[x5]). We observe a residual standard
error of 4.449.
A parametric bootstrap scheme proceeds by simulating a new set of pmDE (or y) values using the
model
y = 21.9 - 3.007*DE[x5] + 4.449*rnorm(92)
Then, we refit a linear model using y as the new response,
obtaining slightly different values of the regression coefficients.
If this is repeated, we obtain an approximation of the joint distribution of the regression coefficients
for this model.
Naturally, the same approach could be tried with other regression methods such as those discussed
in the
EDA and regression module,
but careful thought should be
given to the parametric model used to generate the new residuals. In the normal case discussed here,
the parametric bootstrap is simple, but it is really not necessary
because standard linear regression already gives a very good approximation to the joint distribution
of the regression coefficients when errors are heteroscedastic and normal.
One possible use of this method is in a model that assumes the absolute residuals are exponentially
distributed, in which case least absolute deviation regression as discussed in the
EDA and regression module can be justified. The reader is encouraged to
implement a parametric bootstrap using the
rq
function found in the "quantreg" package.
Estimating
a p-value (parametric bootstrap for Kolmogorov-Smirnov)
In the
hypothesis testing
module, we conducted a permutation test of the null hypothesis that
the 92 Hyades stars have the same distribution of B minus V as the 2586
non-Hyades stars. Because there are 2678!/(92!2586!)=3.78x10142
different ways to select 92 objects from 2678 objects, we had to depend on a
random sample of reassignments:
H=B.V[x5]
nH=B.V[!x5 & !is.na(B.V)]
tlist=NULL
all=c(H,nH)
for(i in 1:5000) {
s=sample(2586,92) # choose a sample
tlist=c(tlist, t.test(all[s],all[-s],
var.eq=T)$stat) # add t-stat to list
}
Even though the sample in tlist appears to be very close to standard normal according to a
histogram and a Q-Q plot, the
Kolmogorov-Smirnov test rejects the null hypothesis of standard normality:
ks.test(tlist,"pnorm")
However, if we use a Kolmogorov-Smirnov test in which we allow the mean of the theoretical
normal distribution to be estimated from the sample, we see no evidence of a departure from
normality according to the usual output:
newkstest=ks.test(tlist,"pnorm",mean=mean(tlist))
newkstest
Note in the above function call that
ks.test
passes the "mean=" option directly to the
pnorm
function. The
ks.test
function does not know beforehand how many, if any, options might be passed
to the function it requires; therefore, the function definition of
ks.test
uses three dots (...) in its option list to stand for some unspecified number of options.
Unfortunately, the Kolmogorov-Smirnov test is not quite valid if the theoretical distribution
against which we are testing depends upon estimates derived from the data. (Imagine the extreme
case: We could test the sample distribution against itself, which is simply an estimate derived from
the data!) Therefore, we can get a better sense of how valid the second p-value is by using a
Monte Carlo approach.
To carry out the Monte Carlo test, observe that the stat statistic for our K-S test by typing
newkstest$stat
The value that we seek is the probability that X is at least as large as this observed statistic,
where X is the random test statistic generated when we select a sample of size 5000 from the null
distribution:
ksstat=NULL
for(i in 1:1000) {
x=rnorm(5000)
ksstat=c(ksstat,
ks.test(x,pnorm,mean=mean(x))$stat)
}
Now let's look at a histogram of the test statistics and estimate a p-value:
hist(ksstat,nclass=40)
ns=newkstest$stat
abline(v=ns,lty=2,col=2)
mean(ksstat>=ns)
Note that the approximate p-value above is smaller than the original
p-value reported by newkstest,
though it is still not small enough to provide strong evidence that the
tlist sample is not normal with unit variance.
We conclude this section by briefly explaining the parenthetical
remark "parametric bootstrap for Kolmogorov-Smirnov" in its title and
discussing a related nonparametric bootstrap method.
The above procedure is equivalent to a parametric bootstrap procedure
in which we draw samples not from a standard normal, but from a normal
distribution with mean equal to mean(tlist). This is because the "mean=mean(x)"
option implies that we are applying
a particular K-S test that is invariant to the mean of the distribution.
An interesting exercise is to implement a nonparametric bootstrap
in this example that would be based only on resampling. In this case,
we cannot merely repeatedly use
s=sample(5000,replace=T)
ks.test(tlist[s],pnorm,mean=mean(tlist[s]))$stat
The reason is because this statistic is biased. Therefore, it is necessary to correct for
the bias. Unfortunately, this is not a trivial correction and it essentially requires
a modified version of ks.test: Whereas the ordinary ks.test statistic is based on the maximum
distance between the EDF of the sample and the theoretical CDF, the modified version must
subtract off the bias before finding the maximum.