Hypothesis testing and bootstrapping

This tutorial demonstrates some of the many statistical tests that R can perform. It is impossible to give an exhaustive list of such testing functionality, but we hope not only to provide several examples but also to elucidate some of the logic of statistical hypothesis tests with these examples.

Datasets and other files used in this tutorial:

   HIP_star.dat (http://astrostatistics.psu.edu/datasets/HIP_star.dat)

T tests

In the exploratory data analysis and regression tutorial, we used exploratory techniques to identify 92 stars from the Hipparcos data set that are associated with the Hyades. We did this based on the values of right ascension, declination, proper motion of right ascension, and proper motion of declination. We then excluded one additional star with a large error of parallax measurement:
#   hip <- read.table("http://astrostatistics.psu.edu/datasets/HIP_star.dat",
#      header=T, fill=T)
   hip <- read.table("HIP_star.dat", header=T, fill=T)
   attach(hip)
   filter1 <-  (RA>50 & RA<100 & DE>0 & DE<25)
   filter2 <- (pmRA>90 & pmRA<130 & pmDE>-60 & pmDE< -10) 
   filter  <-  filter1 & filter2 & (e_Plx<5) 
   sum(filter)
In this section of the tutorial, we will compare these Hyades stars with the remaining stars in the Hipparcos dataset on the basis of the color (B minus V) variable. That is, we are comparing the groups in the boxplot below:
   color <- B.V
   boxplot(color~filter,notch=T)
For ease of notation, we define vectors H and nH (for "Hyades" and "not Hyades") that contain the data values for the two groups.
   H <- color[filter]
   nH <- color[!filter & !is.na(color)]
In the definition of nH above, we needed to exclude the NA values.
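As a quick check (nothing new here, just the is.na function applied to color), we can count how many stars are missing a B.V measurement:
   sum(is.na(color))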

A two-sample t-test may now be performed with a single line:
   t.test(H,nH)
Because it is instructive and quite easy, we may obtain the same results without resorting to the t.test function. First, we calculate the variances of the sample means for each group:
   v1 <- var(H)/92      # variance of the sample mean of H (n = 92)
   v2 <- var(nH)/2586   # variance of the sample mean of nH (n = 2586)
   c(var(H),var(nH))    # the two sample variances, for comparison
The t statistic is based on the standardized difference between the two sample means. Because the two samples are assumed independent, the variance of this difference equals the sum of the individual variances (i.e., v1+v2). Nearly always in a two-sample t-test, we wish to test the null hypothesis that the true difference in means equals zero. Thus, standardizing the difference in means involves subtracting zero and then dividing by the square root of the variance:
   tstat <- (mean(H)-mean(nH))/sqrt(v1+v2)
   tstat
To test the null hypothesis, this t statistic is compared to a t distribution. In a Welch test, we assume that the variances of the two populations are not necessarily equal, and the degrees of freedom of the t distribution are computed using the so-called Satterthwaite approximation:
   (v1 + v2)^2 / (v1^2/91 + v2^2/2585)
The two-sided p-value may now be determined by using the cumulative distribution function of the t distribution, which is given by the pt function. Since pt returns the lower-tail probability, the two-sided p-value is twice the tail area beyond the observed statistic:
   2*pt(-abs(tstat),97.534)
Incidentally, one of the assumptions of the t-test, namely that each of the two underlying populations is normally distributed, is almost certainly not true in this example. However, because of the central limit theorem, the t-test is robust against violations of this assumption; even if the populations are not roughly normally distributed, the sample means are.

In this particular example, the Welch test is probably not necessary, since the sample variances are so close that an assumption of equal variances is warranted. Thus, we might conduct a slightly more restrictive t-test that assumes equal population variances. Without going into the details here, we merely present the R output:
   t.test(H,nH,var.equal=T)

Chi-squared tests for categorical data

We begin with a plot very similar to one seen in the exploratory data analysis and regression tutorial:
   bvcat <- cut(color, breaks=c(-Inf,.5,.75,1,Inf))
   boxplot(Vmag~bvcat, varwidth=T,
      ylim=c(max(Vmag),min(Vmag)), 
      xlab=expression("B minus V"),
      ylab=expression("V magnitude"),
      cex.lab=1.4, cex.axis=.8)
The cut values for bvcat are based roughly on the quartiles of the B minus V variable. We have created, albeit artificially, a second categorical variable ("filter", the Hyades indicator, is the first). Here is a summary of the dataset based only on these two variables:
   table(bvcat,filter)
Note that the Vmag variable is irrelevant in the table above.

To perform a chi-squared test of the null hypothesis that the true population proportions falling in the four categories are the same for both the Hyades and non-Hyades stars, use the chisq.test function:
   chisq.test(bvcat,filter)
Since we already know these two groups differ with respect to the B.V variable, the result of this test is not too surprising. But it does give a qualitatively different way to compare these two distributions than simply comparing their means.

The p-value produced above is based on the fact that the chi-squared statistic is approximately distributed like a true chi-squared distribution (on 3 degrees of freedom, in this case) if the null hypothesis is true. However, it is possible to obtain exact p-values, if one wishes to calculate the chi-squared statistic for all possible tables of counts with the same row and column sums as the given table. Since this is rarely practical computationally, the exact p-value may be approximated using a Monte Carlo method (just as we did earlier for the permutation test). Such a method is implemented in the chisq.test function:
   chisq.test(bvcat,filter,simulate.p.value=T,B=50000)
The two different p-values we just generated are numerically similar but based on entirely different mathematics. The difference may be summed up as follows: the first method produces the exact value of an approximate p-value, whereas the second method produces an approximation to the exact p-value!

The test above is usually called a chi-squared test of homogeneity. If we observe only one sample, but we wish to test whether the categories occur in some pre-specified proportions, a similar test (and the same R function) may be applied. In this case, the test is usually called a chi-squared test of goodness-of-fit.
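For example, here is a minimal sketch of such a goodness-of-fit test, using equal proportions in the four color categories as a purely illustrative null hypothesis:
   chisq.test(table(bvcat), p=c(.25,.25,.25,.25))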

Nonparametric bootstrapping of regression standard errors

Let us consider a linear model for the relationship between DE and pmDE among the 92 Hyades stars:
   x <- DE[filter]
   y <- pmDE[filter]
   plot(x,y,pch=20)
   model1 <- lm(y ~ x)
   abline(model1,lwd=2,col=2)
The red line on the plot is the usual least-squares line, for which estimation is easy and asymptotic theory gives easy-to-calculate standard errors for the coefficients:
   summary(model1)$coef
However, suppose we wish to use a resistant regression method such as lqs.
   library(MASS)
   model2 <- lqs(y ~ x)
   abline(model2,lwd=2,col=3)
   model2
In this case, it is not so easy to obtain standard errors for the coefficients. Thus, we will turn to bootstrapping. In a standard, or nonparametric, bootstrap, we repeatedly draw samples of size 92 from the empirical distribution of the data, which in this case consist of the (DE, pmDE) pairs. We use lqs to fit a line to each sample, then compute the sample covariance of the resulting coefficient vectors. The procedure works like this:
   model2B <- matrix(0,200,2)              # one row per bootstrap replicate
   for (i in 1:200) {
      s <- sample(92,replace=T)            # resample the indices 1:92 with replacement
      model2B[i,] <- lqs(y[s]~x[s])$coef   # refit the lqs line to the bootstrap sample
   }
We may now find the sample covariance matrix for model2B. The (marginal) standard errors of the coefficients are obtained as the square roots of the diagonal entries of this matrix:
   cov(model2B)
   se <- sqrt(diag(cov(model2B)))
   se
The logic of the bootstrap procedure is that we are estimating an approximation of the true standard errors. The approximation involves replacing the true distribution of the data (unknown) with the empirical distribution of the data. This approximation may be estimated with arbitrary accuracy by a Monte Carlo approach, since the empirical distribution of the data is known and in principle we may sample from it as many times as we wish. In other words, as the bootstrap sample size increases, we get a better estimate of the true value of the approximation. On the other hand, the quality of this approximation depends on the original sample size (92, in our example) and there is nothing we can do to change it.
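
To see the Monte Carlo aspect concretely, we can simply rerun the loop with more bootstrap replicates; the estimated standard errors should stabilize as the number of replicates grows (the choice of 2000 below is arbitrary, and the loop takes correspondingly longer to run):
   model2C <- matrix(0,2000,2)
   for (i in 1:2000) {
      s <- sample(92,replace=T)
      model2C[i,] <- lqs(y[s]~x[s])$coef
   }
   sqrt(diag(cov(model2C)))   # should be close to se, with less Monte Carlo noise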

An alternative way to generate a bootstrap sample in this example is by generating a new value of each response variable (y) by adding the predicted value from the original lqs model to a randomly selected residual from the original set of residuals. Thus, we resample not the entire bivariate structure but merely the residuals. As an exercise, you might try implementing this approach in R. Note that this approach is not a good idea if you have reason to believe that the distribution of residuals is not the same for all points. For instance, if there is heteroscedasticity or if the residuals contain structure not explained by the model, this residual resampling approach is not warranted.
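
As a starting point for that exercise, here is one possible sketch; it assumes model2, x, and y as defined above, and uses fitted and residuals to extract the corresponding components of the lqs fit:
   fit2 <- fitted(model2)
   res2 <- residuals(model2)
   model2R <- matrix(0,200,2)
   for (i in 1:200) {
      ystar <- fit2 + sample(res2,replace=T)   # fitted values plus resampled residuals
      model2R[i,] <- lqs(ystar~x)$coef
   }
   sqrt(diag(cov(model2R)))   # residual-resampling bootstrap standard errors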


Using the boot package in R

There is a boot package, distributed with R as one of the recommended packages, that contains many functions relevant to bootstrapping. As a quick example, we will show here how to obtain the same kind of bootstrap analysis as above (for the lqs model of pmDE regressed on DE for the Hyades stars).
   library(boot)
   mystat <- function(a,b)
     lqs(a[b,2]~a[b,1])$coef   # regress column 2 (y) on column 1 (x) for the resampled rows b
   model2B.2  <-  boot(cbind(x,y),
     mystat, 200)              # 200 bootstrap replicates
   names(model2B.2)
As explained in the help file, the boot function requires as input a function that accepts as arguments the whole dataset and a vector of indices selecting a bootstrap sample from that dataset. This is why we defined the mystat function above. To see output similar to that obtained earlier for the model2B matrix, look in model2B.2$t:
   cov(model2B.2$t)
   sqrt(diag(cov(model2B.2$t)))
Compare with the output provided by print.boot and the plot produced by plot.boot:
   model2B.2
   plot(model2B.2)
Another related function, for producing bootstrap confidence intervals, is boot.ci.
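For instance, a percentile interval for the slope may be requested as follows (index=2 selects the second coefficient; with only 200 replicates the interval endpoints are rather rough):
   boot.ci(model2B.2, type="perc", index=2)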

Parametric bootstrapping of regression standard errors

We now return to the regression problem studied earlier.

Sometimes, resampling is done from a theoretical distribution rather than from the original sample. For instance, if simple linear regression is applied to the regression of pmDE on DE, we obtain a parametric estimate of the distribution of the residuals, namely, normal with mean zero and standard deviation estimated from the regression:
   summary(model1)
Remember that model1 was defined above as lm(y~x). We observe a residual standard error of 4.449.
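This value can also be extracted directly from the summary object:
   summary(model1)$sigma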

A parametric bootstrap scheme proceeds by simulating a new set of pmDE values (called ynew below, so that the original y is not overwritten) using the fitted model:
   ynew <- 21.9 - 3.007*x + 4.449*rnorm(92)
Then, we refit a linear model using ynew as the new response, obtaining slightly different values of the regression coefficients. If this is repeated many times, we obtain an approximation of the joint distribution of the regression coefficients for this model, as in the sketch below.
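
Here is a minimal sketch of this parametric bootstrap loop, using the coefficient and scale estimates quoted above and 200 replicates:
   model1P <- matrix(0,200,2)
   for (i in 1:200) {
      ynew <- 21.9 - 3.007*x + 4.449*rnorm(92)
      model1P[i,] <- lm(ynew~x)$coef
   }
   sqrt(diag(cov(model1P)))    # parametric bootstrap standard errors
   summary(model1)$coef[,2]    # compare with the usual asymptotic standard errors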

Naturally, the same approach could be tried with other regression methods such as those discussed in the EDA and regression tutorial, but careful thought should be given to the parametric model used to generate the new residuals. In the normal case discussed here, the parametric bootstrap is simple but not really necessary, because standard linear regression theory already gives a very good approximation to the joint distribution of the regression coefficients when the errors are homoscedastic and normal. One possible use of this method is in a model that assumes the absolute residuals are exponentially distributed, in which case least absolute deviation regression, as discussed in the EDA and regression tutorial, can be justified. The reader is encouraged to implement a parametric bootstrap using the rq function found in the "quantreg" package.
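
One possible sketch of that exercise appears below; it assumes the quantreg package is installed, uses the default tau=0.5 (least absolute deviation) fit, and takes the mean absolute residual as the scale of the double-exponential errors:
   library(quantreg)
   model3 <- rq(y~x)                       # least absolute deviation fit
   b <- coef(model3)
   s3 <- mean(abs(residuals(model3)))      # scale estimate for the double-exponential errors
   model3P <- matrix(0,200,2)
   for (i in 1:200) {
      ynew <- b[1] + b[2]*x + sample(c(-1,1),92,replace=T)*rexp(92,rate=1/s3)
      model3P[i,] <- coef(rq(ynew~x))
   }
   sqrt(diag(cov(model3P)))                # parametric bootstrap standard errors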


Kolmogorov-Smirnov: Bootstrapped p-values

Earlier, we generated 5000 samples from the null distribution of the t-statistic for testing the null hypothesis that the distribution of MWG luminosities is the same as the distribution of M31 luminosities, shifted by 24.44. Let's do a similar thing for the color comparison between Hyades and non-Hyades:
   tlist2 <- NULL
   all <- c(H,nH)
   for(i in 1:5000) {
      s <- sample(2678,92) # choose 92 of all 2678 stars at random
      tlist2 <- c(tlist2, t.test(all[s],all[-s],
         var.eq=T)$stat) # add t-stat to list
   }
Let's look at two different ways of assessing whether the values in the tlist2 vector appear to be a random sample from a standard normal distribution. The first is graphical, and the second uses the Kolmogorov-Smirnov test.
   plot(qnorm((2*(1:5000)-1)/10000), sort(tlist2))
   abline(0,1,col=2)
   ks.test(tlist2, "pnorm")
The graphical method does not show any major deviation from standard normality, but this graphical "test" is better able to detect departures from an overall normal shape than to detect, say, a shift of the mean away from zero. The following procedures illustrate this fact:
   plot(qnorm((2*(1:5000)-1)/10000), sort(tlist2)-mean(tlist2))
   abline(0,1,col=2)
   ks.test(tlist2-mean(tlist2), "pnorm")
The plot looks essentially the same as before, yet the p-value returned by the Kolmogorov-Smirnov test changes dramatically.

However, let us now consider the p-value returned by the last use of ks.test above. It is not quite valid because the theoretical null distribution against which we are testing depends upon an estimate (the mean) derived from the data. To get a more accurate p-value, we may use a bootstrap approach.

First, obtain the Kolmogorov-Smirnov test statistic from the test above:
   obs.ksstat <- ks.test(tlist2-mean(tlist2),
      "pnorm")$stat
Now we'll generate a new bunch of these statistics under the null hypothesis that tlist2 really represents a random sample from some normal distribution with variance 1 and unknown mean:
   random.ksstat <- NULL
   for(i in 1:1000) {
      z <- rnorm(5000)   # simulated sample (called z so that the x defined earlier is not overwritten)
      random.ksstat <- c(random.ksstat,
         ks.test(z,pnorm,mean=mean(z))$stat)
   }
Now let's look at a histogram of the test statistics and estimate a p-value:
   hist(random.ksstat,nclass=40)
   abline(v=obs.ksstat,lty=2,col=2)
   mean(random.ksstat>=obs.ksstat)
Note that the approximate p-value above is smaller than the p-value reported by the ks.test call on the centered tlist2 earlier, though it is still not small enough to provide strong evidence that the tlist2 sample is not normal with unit variance.

The bootstrap procedure above relied on repeatedly simulating new samples. Since these samples were drawn from a theoretical population (here, a normal distribution) rather than from the data themselves, it is considered a parametric bootstrap procedure.

In a nonparametric bootstrap procedure, the resamples are taken from the empirical distribution of the data (that is, from a distribution that places mass 1/n on each of the n observed values). It is possible to implement a nonparametric bootstrap procedure to calculate a p-value for the Kolmogorov-Smirnov test here, but to do so is a bit tricky. The "straightforward" method for obtaining the null distribution of K-S statistics would be repeated use of the following:
   s <- sample(5000,replace=T)
   ks.test(tlist2[s],pnorm,mean=mean(tlist2[s]))$stat
However, this statistic has a bias that must be corrected. To make this correction, it is necessary to rewrite the ks.test function itself. There is no way to use the output of the ks.test function to produce the bias-corrected statistics.