Nonparametric methods

This tutorial demonstrates some of the nonparametric statistical methods available in R.

Datasets and other files used in this tutorial:

Density estimation

Density estimation is like a histogram: It is a method for visualizing the shape of a univariate distribution (there are methods for doing multivariate density estimation as well, but we will ignore those for the time being). We will consider the globular cluster luminosity dataset, which gives measurements for two different galaxies, Milky Way galaxy (MWG) and Andromeda galaxy (M31). We will focus on the andromeda galaxy measurements only for now.
#   gc  <-  read.csv("http://astrostatistics.psu.edu/datasets/glob_clus.csv")
   gc  <-  read.csv("glob_clus.csv")
   attach(gc)
   and <- lum[gal=="M31"]
   mwg <- lum[gal=="MWG"]
   hist(and)
What happens if we try to use more bins in the histogram?
   hist(and, breaks=100)
Notice that we are starting to see more and more bins that include only a single observation. Taken to its extreme, this type of exercise gives in some sense a "perfect" fit to the data but is useless as an estimator of shape:
   hist(and, breaks=10000)
On the other hand, it is obvious that a single bin would also be completely useless. So we try in some sense to find a middle ground between these two extremes: "Oversmoothing" by using only one bin and "undersmooting" by using too many. This same paradigm occurs for density estimation, in which the amount of smoothing is determined by a quantity called the bandwidth. By default, R uses an optimal (in some sense) choice of bandwidth:
   hist(and, prob=T, main="") # scale y-axis like a density function
   d <- density(and)
   lines(d, col=2, lty=2, lwd=2)
   b <- round(d$bw, 4)
   title(main=paste("Bandwidth =", b), col.main=2)
Now observe what happens when we undersmooth
   lines(density(and, bw=.05), col=3, lwd=2)
   text(17.5, .35, "BW = .05", col=3, cex=1.4)
or oversmooth:
   lines(density(and, bw=3), col=4, lwd=2)
   text(18, .12, "BW = 3", col=4, cex=1.4)
While the choice of bandwidth is very important, the choice of kernel is not:
   hist(and, prob=T, main="") # scale y-axis like a density function
   lines(d, col=2, lty=2, lwd=2)
   lines(density(and, ker="epan"), col=3, lwd=2)
   lines(density(and, ker="rect"), col=4, lwd=2)
   title(main="Gaussian, Epanechnikov, Rectangular")

One-sample nonparametric tests

First, load the Hipparcos dataset and recall the variable names using the names function. By using attach, we can automatically create temporary variables with these names (these variables are not saved as part of the R session, and they are superseded by any other R objects of the same names).
#   hip <- read.table("http://astrostatistics.psu.edu/datasets/HIP_star.dat",
#      header=T, fill=T)
   hip <- read.table("HIP_star.dat", header=T, fill=T)
   names(hip)
   attach(hip)
Let's take a look at the declination values using a boxplot:
   boxplot(DE, notch=T)
Recall that the notch=TRUE argument tells R to add "notches" to the box to indicate (by default) an approximate 95% confidence interval for the population median (actually, this CI is supposed to be used to test a difference of two medians, but it is still revealing in this one-sample case). Let's add a horizontal line at 0:
   abline(h=0, lty=2)
It appears that 0 is not in the CI, so let's see if we can reject the null hypothesis that the median is zero using a simple sign test. We need to count the number of declination measurements (out of 2719) that are greater than zero:
   sum(DE>0)
We see that there are 1419 such measurements. Because this is a two-sided test, the p-value will be two times the probability that a fair coin flipped 2719 times will come up 1419 or more as "heads". This may be found using the binomial distribution function:
   2*(1-pbinom(1418, 2719, .5)) # Do you see why we used 1418?
So the p-value is smaller than .05 and we can reject the null hypothesis that the median equals zero.

Next, we will perform a Wilcoxon signed rank test of the same null hypothesis:
   wilcox.test(DE)
What happened here? We rejected the null hypothesis using a more simplistic test that is known to be less powerful (the sign test) and we failed to reject using a more powerful test (the Wilcoxon signed rank test). What has happened here is the fact that the Wilcoxon, although it makes no parametric assumption, does assume that the distribution is symmetric around its median. This is a bizarre case in which the observations less than zero, though less numerous, tend to be slightly larger in absolute value. Thus, when we add the absolute ranks of the positive values, there are more of them but they tend to be smaller than they would for a symmetric distribution, and these two competing forces almost exactly balance out. Weird!
   hist(DE) # See the non-symmetry?
   sum(rank(abs(DE))[DE>0]) # Same as Wilcox test stat
   2719*2720/4 # Expected value of stat under H0

Two-sample nonparametric tests

In the exploratory data analysis and regression tutorial, we used exploratory techniques to identify 92 stars from the Hipparcos data set that are associated with the Hyades. We did this based on the values of right ascension, declination, principal motion of right ascension, and principal motion of declination. We then excluded one additional star with a large error of parallax measurement:
   filter1 <-  (RA>50 & RA<100 & DE>0 & DE<25)
   filter2 <- (pmRA>90 & pmRA<130 & pmDE>-60 & pmDE< -10) 
   filter  <-  filter1 & filter2 & (e_Plx<5)
   sum(filter)
In this section of the tutorial, we will compare these Hyades stars with the remaining stars in the Hipparcos dataset on the basis of the color (B minus V) variable. That is, we are comparing the groups in the boxplot below:
   color <- B.V
   boxplot(color~filter,notch=T)
For ease of notation, we define vectors H and nH (for "Hyades" and "not Hyades") that contain the data values for the two groups.
   H <- color[filter]
   nH <- color[!filter & !is.na(color)]
In the definition of nH above, we needed to exclude the NA values.

Let us run a two-sample Wilcoxon rank sum test to compare the medians of H and nH. The assumption of this test is that the shapes of the two distributions are the same, but their medians may be different. The null hypothesis is that the medians are equal.
   wilcox.test(H, nH, conf.int=T)
The confidence limits are obtained using the pairwise differences between H and nH. For details, see the Bauer (1972) reference in the R help file for wilcox.test. Of course we reject H0 with such a small p-value, but note that the estimated difference in location is not equal to the difference of the sample medians:
   median(H) - median(nH)
Instead, the estimator is the Hodges-Lehmann estimator of location, which is the median of all pairwise differences:
   median(outer(H, nH, "-"))
Notice how we used the outer "product" of two vectors but substituted the subtraction operation for the multiplication operation.
   sum(outer(H, nH, "-")>0)   
Here is a quotation from the help file for wilcox.test in R:
The literature is not unanimous about the definitions of the Wilcoxon rank sum and Mann-Whitney tests. The two most common definitions correspond to the sum of the ranks of the first sample with the minimum value subtracted or not: R subtracts and S-PLUS does not, giving a value which is larger by m(m+1)/2 for a first sample of size m. (It seems Wilcoxon's original paper used the unadjusted sum of the ranks but subsequent tables subtracted the minimum.)
So the statistic calculated above, 84531, is the sum of the ranks of the Hyades minus m(m+1)/2. We can check this easily:
   sum(rank(c(H, nH))[1:92]) - 92*93/2

Permutation tests

Let's look at another example of a two-sample situation in which we will use a t-statistic but without assuming that it has a t distribution under the null hypothesis. Recall the globular cluster dataset from earlier, which gives luminosity measurements for the Milky Way galaxy (MWG) and the Andromeda galaxy (M31). The t-statistic gives us a very simplistic way of comparing these galaxies, namely, by the mean luminosity of their globular clusters. Because the apparent magnitudes for M31 are offset by the distance modulus of 24.44, our test should compare the difference in sample means with 24.44 instead of the usual zero:
   t.test(lum~gal, mu=24.44) # Run a t-test to get the statistic
The t-statistic is actually the difference in sample means divided by the standard error of this difference (remember that the standard error of an estimator is an estimate of the standard deviation of that estimator). In this case, the standard error and the t-statistic are given by
   sqrt(var(mwg)/length(mwg) + var(and)/length(and))
   (mean(and) - mean(mwg) - 24.44) / .Last.value
Note the use of the .Last.value object, which is always set to the value of the previously evaluated expression.

The t-statistic above (1.6254) is accompanied by a p-value (0.1074). This p-value may be interpreted as follows: IF the two samples really are representative samples from populations whose means differ by 24.44, THEN the probability of obtaining a t-statistic at least as far from zero as 1.6254 would be roughly 0.1074. Since most people don't consider 0.1074 to be a very small probability, the conclusion here is that there is not strong statistical evidence of a difference in true means other than 24.44.

However, the p-value of 0.1074 is an approximation calculated under certain distributional assumptions on the populations. We will ignore this p-value for now, because we wish to use a permutation test to obtain a p-value for this statistic. An alternative method for calculating a p-value corresponding to the t-statistic of 1.6254 is called a permutation test. Conceptually, the permutation test follows directly from the definition (given in the preceding paragraph) of a p-value.

Let us assume that the luminosities for the MWG globular clusters are distributed exactly like the luminosities for the M31 globular clusters, except that the latter are all shifted down by 24.44. Under this assumption (called the null hypothesis), we may add 24.44 to each of the M31 luminosities and then the combined samples are equivalent to one large sample from the common distribution. In other words, if the null hypothesis is true and we randomly permute the labels (81 instances of MWG and 360 of M31), then the original sample is no different than the permuted sample.

By definition, then, the p-value should be equal to the proportion of all possible t-statistics resulting from label permutations that are at least as extreme as the observed t-statistic (1.6254). Since there are far too many such permuations (roughly 1090) for us to make this exact calculation, we will have to settle for a random sample of permutations:
   tlist <- 1:5000
   lum2 <- lum
   lum2[gal=="M31"] <- lum2[gal=="M31"]-24.44
   for(i in 1:5000) {
      s <- sample(441,81)  # choose a sample
      newtstat  <-  t.test(lum2[s], lum2[-s])$stat
      tlist[i] <- newtstat # add new null t-stat to list
   }
Note: The above code is not built for speed!

By definition, the p-value is the probability of obtaining a test statistic more extreme than the observed test statistic under the null hypothesis. Let's take a look at the null distribution of the t statistic we just calculated, along with the observed value:
   hist(tlist)
   abline(v=c(-1,1)*1.6254,lty=2,col=2)
   sum(abs(tlist)>=1.6254)/5000
This p-value is quite close to the 0.1074 obtained earlier.

Empirical distribution functions and Kolmogorov-Smirnov tests

Suppose we are curious about whether a given sample comes from a particular distribution. For instance, how normal is the random sample 'tlist' of t statistics obtained under the null hypothesis in the previous example? How normal (say) are the colors of 'H' and 'nH'?

A simple yet very powerful graphical device is called a Q-Q plot, in which some quantiles of the sample are plotted against the same quantiles of whatever distribution we have in mind. If a roughly straight line results, this suggests that the fit is good. Roughly, a pth quantile of a distribution is a value such that a proportion p of the distribution lies below that value.

A Q-Q plot for normality is so common that there is a separate function, qqnorm, that implements it. (Normality is vital in statistics not merely because many common populations are normally distributed -- which is actually not true in astronomy -- but because the central limit theorem guarantees the approximate normality of sample means.)
   par(mfrow=c(2,2))
   qqnorm(tlist,main="Null luminosity t statistics")
   abline(0,1,col=2)
   qqnorm(H,main="Hyades")
   qqnorm(nH,main="non-Hyades")
Not surprisingly, the tlist variable appears extremely nearly normally distributed (more precisely, it is nearly standard normal, as evidenced by the proximity of the Q-Q plot to the line x=y, shown in red). As for H and nH, the distribution of B minus V exhibits moderate non-normality in each case.

In the bottom right corner of the plotting window, let's reconstruct the Q-Q plot from scratch for tlist. This is instructive because the same technique may be applied to any comparison distribution, not just normal. If we consider the 5000 entries of tlist in increasing order, let's call the ith value the ((2i-1)/10000)th quantile for all i from 1 to 5000. We merely graph this point against the corresponding quantile of standard normal:
   plot(qnorm((2*(1:5000)-1)/10000),sort(tlist))
   par(mfrow=c(1,1)) # reset plotting window
Related to the Q-Q plot is a distribution function called the empirical (cumulative) distribution function, or EDF. (In fact, the EDF is almost the same as a Q-Q plot against a uniform distribution.) The EDF is, by definition, the cumulative distribution function for the discrete distribution represented by the sample itself -- that is, the distribution that puts mass 1/n on each of the n sample points. We may graph the EDF using the ecdf function:
   plot(ecdf(tlist))
While it is generally very difficult to interpret the EDF directly, it is possible to compare an EDF to a theoretical cumulative distribution function or two another EDF. Among the statistical tests that implement such a comparison is the Kolmogorov-Smirnov test, which is implemented by the R function ks.test.
   ks.test(tlist,"pnorm")
   ks.test(H,nH)
Whereas the first result above gives a surprisingly small p-value, the second result is not surprising; we already saw that H and nH have statistically significantly different means. However, if we center each, we obtain
   myks <- ks.test(H-mean(H),nH-mean(nH))
   myks
It appears that the Kolmogorov-Smirnov test finds no statistically significant evidence (p = 0.4108) that the distribution of B.V for the Hyades stars is anything other than a shifted version of the distribution of B.V for the other stars. This would support (or rather, fail not to support) our earlier assumption in the Wilcoxon two-sample test, namely, that H and nH have the same shape but possibly different locations. HOWEVER, one must be very careful here. The p-value found above is computed using the wrong null distribution because this distribution does not account for the two samples being centered so that their sample means are exactly the same; the null hypothesis used to find the p-value assumes only that the population means are exactly the same. To get a better p-value, we could implement a permutation test as discussed earlier:
   ks0 <- 1:1000
   x <- c(H, nH)
   for(i in 1:1000) {
      s <- sample(2678,92)
      tmp1 <- x[s]
      tmp2 <- x[-s]
      ks0[i] =<- ks.test(tmp1-mean(tmp1), tmp2-mean(tmp2))$stat
   }
Now that we have 1000 points sampled at random from the null distribution of the Kolmogorov-Smirnov test statistic, we can approximate a p-value:
   hist(ks0)
   abline(v = myks$stat, col=2)
   sum(ks0 >= myks$stat) / 1000
We do not have the same concerns about a similar test for the luminosities of globular clusters from both the Milky Way and Andromeda:
   ks.test(mwg, and-24.44)
The above test does give a statistically significant difference. The K-S test tests whether the M31 dataset, shifted by 24.44, comes from the same distribution as the MWG dataset. The t-test, on the other hand, only tests whether these distributions have the same mean.