Nonparametric methods
This tutorial demonstrates some of the nonparametric statistical
methods available in R.
Datasets and other files used in this tutorial:
Density estimation
Density estimation is like a histogram: It is a method for visualizing the shape of a
univariate distribution (there are methods for doing multivariate density estimation as
well, but we will ignore those for the time being). We will consider the
globular cluster luminosity dataset, which
gives measurements for two different galaxies,
Milky Way galaxy (MWG) and Andromeda galaxy (M31). We will focus on the andromeda
galaxy measurements only for now.
# gc <- read.csv("http://astrostatistics.psu.edu/datasets/glob_clus.csv")
gc <- read.csv("glob_clus.csv")
attach(gc)
and <- lum[gal=="M31"]
mwg <- lum[gal=="MWG"]
hist(and)
What happens if we try to use more bins in the histogram?
hist(and, breaks=100)
Notice that we are starting to see more and more bins that include only a single observation.
Taken to its extreme, this type of exercise gives in some sense a "perfect" fit to the data but is useless
as an estimator of shape:
hist(and, breaks=10000)
On the other hand, it is obvious that a single bin would also be completely useless. So we try in some
sense to find a middle ground between these two extremes: "Oversmoothing" by using only one bin and
"undersmooting" by using too many. This same paradigm occurs for density estimation, in which
the amount of smoothing is determined by a quantity called the bandwidth. By default, R
uses an optimal (in some sense) choice of bandwidth:
hist(and, prob=T, main="") # scale y-axis like a density function
d <- density(and)
lines(d, col=2, lty=2, lwd=2)
b <- round(d$bw, 4)
title(main=paste("Bandwidth =", b), col.main=2)
Now observe what happens when we undersmooth
lines(density(and, bw=.05), col=3, lwd=2)
text(17.5, .35, "BW = .05", col=3, cex=1.4)
or oversmooth:
lines(density(and, bw=3), col=4, lwd=2)
text(18, .12, "BW = 3", col=4, cex=1.4)
While the choice of bandwidth is very important, the choice of kernel is not:
hist(and, prob=T, main="") # scale y-axis like a density function
lines(d, col=2, lty=2, lwd=2)
lines(density(and, ker="epan"), col=3, lwd=2)
lines(density(and, ker="rect"), col=4, lwd=2)
title(main="Gaussian, Epanechnikov, Rectangular")
One-sample nonparametric tests
First, load the
Hipparcos dataset and recall the variable names using the
names
function. By using
attach,
we can automatically create temporary variables with these names (these variables
are not saved as part of the R session, and they are superseded by any other R objects
of the same names).
# hip <- read.table("http://astrostatistics.psu.edu/datasets/HIP_star.dat",
# header=T, fill=T)
hip <- read.table("HIP_star.dat", header=T, fill=T)
names(hip)
attach(hip)
Let's take a look at the declination values using a boxplot:
boxplot(DE, notch=T)
Recall that the notch=TRUE argument tells R to add "notches" to the box to indicate
(by default) an approximate 95% confidence interval for the population median (actually,
this CI is supposed to be used to test a difference of two medians, but it is still revealing
in this one-sample case). Let's add a horizontal line at 0:
abline(h=0, lty=2)
It appears that 0 is not in the CI, so let's see if we can reject the null hypothesis that the
median is zero using a simple sign test. We need to count the number of declination measurements
(out of 2719) that are greater than zero:
sum(DE>0)
We see that there are 1419 such measurements. Because this is a two-sided test, the p-value will be
two times the probability that a fair coin flipped 2719 times will come up 1419 or more as "heads".
This may be found using the binomial distribution function:
2*(1-pbinom(1418, 2719, .5)) # Do you see why we used 1418?
So the p-value is smaller than .05 and we can reject the null hypothesis that the median
equals zero.
Next, we will perform a Wilcoxon signed rank test of the same null hypothesis:
wilcox.test(DE)
What happened here? We rejected the null hypothesis using a more simplistic test that is known to be
less powerful (the sign test) and we failed to reject using a more powerful test (the Wilcoxon
signed rank test). What has happened here is the fact that the Wilcoxon, although it makes no
parametric assumption, does assume that the distribution is symmetric around its median. This
is a bizarre case in which the observations less than zero, though less numerous, tend to be slightly
larger in absolute value. Thus, when we add the absolute ranks of the positive values, there are more
of them but they tend to be smaller than they would for a symmetric distribution, and these two
competing forces almost exactly balance out. Weird!
hist(DE) # See the non-symmetry?
sum(rank(abs(DE))[DE>0]) # Same as Wilcox test stat
2719*2720/4 # Expected value of stat under H0
Two-sample nonparametric tests
In the
exploratory data analysis and regression
tutorial, we used exploratory techniques to identify 92 stars from the
Hipparcos
data set that are associated with the Hyades. We did this based on the values
of right ascension, declination, principal motion of right ascension,
and principal motion of declination. We then excluded one additional
star with a large error of parallax measurement:
filter1 <- (RA>50 & RA<100 & DE>0 & DE<25)
filter2 <- (pmRA>90 & pmRA<130 & pmDE>-60 & pmDE< -10)
filter <- filter1 & filter2 & (e_Plx<5)
sum(filter)
In this section of the tutorial, we will compare these Hyades stars with the remaining stars in
the Hipparcos dataset on the basis of the color (B minus V) variable.
That is, we are comparing the groups in the boxplot below:
color <- B.V
boxplot(color~filter,notch=T)
For ease of notation, we define vectors H and nH (for
"Hyades" and "not Hyades") that
contain the data values for the two groups.
H <- color[filter]
nH <- color[!filter & !is.na(color)]
In the definition of nH above, we needed to exclude the NA values.
Let us run a two-sample Wilcoxon rank sum test to compare the medians of
H and nH. The assumption of this test is that the shapes of the
two distributions are the same, but their medians may be different. The
null hypothesis is that the medians are equal.
wilcox.test(H, nH, conf.int=T)
The confidence limits are obtained using the pairwise differences between H and
nH. For details, see the Bauer (1972) reference in the R help file for
wilcox.test.
Of course we reject H0 with such a small p-value, but note that the estimated
difference in location is not equal to the difference of the sample medians:
median(H) - median(nH)
Instead, the estimator is the Hodges-Lehmann estimator of location, which
is the median of all pairwise differences:
median(outer(H, nH, "-"))
Notice how we used the outer "product" of two vectors
but substituted the subtraction operation for the
multiplication operation.
sum(outer(H, nH, "-")>0)
Here is a quotation from the help file for wilcox.test in R:
The literature is not unanimous about the definitions of the Wilcoxon rank
sum and Mann-Whitney tests. The two most common definitions correspond to
the sum of the ranks of the first sample with the minimum value subtracted
or not: R subtracts and S-PLUS does not, giving a value which is larger by
m(m+1)/2 for a first sample of size m. (It seems Wilcoxon's original paper used the
unadjusted sum of the ranks but subsequent tables subtracted the minimum.)
So the statistic calculated above, 84531, is the sum of the ranks of the Hyades
minus m(m+1)/2. We can check this easily:
sum(rank(c(H, nH))[1:92]) - 92*93/2
Permutation tests
Let's look at another example of a two-sample situation in which
we will use a t-statistic but without assuming that it has a t
distribution under the null hypothesis.
Recall the globular cluster dataset from earlier,
which gives luminosity measurements for the
Milky Way galaxy (MWG) and the Andromeda galaxy (M31). The t-statistic
gives us a very simplistic way of comparing these galaxies, namely, by
the mean luminosity of their globular clusters. Because the apparent magnitudes
for M31 are offset by the distance modulus of 24.44, our test should compare
the difference in sample means with 24.44 instead of the usual zero:
t.test(lum~gal, mu=24.44) # Run a t-test to get the statistic
The t-statistic is actually the difference in sample means divided by the standard
error of this difference (remember that the standard error of an estimator is an
estimate of the standard deviation of that estimator). In this case, the standard
error and the t-statistic are given by
sqrt(var(mwg)/length(mwg) + var(and)/length(and))
(mean(and) - mean(mwg) - 24.44) / .Last.value
Note the use of the .Last.value object, which is always set to the value of
the previously evaluated expression.
The t-statistic above (1.6254) is accompanied by a p-value (0.1074).
This p-value may be interpreted as follows: IF the two samples really are representative
samples from populations whose means differ by 24.44, THEN the probability of
obtaining a t-statistic at least as far from zero as 1.6254 would be
roughly 0.1074. Since most people don't consider 0.1074 to be a very small
probability, the conclusion here is that there is not strong statistical
evidence of a difference in true means other than 24.44.
However, the p-value of 0.1074 is an approximation
calculated under certain distributional assumptions
on the populations.
We will ignore this p-value for now, because we wish to use a permutation
test to obtain a p-value for this statistic.
An alternative method for calculating a p-value corresponding
to the t-statistic of 1.6254 is called a permutation test. Conceptually, the
permutation test follows directly from the definition (given in the preceding
paragraph) of a p-value.
Let us assume that the luminosities
for the MWG globular clusters are distributed exactly like the luminosities
for the M31 globular clusters, except that the latter are all shifted down by
24.44. Under this assumption (called the null hypothesis), we may add 24.44 to
each of the M31 luminosities and then the combined samples are equivalent to one
large sample from the common distribution. In other words, if the null hypothesis
is true and we randomly
permute the labels (81 instances of MWG and 360 of M31), then the original sample
is no different than the permuted sample.
By definition, then, the p-value should be equal to the proportion of all
possible t-statistics resulting from label permutations that are at least
as extreme as the observed t-statistic (1.6254). Since there are far too
many such permuations (roughly 1090)
for us to make this exact calculation, we will have to settle
for a random sample of permutations:
tlist <- 1:5000
lum2 <- lum
lum2[gal=="M31"] <- lum2[gal=="M31"]-24.44
for(i in 1:5000) {
s <- sample(441,81) # choose a sample
newtstat <- t.test(lum2[s], lum2[-s])$stat
tlist[i] <- newtstat # add new null t-stat to list
}
Note: The above code is not built for speed!
By definition, the p-value is the probability of obtaining a test statistic more extreme than the
observed test statistic under the null hypothesis. Let's take a look at the null distribution of
the t statistic we just calculated, along with the observed value:
hist(tlist)
abline(v=c(-1,1)*1.6254,lty=2,col=2)
sum(abs(tlist)>=1.6254)/5000
This p-value is quite close to the 0.1074 obtained earlier.
Empirical distribution functions and Kolmogorov-Smirnov tests
Suppose we are curious about whether a given sample comes from a particular
distribution. For instance, how normal is the random sample 'tlist' of t statistics
obtained under the null hypothesis in the previous example? How normal (say) are the
colors of 'H' and 'nH'?
A simple yet very powerful graphical device is called a Q-Q plot, in which some quantiles
of the sample are plotted against the same quantiles of whatever distribution we have in mind.
If a roughly straight line results, this suggests that the fit is good. Roughly, a pth quantile of
a distribution is a value such that a proportion p of the distribution lies below that value.
A Q-Q plot for normality is so common that there is a separate function,
qqnorm, that implements it. (Normality
is vital in statistics not merely because many common populations
are normally distributed --
which is actually not true in astronomy -- but because the central limit
theorem guarantees the
approximate normality of sample means.)
par(mfrow=c(2,2))
qqnorm(tlist,main="Null luminosity t statistics")
abline(0,1,col=2)
qqnorm(H,main="Hyades")
qqnorm(nH,main="non-Hyades")
Not surprisingly, the tlist variable appears extremely nearly normally distributed (more precisely,
it is nearly standard normal, as evidenced by the proximity of the Q-Q plot to the line
x=y, shown in red). As for H and nH, the distribution of B minus V exhibits moderate non-normality in each
case.
In the bottom right corner of the plotting window, let's reconstruct the Q-Q plot from scratch for
tlist. This is instructive because the same technique may be applied to any comparison distribution,
not just normal. If we consider the 5000 entries of tlist in increasing order, let's call the ith
value the ((2i-1)/10000)th quantile for all i from 1 to 5000. We merely graph this point against the
corresponding quantile of standard normal:
plot(qnorm((2*(1:5000)-1)/10000),sort(tlist))
par(mfrow=c(1,1)) # reset plotting window
Related to the Q-Q plot is a distribution function called the empirical
(cumulative) distribution function, or
EDF. (In fact, the EDF is almost the same as a Q-Q plot against a uniform distribution.)
The EDF is, by definition, the cumulative distribution function for the
discrete distribution represented by the sample itself -- that is, the distribution that puts mass
1/n on each of the n sample points. We may graph the EDF using the
ecdf
function:
plot(ecdf(tlist))
While it is generally very difficult to interpret the EDF directly, it is possible
to compare an EDF to a theoretical cumulative distribution function or two another EDF.
Among the statistical tests that implement such a comparison is the Kolmogorov-Smirnov
test, which is implemented by the R function
ks.test.
ks.test(tlist,"pnorm")
ks.test(H,nH)
Whereas the first result above gives a surprisingly small p-value,
the second result is not surprising; we already saw that H and nH have
statistically significantly different means. However, if we center each, we obtain
myks <- ks.test(H-mean(H),nH-mean(nH))
myks
It appears that
the Kolmogorov-Smirnov test finds no statistically significant evidence
(p = 0.4108) that
the distribution of B.V for the Hyades stars is anything other than a shifted version of
the distribution of B.V for the other stars. This would support (or rather,
fail not to support) our earlier assumption in the Wilcoxon two-sample test, namely,
that H and nH have the same shape but possibly different locations.
HOWEVER, one must be very careful here. The p-value found above is computed using
the wrong null distribution because this distribution does not account for the two samples
being centered so that their sample means are exactly the same;
the null hypothesis used to
find the p-value assumes only that the population means are exactly the same.
To get a better p-value, we
could implement a permutation test as discussed earlier:
ks0 <- 1:1000
x <- c(H, nH)
for(i in 1:1000) {
s <- sample(2678,92)
tmp1 <- x[s]
tmp2 <- x[-s]
ks0[i] =<- ks.test(tmp1-mean(tmp1), tmp2-mean(tmp2))$stat
}
Now that we have 1000 points sampled at random from the null distribution of the
Kolmogorov-Smirnov test statistic, we can approximate a p-value:
hist(ks0)
abline(v = myks$stat, col=2)
sum(ks0 >= myks$stat) / 1000
We do not have the same concerns about
a similar test for
the luminosities of globular clusters from both the Milky Way and Andromeda:
ks.test(mwg, and-24.44)
The above test does give a statistically significant difference. The
K-S test tests whether the M31 dataset, shifted by 24.44,
comes from the same distribution as the MWG
dataset. The t-test, on the other hand, only tests whether these
distributions have the same mean.