ASCII data input into R
The first step is to copy the file
HIP_star.dat, found at
http://astrostatistics.psu.edu/datasets/HIP_star.html, into some directory and
then use the
setwd command to change the working directory to this directory.
This file gives basic data for 2719 stars from the Hipparcos satellite
with parallaxes in the range 20-25 milliarcseconds. This includes the
Hyades cluster centered at RA=67d, DE=+16 and PAR=22 mas.
The
setwd
command refers to "set working directory". The
read.table command imports
an ASCII table into R. Note: R is
case-sensitive. The up-arrow key can retrieve previous commands,
which may be edited.
setwd("c:/data_directory")
getwd()
list.files()
x = read.table("HIP_star.dat",header=T,fill=T)
Variable
names may not have embedded spaces, and non-alphanumeric characters
(such as "_") are converted to ".". Full astronomical
headers (e.g. FITS, VOTable) should be commented out of the table with comment.char.
Empty cells are
considered to be Not
Available (NA). These can be tested later
using is.na(x),
or removed from consideration using na.omit(x). Mistakes in the data
file are sometimes returned as Not
a Number (NaN).
The write
command is used to create a text file of an object.
Alternatives to the write command include
write.table,
and write.matrix.
write.matrix is most efficient for large datasets, but requires that
you load the MASS package. All of these commands produce text versions of R objects.
To save R objects solely for loading later into R, use the function
save.
R objects that have been saved
can later be reacquired using load.
library(MASS)
write.matrix(x[1:100,],file="First_100_rows.dat")
While R is often run interactively, one often wants to carefully
construct R scripts and run them later. A file containing R code can be run using the source
command. In addition, R may be run in batch mode.
The editor Emacs, together with "Emacs speaks statistics",
provides a nice way to produce R scripts.
The user ends an R session by typing the
q() command. R will then ask
whether to save the workspace image. If the user chooses "yes", then all of the
objects currently in memory are stored in a file called .RData and the commands
typed during the session are appended to the .Rhistory file. These objects are
automatically loaded when R is next invoked from that directory, and the old commands
may be accessed using the up-arrow key.
Characterize the dataset
The following R commands list
the dimensions
of the dataset and print the variable names
(from the single-line header). Then
we list the first row, and list the first 20 rows for the 7th
column. Finally, we find the sum
of the 3rd column.
dim(x)
names(x)
x[1:20,7]
sum(x[,3])
List the
median and
mean absolute deviation
(similar to standard deviation) of each column. First we do
this using a
for-loop,
which is a slow process in R. c is
a generic R function that combines its arguments into a vector. Then we
do it in a more efficient fashion using the apply command. Here
the "2" refers to columns in the x array; a "1" would refer to
rows. print
is a generic R command that prints the result.
A vector can be sorted
using the Shellsort or Quicksort
algorithms; and order
returns a vector of indices that will sort a vector. The latter is often more useful
than sort because it allows one to reorder all
of the rows of a matrix according to one of the columns:
for(i in 1:ncol(x)) {
print(c( median(x[,i]), mad(x[,i])))
}
apply(x, 2, median)
apply(x, 2, mad)
sort(x[1:10,3])
x[order(x[1:10,3]),]
Learning R syntax
Arithmetic
in R is straightforward. Some common operators are: + for
addition, - for subtraction, * for multiplication, / for division, %/%
for integer division, %% for modular arithmetic, ^ for exponentiation.
The help page for these operators may accessed by typing, say,
?'+'
Some common
built-in functions are exp
for the exponential function, sqrt
for square root, log10
for base-10 logarithms, and cos
for cosine. The syntax resembles "sqrt(z)". Comparisons
are made using < (less than), <= (less than or equal), ==
(equal to) with the syntax "a >= b". To test whether a and b
are exactly equal and return a single TRUE/FALSE value (for instance, in an
"if" statement), use the command identical(a,b)
rather a==b. Compare the following two ways of comparing the vectors a and b:
a=c(1,2);b=c(1,3)
a==b
identical(a,b)
Also note that in the above example, 'all(a==b)' is equivalent to 'identical(a,b)'.
R uses the "=" sign much like the language C does: A single "=" means
assignment and a double "==" means comparison. An alternative assignment
operator, "<-", remains from the olden days of S and is still used quite
frequently by programmers to make their code readable. This does create a
minor irritation that is, fortunately, also rare: Note the different
meanings of the following two expressions, which differ only by a space:
y <- 7 # assign the value 7 to
y
y < - 7 # logical; depends on
whether y is less than -7
Let's continue with simple characterization of the dataset: find the
row number of the object with the smallest value of
the 4th column using which.min. A longer, but instructive, way to
accomplish this task creates a long vector of logical constants (tmp),
mostly FALSE with one TRUE, then pick out the row with "TRUE".
which.min(x[,4])
tmp= (x[,4]==min(x[,4]))
(1:nrow(x))[tmp] # or equivalently,
which(tmp)
The cut
function divides the range of x into intervals and codes the values of
x according to which interval they fall. It this is a quick way
to group a vector into bins. Use the "breaks" argument to either
specify a vector of bin boundaries, or give the number of intervals
into which x should be cut. Bin string labels can be
specified. Cut converts numeric vectors into an R object of
class "factor"
which can be ordered and otherwise manipulated; e.g. with command levels.
A more flexible method for dividing a vector into groups using
user-specified rules is given by split.
table(cut(x[,"Plx"],breaks=20:25))
The command above uses several tricks. Note that a column in a matrix may be referred to by its
name (e.g., "Plx") instead of its number. The notation '20:25' is short for 'c(20,21,22,23,24,25)'
and in general, 'a:b' is the vector of consecutive
integers starting with a and ending with b (this also works
if a is larger than b). Finally, the
table command
tabulates the values in a vector or factor.
We may be interested in the blank cells for which R has inserted the value NA.
The following commands count the total number of NAs and print all rows that
contain any NA.
sum(is.na(x))
f = function(a)
any(is.na(a))
tmp=apply(x,1,f)
x[tmp,]
Note that the last three lines of code may be combined into one:
x[apply(x,1,function(a)
any(is.na(a))),]
Univariate
explorations
Load the Hipparcos dataset found at
http://astrostatistics.psu.edu/datasets/HIP_star.html and obtain a
variable-by-variable summary:
hip = read.table("http://astrostatistics.psu.edu/datasets/HIP_star.dat",
header=T,fill=T)
summary(hip)
Recall the variable names in this dataset using the
names
function. By using
attach,
we can automatically create temporary variables with these names (these variables
are not saved as part of the R session, and they are superseded by any other R objects
of the same names).
names(hip)
attach(hip)
After using the attach command, we can obtain, say, individual
summaries
of the variables:
summary(Vmag)
summary(B.V)
Next, summarize some of this information graphically using
box-and-whisker plots for the four variables Vmag, pmRA, pmDE, and B.V (the last variable
used to be B-V, or B minus V, but R does not allow certain characters). These
are the 2nd, 6th, 7th, and 9th columns of 'hip'.
par(mfrow=c(2,2))
for(i in c(2,6,7,9))
boxplot(hip[,i],main=names(hip)[i])
par(mfrow=c(1,1))
The
boxplot
command does more than produce plots; it also returns output that can be
more closely examined. Below, we produce boxplots
and save the output.
b=boxplot(hip[,c(2,6,7,9)])
names(b)
'b' is an object called a list. To understand its contents, read the help for
boxplot. Suppose we wish to see
all of the outliers in the pmRA variable, which is the second of the four variables
in the current boxplot:
b$names[2]
b$out[b$group==2]
Bivariate
plots
We begin by producing a
scatter plot.
plot(Vmag,B.V, pch=".")
Let's now produce the same graph but selecting just for Hyades stars. This open cluster
should be concentrated both in the sky coordinates RA and DE, and also in the proper
motion variables pm_RA and pm_DE. We start by noticing a concentration of stars in the
RA distribution:
plot(Vmag,RA,pch=".")
x1= (RA>50 & RA<100)
Next, we select according to the proper motions. (As our cuts through the data
are parallel to the axes, this variable-by-variable classification approach
is very similar to Classification and Regression
Trees, or CART.)
plot(pmRA[x1],pmDE[x1],pch=20, xlim=c(0,200),ylim=c(-200,100))
x2=(pmRA>90 & pmRA<130)
x3=(pmDE>-60 & pmDE< -10) # Space in '< -' is necessary!
x4=x1&x2&x3
After our selection,
we replot the HR diagram of Vmag vs. B.V. This shows the Zero Age Main Sequence, plus four
red giants, with great precision. Outliers in the original Hipparcos dataset have
been effectively removed. However, let's have a final look at the stars we have
identified using the
pairs
command to produce all bivariate plots for pairs of variables.
We'll exclude the first and fifth columns (the HIP identifying number
and the parallax, which is already known to lie in a narrow band).
pairs(hip[x4,-c(1,5)],pch=20)
Note that indexing a matrix or vector using negative integers has the effect of
excluding the corresponding entries.
We see that there is one outlying star in the DE variable, indicating that this
star is not actually one of the Hyades, and one outlying star in the e_Plx variable,
indicating that its measurements are not reliable. We exclude both points:
x5=x4 & (DE>0) & (e_Plx<5)
pairs(hip[x5,-c(1,5)],pch=20)
Note that the x5 variable, a vector of TRUE and FALSE, may be summed to reveal the
number of TRUE's.
sum(x5)
As a final look at these data, let's consider the original plot of Vmag versus B.V
but make the 92 stars we just identified look bigger (pch=20 instead of 46) and
color them red (col=2 instead of 1):
plot(Vmag,B.V,pch=c(46,20)[1+x5],
col=1+x5)
Linear
least-squares regression
Consider the relationship between DE and pmDE among the 92 stars identified above:
plot(DE[x5],pmDE[x5],pch=20)
If we wish to fit a linear regression (least squares) line to this plot, we may do
so using the
lm
(linear model) function. Note the "response ~ predictor(s)" format used
in formulas for functions like
lm .
m=lm(pmDE[x5] ~ DE[x5])
abline(m,lwd=2,col=2)
The m object just created is an object of class lm. The class of
an object in R can help to determine how it is treated by functions such as
print
and
summary.
m # same as print(m)
summary(m)
There is a lot of information contained in m that is not displayed by
print
or
summary:
names(m)
plot(m$fit,m$resid)
The residual plot produced above reveals no irregularities. Note that when referring
to the items in a list by name, as in m$fit or m$resid above, it is only necessary
to type enough of the name to uniquely identify it.
Other types of
regression
Let's try a non-linear fit, given by
loess or
lowess:
m2=lowess(DE[x5], pmDE[x5])
lines(m2,col=4,lwd=2)
We can also try resistant regression, using the
lqs
function in the MASS package.
library(MASS)
m3=lqs(pmDE[x5]~DE[x5])
abline(m3,col=5, lwd=2)
Finally, let's use least absolute deviation regression. To do this, we'll
use a function called
rq
(regression quantiles)
in the "quantreg" package. This package is
not part of the standard distribution of R, so we'll need to download it.
In order to do this, we must tell R where to store the installed library
using the
install.packages
function.
install.packages("quantreg",lib="c:/",
CRAN="http://lib.stat.cmu.edu/R/CRAN")
library(quantreg, lib.loc="c:/")
Assuming the quantreg package is loaded, we may now compare the least-squares fit (red)
with the least absolute deviations fit (green). In this example, the two
fits are nearly identical:
m4=rq(pmDE[x5]~DE[x5])
abline(m7,col=6, lwd=2)
Suppose we wish to simulate a large number of normal random variables with mean 10 and
standard deviation 3, then check a histogram against
two normal density functions, one based on the true parameters and one based on estimates, to
see how it looks. We'll use 'col=2, lty=2, lwd=3' to make the curve based on the true parameters
red (color=2), dashed (line type=2), and wider than normal (line width=3). Also note that
we are requesting 100 bins in the histogram (nclass=100) and putting it on the same vertical scale
as the density functions (freq=FALSE).
z=rnorm(200000, mean=10, sd=3)
hist(z,freq=FALSE,nclass=100)
x=seq(min(z),max(z),len=200)
lines(x,dnorm(x, mean=10, sd=3),col=2, lty=2, lwd=3)
lines(x,dnorm(x,mean=mean(z),sd=sqrt(var(z))))
We can find out what proportion of the deviates lie outside 3 standard deviations from the true mean,
a common cutoff used by physical scientists. We can also see the true theoretical proportion:
sum(abs((z-10)/3)>3)/length(z)
2*pnorm(-3)
In the first line above, we are using
sum
to count the number of TRUE's in the logical vector (abs((z-10)/3)>3).
This works because logical values are coerced to 0's and 1's when necessary.
The power-law or Pareto distribution
A commonly used distribution in astrophysics is the power-law distribution,
more commonly known in the statistics literature as the
Pareto distribution. There are no built-in R functions for dealing with this
distribution, but because it is an extremely simple distribution it is easy
to write such functions.
The density function for the Pareto is f(x)=aba/x(a+1) for x>b.
Here, a and b are fixed positive parameters, where
b is the minimum possible value.
As an example, consider the log N = -1.5 * log S
relationship, where S is the apparent brightness and N
is the number of standard candles randomly located in transparent space. Thus, a Pareto
distribution with (a+1) = 1.5 is a reasonable, if simplistic,
model for the brightness of observed standard
candles in space. The b parameter merely reflects the choice of units of measurement.
It turns out that a Pareto random variable is simply b*exp(X), where
X is an
exponential
random variable with rate=a (i.e., with mean=1/a).
However, rather than exploiting this simple relationship,
we wish to build functions for the Pareto distribution from
scratch. Our default values, which may be changed by the user, will be a=0.5 and b=1.
dpareto=function(x, a=0.5, b=1) a*b^a/x^(a+1)
Next, we integrate the density function to obtain the distribution function,
which is
F(x)=1-(b/x)a for x>=b (and naturally F(x)=0 for x < b):
ppareto=function(x, a=0.5, b=1) (x > b)*(1-(b/x)^a)
Note that (x > b) in the above function
is coerced to numeric, either 0 or 1.
Inverting the distribution function gives the quantile function. The following
simplistic function is wrong unless 0<u<1, so a better-designed function should do some
error-checking.
qpareto=function(u, a=0.5, b=1) b/(1-u)^(1/a)
Finally, to simulate random Pareto random variables, we use the fact that whenever the quantile function
is applied to a uniform random variable, the result is a random variable with the desired distribution:
rpareto=function(n, a=0.5, b=1) qpareto(runif(n),a,b)
Creating functions in R, as illustrated above, is a common procedure. Note that each of the arguments of
a function may be given a default value, which is used whenever the user calls the function without specifying
the value of this parameter. Also note that each of the above functions consists of only a single line; however,
longer functions may be created by enclosing them inside curly braces { }.
A few simple plots
The commands below create plots related to the four functions just created.
par(mfrow=c(2,2))
x=seq(1,50,len=200)
plot(x,dpareto(x),type="l")
plot(x,ppareto(x),type="l",lty=2)
u=seq(.005,.9,len=200)
plot(u,qpareto(u),type="l",col=3)
z=rpareto(200)
dotchart(log10(z), main="200 random logged Pareto deviates",
cex.main=.7)
par(mfrow=c(1,1))
The above commands illustrate some of the many plotting capabilities of R.
The par
function sets many graphical
parameters, for instance, 'mfrow=c(2,2)', which divides the plotting window into a matrix
of plots, set here to two rows and two columns. In the
plot
commands, 'type' is set here to
"l" for a line plot; other common options are "p" for points (the default),
"b" for connected dots, and "n" for nothing (to create
axes only). Other options used:
'lty' sets the line type (1=solid, 2=dashed, etc.), 'col' sets color (1=black, 2=red,
3=green, etc.), 'main' puts a string into the plot title, and 'cex.main' sets the text magnification.
Type
'? par'
to see a list of the many plotting parameters and options.
Earlier, we used exploratory techniques to identify 92 stars from the
Hipparcos
data set that are associated with the Hyades. Now, we will compare these Hyades stars with the remaining stars in
the Hipparcos dataset on the basis of the B.V (B minus V) variable.
That is, we are comparing the groups in the boxplot below:
boxplot(B.V~x5,notch=T,ylab="B minus V")
For ease of notation, we define vectors H and nH (for
"Hyades" and "not Hyades") that
contain the data values for the two groups.
H=B.V[x5]
nH=B.V[!x5 & !is.na(B.V)]
In the definition of nH above, we needed to exclude the NA values (there are no NAs among
the 92 Hyades stars here).
A two-sample t-test may be performed with a single line:
t.test(H,nH)
For purposes of illustration, we may obtain the same results
without resorting to the
t.test
function:
v1=var(H)/92
v2=var(nH)/2586
tstat=(mean(H)-mean(nH))/sqrt(v1+v2)
tstat
To test the null hypothesis, this t statistic is compared to a t distribution. In a Welch
test, we assume that the variances of the two populations are not necessarily equal, and
the degrees of freedom of the t distribution are computed using the so-called
Satterthwaite approximation:
(v1 + v2)^2 / (v1^2/91 + v2^2/2585)
The two-sided p-value is now given by:
2*pt(tstat,97)
In this particular example, the Welch test is probably not necessary, since
the sample variances are so close that an assumption of equal variances is warranted.
Thus, we might instead use:
t.test(H,nH,var.equal=T)
Empirical distribution
functions
The empirical distribution function (EDF) of a sample is, by definition, the cumulative distribution function for the
discrete distribution represented by the sample itself -- that is, the distribution that puts mass
1/n on each of the n sample points. We may graph the EDF using the
ecdf
function:
plot(ecdf(H))
While it is generally very difficult to interpret the EDF directly, it is possible
to compare an EDF to a theoretical cumulative distribution function or two another EDF.
Among the statistical tests that implement such a comparison is the Kolmogorov-Smirnov
test, which is implemented by the R function
ks.test.
ks.test(tlist,"pnorm")
ks.test(H,nH)
Whereas the first result above can be somewhat surprising (a small p-value means a
statistically significant difference),
the second result should not be; we already saw that H and nH have
statistically significantly different means. However, if we center each, we obtain
ks.test(H-mean(H),nH-mean(nH))
In other words, the Kolmogorov-Smirnov test finds no statistically significant evidence that
the distribution of B.V for the Hyades stars is anything other than a shifted version of
the distribution of B.V for the other stars. We'll say more about this
later.
Chi-squared
tests for categorical data
We begin with some boxplots of the Vmag variable broken into groups according to
the B minus V variable:
bvcat=cut(B.V, breaks=c(-Inf,.5,.75,1,Inf))
boxplot(Vmag~bvcat, varwidth=T,
ylim=c(max(Vmag),min(Vmag)),
xlab=expression("B minus V"),
ylab=expression("V magnitude"),
cex.lab=1.4, cex.axis=.8)
The cut values for the newly created categorical variable bvcat are based roughly on the quartiles of the B.V variable.
Recall that x5, the Hyades indicator, was created earlier. Here is
a summary of the dataset based only on these two variables:
table(bvcat,x5)
Note that the Vmag variable is irrelevant in the table above.
To perform a chi-squared test of the null hypothesis that the true population
proportions falling in the four categories are the same for both the Hyades and
non-Hyades stars, use the
chisq.test
function:
chisq.test(bvcat,x5)
Since we already know (from the t-test above) that these two groups differ with respect to the B.V variable,
the result of this test is not too surprising. But it does give a qualitatively different
way to compare these two distributions than simply comparing their means.
The p-value produced above is based on the fact that the chi-squared statistic is
approximately distributed like a true chi-squared distribution (on 3 degrees of freedom,
in this case) if the null hypothesis is true. However, it is possible to obtain
exact p-values, if one wishes to calculate the chi-squared statistic for all possible
tables of counts with the same row and column sums as the given table. Since this is rarely
practical computationally, the exact p-value may be approximated using a Monte Carlo method.
Such a method is implemented in the
chisq.test
function:
chisq.test(bvcat,x5,sim=T,B=50000)
Thus, the first method produces the exact value of an approximate p-value, whereas the second
method produces an approximation to the exact p-value!
The test above is usually called a chi-squared test of homogeneity. If we observe
only one sample, but we wish to test whether the categories occur in some pre-specified
proportions, a similar test (and the same R function) may be applied. In this case, the test
is usually called a chi-squared test of goodness-of-fit.
The goal of principal components is to reduce the number of variables by finding the "best" linear combinations of
all existing variables.
To understand what is "best" in this context,
consider the 22 quantitative measurement columns in the
quas dataset (the first column is the SDSS designation of the object, not a
quantitative measurement). Each
row may be considered to be a point in 22-dimensional Euclidean space. Thus,
the entire dataset consists of a cloud of 279 22-dimensional points. The
"best" linear combination here will be the single vector in 22-space parallel to
which the variance of these 279 points is the greatest. The second-best
will be the single vector orthogonal to the first along which the variance is
the greatest, and so on.
To create a single object containing all the principal components
information you will need (after eliminating the first column), type
pc=princomp(quas[,-1])
Let's see what kind of information is carried in pc.
names(pc)
?princomp
The most important bits, namely pc$sdev and pc$loadings, are obtained directly
from an eigenvalue/eigenvector decomposition of the sample covariance matrix.
If one invokes the princomp command with cor=TRUE, then the
eigen decomposition is performed on the correlation matrix rather than the covariance matrix. Which method is more
appropriate in this case? To answer this question, let's examine
the standard deviations of the columns of quas:
sqrt(apply(quas[,-1],2,var))
Note that the variation of the R.A and Dec. columns is far larger than that of any other
column. Thus, we should not be surprised if these two columns dominate the first two principal
components. In fact, since these two columns together with z give position of the object, we
might want to extract them from the principal components analysis altogether, retaining them
unchanged in the reduced data set. However, we could essentially put all variables on
an equal footing in terms of variability by using the correlation rather than the covariance
(this is equivalent to standardizing each of the variables to have standard deviation equal
to 1 before performing the princomp analysis).
The two most important pieces of information in a principal components analysis are the
variances explained (eigenvalues) and variable loadings (eigenvectors). The former may
be viewed graphically using a technique called a
screeplot:
screeplot(pc)
In the above plot, we see that the variance of the first two PCs dwarfs the others. To
see what this means, we must look at the loadings for these first two PCs:
loadings(pc)
This last command prints a lot of information. Scroll up to see the loadings of components 1
and 2, with any loadings less than 0.1 in absolute value suppressed as unimportant. (In reality,
the loadings for the first principal component are a vector of real numbers, scaled so that
the sum of their squares equals 1. Each element in the vector gives the relative weight of
the corresponding variable.)
To see all of the loadings for the first principal component (and only those), type
pc$load[,1]
We may conclude from the output above that the first principal component consists
essentially of nothing other than R.A (recall that the standard deviation of R.A was much larger than
that of the other variables, so this result is really not surprising).
It is also unsurprising to see that the second principal component consists almost entirely of the
Declination:
pc$load[,2]
These two principal components together comprise over 99.8% of the total variance of these variables, which
makes it difficult to see easily the effect of the remaining principal components. As explained earlier,
one way to deal with the problem of variables on vastly different scales is by analyzing the correlation
matrix rather than the covariance matrix. However, in this case, the two variables causing the trouble
are easy to identify; thus, we'll proceed by performing a new principal components analysis on the remaining
columns of quas after R.A and Dec. are removed:
pc2=princomp(quas[,-(1:3)])
screeplot(pc2)
In the new screeplot, we see three or
four PCs with relatively large variance, one to four PCs with moderate variance, and the
rest with relatively small variance. Let's see what the variable loadings for these first five PCs
are:
loadings(pc2)
Again, it is necessary to scroll up to see the important output.
Examining these loadings, the first PC is somewhat difficult to interpret, but the second is
basically an average of all the "_mag" variables. Notably, the three variables (u_mag, g_mag, r_mag)
always occur with roughly the same weights in the first few PCs, indicating that we may replace these
three with a single variable equal to their mean. The same is true of (i_mag, z_mag) and
(J_mag, H_mag, K_mag). We could thus reduce these 8 variables to 3.
Another approach is to analyze only the principal component
scores themselves, which are contained in pc$scores. This 279x22 matrix contains exactly the
same information as the original dataset, but the axes have been rotated so that the first axis is
the most important in explaining information, followed by the second, etc. Based on our analysis,
only 5 or 6 of these PCs should be very variable.
pairs(pc$scores[,1:6],pch=".")
The drawback to the above plots, of course, is that many of them are difficult to interpret.
In summary, principal components provides an objective way to decide, based on data alone,
how to reduce the dimensionality of a dataset to ease interpretability. However,
substantive astronomical knowledge should be at least as important as
such considerations (e.g., if M_i is known to be important, then maybe it should be kept
regardless of what PC analysis says).
Nonparametric bootstrapping
of regression standard errors
Here, we revisit the
Hipparcos
data set and the 92 stars identified as Hyades stars.
Earlier, we considered a linear model for the relationship
between DE and pmDE among these 92 stars:
plot(DE[x5],pmDE[x5],pch=20)
m=lm(pmDE[x5] ~ DE[x5])
abline(m,lwd=2,col=2)
The red line on the plot is the usual least-squares line, for which estimation is
easy and asymptotic theory gives easy-to-calculate standard errors for the
coefficients:
summary(m)$coef
However, suppose we wish to use a resistant regression method such as
lqs.
library(MASS)
m2=lqs(pmDE[x5] ~ DE[x5])
abline(m2,lwd=2,col=3)
m2
In this case, it is not so easy to obtain standard errors for the coefficients.
Thus, we will turn to bootstrapping. In a standard, or nonparametric, bootstrap,
we repeatedly draw samples of size 92 from the empirical distribution of the data, which
in this case consist of the (DE, pmDE) pairs. We use
lqs
to fit a line to each sample, then compute the sample covariance of the resulting
coefficient vectors. The procedure works like this:
x=DE[x5]
y=pmDE[x5]
m2B=NULL
for (i in 1:200) {
s=sample(92,replace=T)
m2B=rbind(m2B, lqs(y[s]~x[s])$coef)
}
We may now find the sample covariance matrix for m2B. The (marginal) standard errors
of the coefficients are obtained as the square roots of the diagonal entries of this matrix:
cov(m2B)
se=sqrt(diag(cov(m2B)))
se
The logic of the bootstrap procedure is that we are estimating an approximation
of the true standard errors. The approximation involves replacing the true distribution
of the data (unknown) with the empirical distribution of the data. This approximation
may be estimated with arbitrary accuracy by a Monte Carlo approach, since the empirical
distribution of the data is known and in principle
we may sample from it as many times as we wish. In other words, as the bootstrap sample
size increases, we get a better estimate of the true value of the approximation.
On the other hand, the quality of this approximation depends on the original sample size (92, in
our example) and
there is nothing we can do to change it.
An alternative way to generate a bootstrap sample
in this example is by generating a new value of each response variable (y) by adding the predicted
value from the original lqs model to a randomly selected residual from the original set of residuals.
Thus, we resample not the entire bivariate structure but merely the residuals. As an exercise, you
might try implementing this approach in R. Note that this approach is not a good idea if you have reason
to believe that the distribution of residuals is not the same for all points. For instance, if there is
heteroscedasticity or if the residuals contain structure not explained by the model, this residual resampling
approach is not warranted.
Using the
boot
package in R
There is a
boot
package in R that contains many functions relevant to bootstrapping.
As a quick example, we will show here how to obtain the same kind of bootstrap example
obtained above (for the lqs model of pmDE regressed on DE for the Hyades stars.)
library(boot)
mystat=function(a,b)
lqs(a[b,2]~a[b,1])$coef
m2B2 = boot(cbind(DE[x5],pmDE[x5]),
mystat, 200)
names(m2B2)
As explained in the help file, the
boot
function requires as input a function that accepts as arguments the whole dataset
and an index that references an observation from that dataset. This is why we defined
the mystat function above.
To see the output that is similar to that obtained earlier for the m2B object, look in
m2B2$t:
cov(m2B2$t)
sqrt(diag(cov(m2B2$t)))
Compare with the output provided by
print.boot and the plot produced by
plot.boot:
m2B2
plot(m2B2)
Another related function, for producing bootstrap confidence intervals, is
boot.ci.
Parametric bootstrapping
of regression standard errors
We now return to the regression problem studied earlier.
Sometimes, resampling is done from a theoretical distribution rather than
from the original sample. For instance, if simple linear regression is applied to the
regression of pmDE on DE, we obtain a parametric estimate of the distribution of the residuals,
namely, normal with mean zero and standard deviation estimated from the regression:
summary(m)
Remember that m was defined above as lm(pmDE[x5]~DE[x5]). We observe a residual standard
error of 4.449.
A parametric bootstrap scheme proceeds by simulating a new set of pmDE (or y) values using the
model
y = 21.9 - 3.007*DE[x5] + 4.449*rnorm(92)
Then, we refit a linear model using y as the new response,
obtaining slightly different values of the regression coefficients.
If this is repeated, we obtain an approximation of the joint distribution of the regression coefficients
for this model.
Naturally, the same approach could be tried with other regression methods such as those discussed
in the
EDA and regression module,
but careful thought should be
given to the parametric model used to generate the new residuals. In the normal case discussed here,
the parametric bootstrap is simple, but it is really not necessary
because standard linear regression already gives a very good approximation to the joint distribution
of the regression coefficients when errors are heteroscedastic and normal.
One possible use of this method is in a model that assumes the absolute residuals are exponentially
distributed, in which case least absolute deviation regression as discussed in the
EDA and regression module can be justified. The reader is encouraged to
implement a parametric bootstrap using the
rq
function found in the "quantreg" package.
Estimating
a p-value (parametric bootstrap for Kolmogorov-Smirnov)
Earlier, we conducted a t-test of the null hypothesis that the 92 Hyades stars
have the same mean of B minus V as the 2586
non-Hyades stars. We can test a related hypothesis without relying on the
normal approximation of the t-test using a permutation test. Because there are 2678!/(92!2586!)=3.78x10142
different ways to select 92 objects from 2678 objects, we have to depend on a
random sample of such selections:
H=B.V[x5]
nH=B.V[!x5 & !is.na(B.V)]
tlist=NULL
all=c(H,nH)
for(i in 1:5000) {
s=sample(2586,92) # choose a sample
tlist=c(tlist, t.test(all[s],all[-s],
var.eq=T)$stat) # add t-stat to list
}
Even though the sample in tlist appears to be very close to standard normal according to a
histogram, the Kolmogorov-Smirnov test rejects the null hypothesis of standard normality:
ks.test(tlist,"pnorm")
However, if we use a Kolmogorov-Smirnov test in which we allow the mean of the theoretical
normal distribution to be estimated from the sample, we see no evidence of a departure from
normality according to the usual output:
newkstest=ks.test(tlist,"pnorm",mean=mean(tlist))
newkstest
Note in the above function call that
ks.test
passes the "mean=" option directly to the
pnorm
function. The
ks.test
function does not know beforehand how many, if any, options might be passed
to the function it requires; therefore, the function definition of
ks.test
uses three dots (...) in its option list to stand for some unspecified number of options.
Unfortunately, the Kolmogorov-Smirnov test is not quite valid if the theoretical distribution
against which we are testing depends upon estimates derived from the data. (Imagine the extreme
case: We could test the sample distribution against itself, which is simply an estimate derived from
the data!) Therefore, we can get a better sense of how valid the second p-value is by using a
Monte Carlo approach.
To carry out the Monte Carlo test, observe that the stat statistic for our K-S test by typing
newkstest$stat
The value that we seek is the probability that X is at least as large as this observed statistic,
where X is the random test statistic generated when we select a sample of size 5000 from the null
distribution:
ksstat=NULL
for(i in 1:1000) {
x=rnorm(5000)
ksstat=c(ksstat,
ks.test(x,pnorm,mean=mean(x))$stat)
}
Now let's look at a histogram of the test statistics and estimate a p-value:
hist(ksstat,nclass=40)
ns=newkstest$stat
abline(v=ns,lty=2,col=2)
mean(ksstat>=ns)
Note that the approximate p-value above is smaller than the original
p-value reported by newkstest,
though it is still not small enough to provide strong evidence that the
tlist sample is not normal with unit variance.
We conclude this section by briefly explaining the parenthetical
remark "parametric bootstrap for Kolmogorov-Smirnov" in its title and
discussing a related nonparametric bootstrap method.
The above procedure is equivalent to a parametric bootstrap procedure
in which we draw samples not from a standard normal, but from a normal
distribution with mean equal to mean(tlist). This is because the "mean=mean(x)"
option implies that we are applying
a particular K-S test that is invariant to the mean of the distribution.
An interesting exercise is to implement a nonparametric bootstrap
in this example that would be based only on resampling. In this case,
we cannot merely repeatedly use
s=sample(5000,replace=T)
ks.test(tlist[s],pnorm,mean=mean(tlist[s]))$stat
The reason is because this statistic is biased. Therefore, it is necessary to correct for
the bias. Unfortunately, this is not a trivial correction and it essentially requires
a modified version of ks.test: Whereas the ordinary ks.test statistic is based on the maximum
distance between the EDF of the sample and the theoretical CDF, the modified version must
subtract off the bias before finding the maximum.