R:  A powerful public software environment for statistical analysis of astronomical data
 
R is the most comprehensive public-domain statistical software available today.  An implementation of the S language for statistical analysis, it is increasingly popular in the statistical community. It takes only a few minutes to obtain:  go to http://www.R-project.org , choose a CRAN mirror site, and download the executable appropriate for your operating system.
 
The following tutorial is an abridged version of a weeklong set of tutorials originally presented at the 2005 Summer School in Statistics for Astronomers and Physicists at Penn State University. 
 
 
 
Start learning R with an astronomical dataset


ASCII data input into R

The first step is to copy the file HIP_star.dat, found at http://astrostatistics.psu.edu/datasets/HIP_star.html, into some directory and then use the setwd command to change the working directory to this directory.  This file gives basic data for 2719 stars  from the Hipparcos satellite with parallaxes in the range 20-25 milliarcseconds.  This includes the Hyades cluster centered at RA=67d, DE=+16 and PAR=22 mas.  

The setwd command refers to "set working directory".  The read.table command imports an ASCII table into R.  Note: R is case-sensitive. The up-arrow key can retrieve previous commands, which may be edited.

   setwd("c:/data_directory")
   getwd()
   list.files()
   x = read.table("HIP_star.dat",header=T,fill=T)

Variable names may not have embedded spaces, and non-alphanumeric characters (such as "_") are converted to ".".  Full astronomical headers (e.g. FITS, VOTable) should be commented out of the table with comment.char.  Empty cells are considered to be Not Available (NA).  These can be tested later using is.na(x), or removed from consideration using na.omit(x).  Mistakes in the data file are sometimes returned as Not a Number (NaN).

The write command is used to create a text file of an object. Alternatives to the write command include write.table, and write.matrix.  write.matrix is most efficient for large datasets, but requires that you load the MASS package.  All of these commands produce text versions of R objects. To save R objects solely for loading later into R, use the function save. R objects that have been saved can later be reacquired using load.

   library(MASS)
   write.matrix(x[1:100,],file="First_100_rows.dat")
 

While R is often run interactively, one often wants to carefully construct R scripts and run them later.  A file containing R code can be run using the source command. In addition, R may be run in batch mode.  The editor Emacs, together with "Emacs speaks statistics", provides a nice way to produce R scripts.

The user ends an R session by typing the q() command.  R will then ask whether to save the workspace image. If the user chooses "yes", then all of the objects currently in memory are stored in a file called .RData and the commands typed during the session are appended to the .Rhistory file. These objects are automatically loaded when R is next invoked from that directory, and the old commands may be accessed using the up-arrow key.

Characterize the dataset

The following R commands list the dimensions of the dataset and print the variable names (from the single-line header).  Then we list the first row, and list the first 20 rows for the 7th column.  Finally, we find the sum of the 3rd column. 

   dim(x)
   names(x)
   x[1:20,7]
   sum(x[,3])

List the median and mean absolute deviation (similar to standard deviation)  of each column.  First we do this using a for-loop, which is a slow process in R.  c is a generic R function that combines its arguments into a vector. Then we do it in a more efficient fashion using the apply command.  Here the "2" refers to columns in the x array; a "1" would refer to rows.  print is a generic R command that prints the result. A vector can be sorted using the Shellsort or Quicksort algorithms; and order returns a vector of indices that will sort a vector. The latter is often more useful than sort because it allows one to reorder all of the rows of a matrix according to one of the columns:

   for(i in 1:ncol(x)) {
      print(c( median(x[,i]), mad(x[,i])))
   }
   apply(x, 2, median)
   apply(x, 2, mad)
    sort(x[1:10,3])
   x[order(x[1:10,3]),]

Learning R syntax

Arithmetic in R is straightforward.  Some common operators are: + for addition, - for subtraction, * for multiplication, / for division, %/% for integer division,  %% for modular arithmetic, ^ for exponentiation.  The help page for these operators may accessed by typing, say,

   ?'+'

Some common built-in functions are exp for the exponential function, sqrt for square root, log10 for base-10 logarithms, and cos for cosine. The syntax resembles "sqrt(z)".   Comparisons are made using < (less than), <= (less than or equal), == (equal to) with the syntax "a >= b".  To test whether a and b are exactly equal and return a single TRUE/FALSE value (for instance, in an "if" statement), use the command identical(a,b) rather a==b.  Compare the following two ways of comparing the vectors a and b:

   a=c(1,2);b=c(1,3)
   a==b
   identical(a,b)

Also note that in the above example, 'all(a==b)' is equivalent to 'identical(a,b)'.

R uses the "=" sign much like the language C does:  A single "=" means assignment and a double "==" means comparison.  An alternative assignment operator, "<-", remains from the olden days of S and is still used quite frequently by programmers to make their code readable.  This does create a minor irritation that is, fortunately, also rare:  Note the different meanings of the following two expressions, which differ only by a space: 
  
   y <- 7  # assign the value 7 to y
   y < - 7 # logical; depends on whether y is less than -7

Let's continue with simple characterization of the dataset: find the row number of the object with the smallest value of the 4th column using which.min.  A longer, but instructive, way to accomplish this task creates a long vector of logical constants (tmp), mostly FALSE with one TRUE, then pick out the row with "TRUE". 

   which.min(x[,4])
   tmp= (x[,4]==min(x[,4]))
   (1:nrow(x))[tmp]    #   or equivalently,
   which(tmp)

The cut function divides the range of x into intervals and codes the values of x according to which interval they fall.  It this is a quick way to group a vector into bins.  Use the "breaks" argument to either specify a vector of bin boundaries, or give the number of intervals into which x should be cut.  Bin string labels can be specified.  Cut converts numeric vectors into an R object of class  "factor" which can be ordered and otherwise manipulated; e.g. with command levels.  A more flexible method for dividing a vector into groups using user-specified rules is given by split.

   table(cut(x[,"Plx"],breaks=20:25))

The command above uses several tricks. Note that a column in a matrix may be referred to by its name (e.g., "Plx") instead of its number. The notation '20:25' is short for 'c(20,21,22,23,24,25)' and in general, 'a:b' is the vector of consecutive integers starting with a and ending with b (this also works if a is larger than b). Finally, the table command tabulates the values in a vector or factor.

We may be interested in the blank cells for which R has inserted the value NA.  The following commands count the total number of NAs and print all rows that contain any NA. 

   sum(is.na(x))
   f = function(a) any(is.na(a))
   tmp=apply(x,1,f)
   x[tmp,]

Note that the last three lines of code may be combined into one:

   x[apply(x,1,function(a) any(is.na(a))),]



Exploratory Data Analysis and Regression


Univariate explorations

Load the Hipparcos dataset found at http://astrostatistics.psu.edu/datasets/HIP_star.html and obtain a variable-by-variable summary:

   hip = read.table("http://astrostatistics.psu.edu/datasets/HIP_star.dat",
      header=T,fill=T)
   summary(hip)

Recall the variable names in this dataset using the names function. By using attach, we can automatically create temporary variables with these names (these variables are not saved as part of the R session, and they are superseded by any other R objects of the same names).

   names(hip)
   attach(hip)

After using the attach command, we can obtain, say, individual summaries of the variables:

   summary(Vmag)
   summary(B.V)

Next, summarize some of this information graphically using box-and-whisker plots for the four variables Vmag, pmRA, pmDE, and B.V (the last variable used to be B-V, or B minus V, but R does not allow certain characters). These are the 2nd, 6th, 7th, and 9th columns of 'hip'.

   par(mfrow=c(2,2))
   for(i in c(2,6,7,9))
      boxplot(hip[,i],main=names(hip)[i])
   par(mfrow=c(1,1))

The boxplot command does more than produce plots; it also returns output that can be more closely examined. Below, we produce boxplots and save the output.

   b=boxplot(hip[,c(2,6,7,9)])
   names(b)

'b' is an object called a list. To understand its contents, read the help for boxplot. Suppose we wish to see all of the outliers in the pmRA variable, which is the second of the four variables in the current boxplot:

   b$names[2]
   b$out[b$group==2]


Bivariate plots

We begin by producing a scatter plot.

   plot(Vmag,B.V, pch=".")

Let's now produce the same graph but selecting just for Hyades stars. This open cluster should be concentrated both in the sky coordinates RA and DE, and also in the proper motion variables pm_RA and pm_DE. We start by noticing a concentration of stars in the RA distribution:

   plot(Vmag,RA,pch=".")
   x1= (RA>50 & RA<100)

Next, we select according to the proper motions. (As our cuts through the data are parallel to the axes, this variable-by-variable classification approach is very similar to Classification and Regression Trees, or CART.)

   plot(pmRA[x1],pmDE[x1],pch=20, xlim=c(0,200),ylim=c(-200,100))
   x2=(pmRA>90 & pmRA<130)
   x3=(pmDE>-60 & pmDE< -10) # Space in '< -' is necessary!
   x4=x1&x2&x3

After our selection, we replot the HR diagram of Vmag vs. B.V. This shows the Zero Age Main Sequence, plus four red giants, with great precision. Outliers in the original Hipparcos dataset have been effectively removed. However, let's have a final look at the stars we have identified using the pairs command to produce all bivariate plots for pairs of variables. We'll exclude the first and fifth columns (the HIP identifying number and the parallax, which is already known to lie in a narrow band).

   pairs(hip[x4,-c(1,5)],pch=20)

Note that indexing a matrix or vector using negative integers has the effect of excluding the corresponding entries.

We see that there is one outlying star in the DE variable, indicating that this star is not actually one of the Hyades, and one outlying star in the e_Plx variable, indicating that its measurements are not reliable. We exclude both points:

   x5=x4 & (DE>0) & (e_Plx<5)
   pairs(hip[x5,-c(1,5)],pch=20)

Note that the x5 variable, a vector of TRUE and FALSE, may be summed to reveal the number of TRUE's.

   sum(x5)

As a final look at these data, let's consider the original plot of Vmag versus B.V but make the 92 stars we just identified look bigger (pch=20 instead of 46) and color them red (col=2 instead of 1):

   plot(Vmag,B.V,pch=c(46,20)[1+x5],
      col=1+x5)


Linear least-squares regression

Consider the relationship between DE and pmDE among the 92 stars identified above:

   plot(DE[x5],pmDE[x5],pch=20)

If we wish to fit a linear regression (least squares) line to this plot, we may do so using the lm (linear model) function. Note the "response ~ predictor(s)" format used in formulas for functions like lm .

   m=lm(pmDE[x5] ~ DE[x5])
   abline(m,lwd=2,col=2)

The m object just created is an object of class lm. The class of an object in R can help to determine how it is treated by functions such as print and summary.

   m # same as print(m)
   summary(m)

There is a lot of information contained in m that is not displayed by print or summary:

   names(m)
   plot(m$fit,m$resid)

The residual plot produced above reveals no irregularities. Note that when referring to the items in a list by name, as in m$fit or m$resid above, it is only necessary to type enough of the name to uniquely identify it. 


Other types of regression

Let's try a non-linear fit, given by loess or lowess:

   m2=lowess(DE[x5], pmDE[x5])
   lines(m2,col=4,lwd=2)

We can also try resistant regression, using the lqs function in the MASS package.

   library(MASS)
   m3=lqs(pmDE[x5]~DE[x5])
   abline(m3,col=5, lwd=2)

Finally, let's use least absolute deviation regression. To do this, we'll use a function called rq (regression quantiles) in the "quantreg" package. This package is not part of the standard distribution of R, so we'll need to download it. In order to do this, we must tell R where to store the installed library using the install.packages function.

   install.packages("quantreg",lib="c:/",
      CRAN="http://lib.stat.cmu.edu/R/CRAN")
   library(quantreg, lib.loc="c:/")

Assuming the quantreg package is loaded, we may now compare the least-squares fit (red) with the least absolute deviations fit (green). In this example, the two fits are nearly identical:

   m4=rq(pmDE[x5]~DE[x5])
   abline(m7,col=6, lwd=2)


Random numbers in R

It is often necessary to simulate random numbers in R. There are many functions available to accomplish this and related tasks such as evaluating the density, distribution function, and quantile function of a distribution.

Distributions intrinsic to R

R handles many common distributions easily. To see a list, type

   help.search("distribution",package="stats")

Let's consider the well-known normal distribution as an example:

   ?Normal

The four functions 'rnorm', 'dnorm', 'pnorm', and 'qnorm' give random normals, the normal density (sometimes called the differential distribution function), the normal cumulative distribution function (CDF), and the inverse of the normal CDF (also called the quantile function), respectively.

Suppose we wish to simulate a large number of normal random variables with mean 10 and standard deviation 3, then check a histogram against two normal density functions, one based on the true parameters and one based on estimates, to see how it looks. We'll use 'col=2, lty=2, lwd=3' to make the curve based on the true parameters red (color=2), dashed (line type=2), and wider than normal (line width=3). Also note that we are requesting 100 bins in the histogram (nclass=100) and putting it on the same vertical scale as the density functions (freq=FALSE).

   z=rnorm(200000, mean=10, sd=3)
   hist(z,freq=FALSE,nclass=100)
   x=seq(min(z),max(z),len=200)
   lines(x,dnorm(x, mean=10, sd=3),col=2, lty=2, lwd=3)
   lines(x,dnorm(x,mean=mean(z),sd=sqrt(var(z))))

We can find out what proportion of the deviates lie outside 3 standard deviations from the true mean, a common cutoff used by physical scientists. We can also see the true theoretical proportion:

   sum(abs((z-10)/3)>3)/length(z)
   2*pnorm(-3)

In the first line above, we are using sum to count the number of TRUE's in the logical vector (abs((z-10)/3)>3). This works because logical values are coerced to 0's and 1's when necessary.


The power-law or Pareto distribution

A commonly used distribution in astrophysics is the power-law distribution, more commonly known in the statistics literature as the Pareto distribution. There are no built-in R functions for dealing with this distribution, but because it is an extremely simple distribution it is easy to write such functions.

The density function for the Pareto is f(x)=aba/x(a+1) for x>b. Here, a and b are fixed positive parameters, where b is the minimum possible value. As an example, consider the log N = -1.5 * log S relationship, where S is the apparent brightness and N is the number of standard candles randomly located in transparent space. Thus, a Pareto distribution with (a+1) = 1.5 is a reasonable, if simplistic, model for the brightness of observed standard candles in space. The b parameter merely reflects the choice of units of measurement.

It turns out that a Pareto random variable is simply b*exp(X), where X is an exponential random variable with rate=a (i.e., with mean=1/a). However, rather than exploiting this simple relationship, we wish to build functions for the Pareto distribution from scratch. Our default values, which may be changed by the user, will be a=0.5 and b=1.

   dpareto=function(x, a=0.5, b=1) a*b^a/x^(a+1)

Next, we integrate the density function to obtain the distribution function, which is F(x)=1-(b/x)a for x>=b (and naturally F(x)=0 for x < b):

   ppareto=function(x, a=0.5, b=1) (x > b)*(1-(b/x)^a)

Note that (x > b) in the above function is coerced to numeric, either 0 or 1.

Inverting the distribution function gives the quantile function. The following simplistic function is wrong unless 0<u<1, so a better-designed function should do some error-checking.

   qpareto=function(u, a=0.5, b=1) b/(1-u)^(1/a)

Finally, to simulate random Pareto random variables, we use the fact that whenever the quantile function is applied to a uniform random variable, the result is a random variable with the desired distribution:

   rpareto=function(n, a=0.5, b=1) qpareto(runif(n),a,b)

Creating functions in R, as illustrated above, is a common procedure. Note that each of the arguments of a function may be given a default value, which is used whenever the user calls the function without specifying the value of this parameter. Also note that each of the above functions consists of only a single line; however, longer functions may be created by enclosing them inside curly braces { }.


A few simple plots

The commands below create plots related to the four functions just created.

   par(mfrow=c(2,2))
   x=seq(1,50,len=200)
   plot(x,dpareto(x),type="l")
   plot(x,ppareto(x),type="l",lty=2)
   u=seq(.005,.9,len=200)
   plot(u,qpareto(u),type="l",col=3)
   z=rpareto(200)
   dotchart(log10(z), main="200 random logged Pareto deviates",
      cex.main=.7)
   par(mfrow=c(1,1))

The above commands illustrate some of the many plotting capabilities of R. The par function sets many graphical parameters, for instance, 'mfrow=c(2,2)', which divides the plotting window into a matrix of plots, set here to two rows and two columns. In the plot commands, 'type' is set here to "l" for a line plot; other common options are "p" for points (the default), "b" for connected dots, and "n" for nothing (to create axes only). Other options used: 'lty' sets the line type (1=solid, 2=dashed, etc.), 'col' sets color (1=black, 2=red, 3=green, etc.), 'main' puts a string into the plot title, and 'cex.main' sets the text magnification.
Type '? par' to see a list of the many plotting parameters and options.


Some hypothesis testing procedures in R

This module demonstrates some of the many statistical tests that R can perform. It is impossible to give an exhaustive list of such testing functionality, but we hope not only to provide several examples but also to elucidate some of the logic of statistical hypothesis tests with these examples.


T tests, permutation-based and otherwise

Earlier, we used exploratory techniques to identify 92 stars from the Hipparcos data set that are associated with the Hyades.  Now, we will compare these Hyades stars with the remaining stars in the Hipparcos dataset on the basis of the B.V (B minus V) variable. That is, we are comparing the groups in the boxplot below:

   boxplot(B.V~x5,notch=T,ylab="B minus V")

For ease of notation, we define vectors H and nH (for "Hyades" and "not Hyades") that contain the data values for the two groups.

   H=B.V[x5]
   nH=B.V[!x5 & !is.na(B.V)]

In the definition of nH above, we needed to exclude the NA values (there are no NAs among the 92 Hyades stars here).

A two-sample t-test may be performed with a single line:

   t.test(H,nH)

For purposes of illustration, we may obtain the same results without resorting to the t.test function:

   v1=var(H)/92
   v2=var(nH)/2586
   tstat=(mean(H)-mean(nH))/sqrt(v1+v2)
   tstat

To test the null hypothesis, this t statistic is compared to a t distribution. In a Welch test, we assume that the variances of the two populations are not necessarily equal, and the degrees of freedom of the t distribution are computed using the so-called Satterthwaite approximation:

   (v1 + v2)^2 / (v1^2/91 + v2^2/2585)

The two-sided p-value is now given by:

   2*pt(tstat,97)

In this particular example, the Welch test is probably not necessary, since the sample variances are so close that an assumption of equal variances is warranted.  Thus, we might instead use:

   t.test(H,nH,var.equal=T)


Empirical distribution functions

The empirical distribution function (EDF) of a sample is, by definition, the cumulative distribution function for the discrete distribution represented by the sample itself -- that is, the distribution that puts mass 1/n on each of the n sample points. We may graph the EDF using the ecdf function:

   plot(ecdf(H))

While it is generally very difficult to interpret the EDF directly, it is possible to compare an EDF to a theoretical cumulative distribution function or two another EDF. Among the statistical tests that implement such a comparison is the Kolmogorov-Smirnov test, which is implemented by the R function ks.test.

   ks.test(tlist,"pnorm")
   ks.test(H,nH)

Whereas the first result above can be somewhat surprising (a small p-value means a statistically significant difference), the second result should not be; we already saw that H and nH have statistically significantly different means. However, if we center each, we obtain

   ks.test(H-mean(H),nH-mean(nH))

In other words, the Kolmogorov-Smirnov test finds no statistically significant evidence that the distribution of B.V for the Hyades stars is anything other than a shifted version of the distribution of B.V for the other stars.  We'll say more about this later.


Chi-squared tests for categorical data

We begin with some boxplots of the Vmag variable broken into groups according to the B minus V variable:

   bvcat=cut(B.V, breaks=c(-Inf,.5,.75,1,Inf))
   boxplot(Vmag~bvcat, varwidth=T,
      ylim=c(max(Vmag),min(Vmag)),
      xlab=expression("B minus V"),
      ylab=expression("V magnitude"),
      cex.lab=1.4, cex.axis=.8)

The cut values for the newly created categorical variable bvcat are based roughly on the quartiles of the B.V variable.  Recall that x5, the Hyades indicator, was created earlier. Here is a summary of the dataset based only on these two variables:

   table(bvcat,x5)

Note that the Vmag variable is irrelevant in the table above.

To perform a chi-squared test of the null hypothesis that the true population proportions falling in the four categories are the same for both the Hyades and non-Hyades stars, use the chisq.test function:

   chisq.test(bvcat,x5)

Since we already know (from the t-test above) that these two groups differ with respect to the B.V variable, the result of this test is not too surprising. But it does give a qualitatively different way to compare these two distributions than simply comparing their means.

The p-value produced above is based on the fact that the chi-squared statistic is approximately distributed like a true chi-squared distribution (on 3 degrees of freedom, in this case) if the null hypothesis is true. However, it is possible to obtain exact p-values, if one wishes to calculate the chi-squared statistic for all possible tables of counts with the same row and column sums as the given table. Since this is rarely practical computationally, the exact p-value may be approximated using a Monte Carlo method. Such a method is implemented in the chisq.test function:

   chisq.test(bvcat,x5,sim=T,B=50000)

Thus, the first method produces the exact value of an approximate p-value, whereas the second method produces an approximation to the exact p-value!

The test above is usually called a chi-squared test of homogeneity. If we observe only one sample, but we wish to test whether the categories occur in some pre-specified proportions, a similar test (and the same R function) may be applied. In this case, the test is usually called a chi-squared test of goodness-of-fit.

Principal components

We first load a new data set, this one relating to a group of quasars in the Sloan digital sky survey:

   quas=read.table( "http://astrostatistics.psu.edu/datasets/SDSS_quasar.dat",
      header=T)
   dim(quas)

There are many missing values. Since missing values create havoc with covariance matrices, we first eliminate all rows with missing values:

   quas[quas==0 | quas==-1 | quas==-9]=NA
   quas=quas[!apply(quas,1,  function(a) any(is.na(a))),]
   dim(quas)

This leaves us with a much smaller dataset, but for purposes of illustration it will serve well. 

The goal of principal components is to reduce the number of variables by finding the "best" linear combinations of all existing variables.  To understand what is "best" in this context, consider the 22 quantitative measurement columns in the quas dataset (the first column is the SDSS designation of the object, not a quantitative measurement). Each row may be considered to be a point in 22-dimensional Euclidean space. Thus, the entire dataset consists of a cloud of 279 22-dimensional points. The "best" linear combination here will be the single vector in 22-space parallel to which the variance of these 279 points is the greatest. The second-best will be the single vector orthogonal to the first along which the variance is the greatest, and so on.

To create a single object containing all the principal components information you will need (after eliminating the first column), type

   pc=princomp(quas[,-1])

Let's see what kind of information is carried in pc.

   names(pc)
   ?princomp

The most important bits, namely pc$sdev and pc$loadings, are obtained directly from an eigenvalue/eigenvector decomposition of the sample covariance matrix.   If one invokes the princomp command with cor=TRUE, then the eigen decomposition is performed on the correlation matrix rather than the covariance matrix. Which method is more appropriate in this case? To answer this question, let's examine the standard deviations of the columns of quas:

   sqrt(apply(quas[,-1],2,var))

Note that the variation of the R.A and Dec. columns is far larger than that of any other column. Thus, we should not be surprised if these two columns dominate the first two principal components. In fact, since these two columns together with z give position of the object, we might want to extract them from the principal components analysis altogether, retaining them unchanged in the reduced data set. However, we could essentially put all variables on an equal footing in terms of variability by using the correlation rather than the covariance (this is equivalent to standardizing each of the variables to have standard deviation equal to 1 before performing the princomp analysis).

The two most important pieces of information in a principal components analysis are the variances explained (eigenvalues) and variable loadings (eigenvectors). The former may be viewed graphically using a technique called a screeplot:

   screeplot(pc)

In the above plot, we see that the variance of the first two PCs dwarfs the others. To see what this means, we must look at the loadings for these first two PCs:

   loadings(pc)

This last command prints a lot of information. Scroll up to see the loadings of components 1 and 2, with any loadings less than 0.1 in absolute value suppressed as unimportant. (In reality, the loadings for the first principal component are a vector of real numbers, scaled so that the sum of their squares equals 1. Each element in the vector gives the relative weight of the corresponding variable.)

To see all of the loadings for the first principal component (and only those), type

   pc$load[,1]

We may conclude from the output above that the first principal component consists essentially of nothing other than R.A (recall that the standard deviation of R.A was much larger than that of the other variables, so this result is really not surprising).

It is also unsurprising to see that the second principal component consists almost entirely of the Declination:

   pc$load[,2]

These two principal components together comprise over 99.8% of the total variance of these variables, which makes it difficult to see easily the effect of the remaining principal components. As explained earlier, one way to deal with the problem of variables on vastly different scales is by analyzing the correlation matrix rather than the covariance matrix. However, in this case, the two variables causing the trouble are easy to identify; thus, we'll proceed by performing a new principal components analysis on the remaining columns of quas after R.A and Dec. are removed:

   pc2=princomp(quas[,-(1:3)])
   screeplot(pc2)

In the new screeplot, we see three or four PCs with relatively large variance, one to four PCs with moderate variance, and the rest with relatively small variance. Let's see what the variable loadings for these first five PCs are:

   loadings(pc2)

Again, it is necessary to scroll up to see the important output.

Examining these loadings, the first PC is somewhat difficult to interpret, but the second is basically an average of all the "_mag" variables. Notably, the three variables (u_mag, g_mag, r_mag) always occur with roughly the same weights in the first few PCs, indicating that we may replace these three with a single variable equal to their mean. The same is true of (i_mag, z_mag) and (J_mag, H_mag, K_mag). We could thus reduce these 8 variables to 3.

Another approach is to analyze only the principal component scores themselves, which are contained in pc$scores. This 279x22 matrix contains exactly the same information as the original dataset, but the axes have been rotated so that the first axis is the most important in explaining information, followed by the second, etc. Based on our analysis, only 5 or 6 of these PCs should be very variable.

   pairs(pc$scores[,1:6],pch=".")

The drawback to the above plots, of course, is that many of them are difficult to interpret.

In summary, principal components provides an objective way to decide, based on data alone, how to reduce the dimensionality of a dataset to ease interpretability. However, substantive astronomical knowledge should be at least as important as such considerations (e.g., if M_i is known to be important, then maybe it should be kept regardless of what PC analysis says).

 

Bootstrapping



Nonparametric bootstrapping of regression standard errors

Here, we revisit the Hipparcos data set and the 92 stars identified as Hyades stars.  Earlier, we considered a linear model for the relationship between DE and pmDE among these 92 stars:

   plot(DE[x5],pmDE[x5],pch=20)
   m=lm(pmDE[x5] ~ DE[x5])
   abline(m,lwd=2,col=2)

The red line on the plot is the usual least-squares line, for which estimation is easy and asymptotic theory gives easy-to-calculate standard errors for the coefficients:

   summary(m)$coef

However, suppose we wish to use a resistant regression method such as lqs.

   library(MASS)
   m2=lqs(pmDE[x5] ~ DE[x5])
   abline(m2,lwd=2,col=3)
   m2

In this case, it is not so easy to obtain standard errors for the coefficients. Thus, we will turn to bootstrapping. In a standard, or nonparametric, bootstrap, we repeatedly draw samples of size 92 from the empirical distribution of the data, which in this case consist of the (DE, pmDE) pairs. We use lqs to fit a line to each sample, then compute the sample covariance of the resulting coefficient vectors. The procedure works like this:

   x=DE[x5]
   y=pmDE[x5]
   m2B=NULL
   for (i in 1:200) {
      s=sample(92,replace=T)
      m2B=rbind(m2B, lqs(y[s]~x[s])$coef)
   }

We may now find the sample covariance matrix for m2B. The (marginal) standard errors of the coefficients are obtained as the square roots of the diagonal entries of this matrix:

   cov(m2B)
   se=sqrt(diag(cov(m2B)))
   se

The logic of the bootstrap procedure is that we are estimating an approximation of the true standard errors. The approximation involves replacing the true distribution of the data (unknown) with the empirical distribution of the data. This approximation may be estimated with arbitrary accuracy by a Monte Carlo approach, since the empirical distribution of the data is known and in principle we may sample from it as many times as we wish. In other words, as the bootstrap sample size increases, we get a better estimate of the true value of the approximation. On the other hand, the quality of this approximation depends on the original sample size (92, in our example) and there is nothing we can do to change it.

An alternative way to generate a bootstrap sample in this example is by generating a new value of each response variable (y) by adding the predicted value from the original lqs model to a randomly selected residual from the original set of residuals. Thus, we resample not the entire bivariate structure but merely the residuals. As an exercise, you might try implementing this approach in R. Note that this approach is not a good idea if you have reason to believe that the distribution of residuals is not the same for all points. For instance, if there is heteroscedasticity or if the residuals contain structure not explained by the model, this residual resampling approach is not warranted.


Using the boot package in R

There is a boot package in R that contains many functions relevant to bootstrapping. As a quick example, we will show here how to obtain the same kind of bootstrap example obtained above (for the lqs model of pmDE regressed on DE for the Hyades stars.)

   library(boot)
   mystat=function(a,b)
      lqs(a[b,2]~a[b,1])$coef
   m2B2 = boot(cbind(DE[x5],pmDE[x5]),
      mystat, 200)
   names(m2B2)

As explained in the help file, the boot function requires as input a function that accepts as arguments the whole dataset and an index that references an observation from that dataset. This is why we defined the mystat function above. To see the output that is similar to that obtained earlier for the m2B object, look in m2B2$t:

   cov(m2B2$t)
   sqrt(diag(cov(m2B2$t)))

Compare with the output provided by print.boot and the plot produced by plot.boot:

   m2B2
   plot(m2B2)

Another related function, for producing bootstrap confidence intervals, is boot.ci.


Parametric bootstrapping of regression standard errors

We now return to the regression problem studied earlier.

Sometimes, resampling is done from a theoretical distribution rather than from the original sample. For instance, if simple linear regression is applied to the regression of pmDE on DE, we obtain a parametric estimate of the distribution of the residuals, namely, normal with mean zero and standard deviation estimated from the regression:

   summary(m)

Remember that m was defined above as lm(pmDE[x5]~DE[x5]). We observe a residual standard error of 4.449.

A parametric bootstrap scheme proceeds by simulating a new set of pmDE (or y) values using the model

   y = 21.9 - 3.007*DE[x5] + 4.449*rnorm(92)

Then, we refit a linear model using y as the new response, obtaining slightly different values of the regression coefficients. If this is repeated, we obtain an approximation of the joint distribution of the regression coefficients for this model.

Naturally, the same approach could be tried with other regression methods such as those discussed in the EDA and regression module, but careful thought should be given to the parametric model used to generate the new residuals. In the normal case discussed here, the parametric bootstrap is simple, but it is really not necessary because standard linear regression already gives a very good approximation to the joint distribution of the regression coefficients when errors are heteroscedastic and normal. One possible use of this method is in a model that assumes the absolute residuals are exponentially distributed, in which case least absolute deviation regression as discussed in the EDA and regression module can be justified. The reader is encouraged to implement a parametric bootstrap using the rq function found in the "quantreg" package.


Estimating a p-value (parametric bootstrap for Kolmogorov-Smirnov)

Earlier, we conducted a t-test of the null hypothesis that the 92 Hyades stars have the same mean of B minus V as the 2586 non-Hyades stars.  We can test a related hypothesis without relying on the normal approximation of the t-test using a permutation test.   Because there are 2678!/(92!2586!)=3.78x10142 different ways to select 92 objects from 2678 objects, we have to depend on a random sample of such selections:

   H=B.V[x5]
   nH=B.V[!x5 & !is.na(B.V)]
   tlist=NULL
   all=c(H,nH)
   for(i in 1:5000) {
      s=sample(2586,92) # choose a sample
      tlist=c(tlist, t.test(all[s],all[-s],
        var.eq=T)$stat) # add t-stat to list
   }

Even though the sample in tlist appears to be very close to standard normal according to a histogram, the Kolmogorov-Smirnov test rejects the null hypothesis of standard normality:

   ks.test(tlist,"pnorm")

However, if we use a Kolmogorov-Smirnov test in which we allow the mean of the theoretical normal distribution to be estimated from the sample, we see no evidence of a departure from normality according to the usual output:

   newkstest=ks.test(tlist,"pnorm",mean=mean(tlist))
   newkstest

Note in the above function call that ks.test passes the "mean=" option directly to the pnorm function. The ks.test function does not know beforehand how many, if any, options might be passed to the function it requires; therefore, the function definition of ks.test uses three dots (...) in its option list to stand for some unspecified number of options.

Unfortunately, the Kolmogorov-Smirnov test is not quite valid if the theoretical distribution against which we are testing depends upon estimates derived from the data. (Imagine the extreme case: We could test the sample distribution against itself, which is simply an estimate derived from the data!) Therefore, we can get a better sense of how valid the second p-value is by using a Monte Carlo approach.

To carry out the Monte Carlo test, observe that the stat statistic for our K-S test by typing

   newkstest$stat

The value that we seek is the probability that X is at least as large as this observed statistic, where X is the random test statistic generated when we select a sample of size 5000 from the null distribution:

   ksstat=NULL
   for(i in 1:1000) {
      x=rnorm(5000)
      ksstat=c(ksstat,
         ks.test(x,pnorm,mean=mean(x))$stat)
   }

Now let's look at a histogram of the test statistics and estimate a p-value:

   hist(ksstat,nclass=40)
   ns=newkstest$stat
   abline(v=ns,lty=2,col=2)
   mean(ksstat>=ns)

Note that the approximate p-value above is smaller than the original p-value reported by newkstest, though it is still not small enough to provide strong evidence that the tlist sample is not normal with unit variance.

We conclude this section by briefly explaining the parenthetical remark "parametric bootstrap for Kolmogorov-Smirnov" in its title and discussing a related nonparametric bootstrap method. The above procedure is equivalent to a parametric bootstrap procedure in which we draw samples not from a standard normal, but from a normal distribution with mean equal to mean(tlist). This is because the "mean=mean(x)" option implies that we are applying a particular K-S test that is invariant to the mean of the distribution.

An interesting exercise is to implement a nonparametric bootstrap in this example that would be based only on resampling. In this case, we cannot merely repeatedly use

   s=sample(5000,replace=T)
   ks.test(tlist[s],pnorm,mean=mean(tlist[s]))$stat

The reason is because this statistic is biased. Therefore, it is necessary to correct for the bias. Unfortunately, this is not a trivial correction and it essentially requires a modified version of ks.test: Whereas the ordinary ks.test statistic is based on the maximum distance between the EDF of the sample and the theoretical CDF, the modified version must subtract off the bias before finding the maximum.