Part I: Introduction to R

R is the most comprehensive public-domain statistical software available today.  An implementation of the S language for statistical analysis, it is increasingly popular in the statistical community. It takes only a few minutes to obtain:  go to http://www.R-project.org, choose a CRAN mirror site, and download the files appropriate for your operating system.

The following tutorial is a series of excerpts from a weeklong set of tutorials presented at the Summer School in Statistics for Astronomers and Physicists at Penn State University and at Kavalur, India. When this demo is viewed online, the bolded, indented lines

   # like this one

are meant to be copied and pasted directly into R at the command prompt.
Enter R by typing "R" (UNIX) or double-clicking to execute Rgui.exe (Windows) or R.app (Mac).  In the commands below, we start by extracting some system and user information, the version of R you are using, and some of its capabilities.  The citation function tells how to cite R in publications.  R is released under the GNU General Public License, as indicated by the copyright help entry. Typing a question mark in front of a command opens the help file for that command.

   Sys.info()
   R.version 
   capabilities() 
   citation() 
   ?copyright 
The various capitalizations above are important as R is case-sensitive. When using R interactively, it is very helpful to know that the up-arrow key can retrieve previous commands, which may be edited using the left- and right-arrow keys and the delete key.

The last command above, ?copyright, is equivalent to help(copyright) or help("copyright"). However, to use this command you have to know that the function called "copyright" exists. Suppose that you knew only that there was a function in R that returned copyright information but you could not remember what it was called. In this case, the help.search function provides a handy reference tool:
   help.search("copyright") 
Last but certainly not least, a vast array of documentation and reference materials may be accessed via a simple command:
   help.start() 
The initial working directory in R is set by default or by the directory from which R is invoked (if it is invoked on the command line). It is possible to read and set this working directory using the getwd or setwd commands. A list of the files in the current working directory is given by list.files, which has a variety of useful options and is only one of several utilities interfacing to the computer's files. In the setwd command, note that in Windows, path (directory) names are not case-sensitive and may contain either forward slashes or backward slashes; in the latter case, a backward slash must be written as "\\" when enclosed in quotation marks.
   getwd()
   list.files() # what's in this directory?
   # The # symbol means that the rest of that line is a comment.
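If you need to change the working directory, setwd takes a path in quotation marks; the directory names below are purely hypothetical, so substitute one that exists on your own machine.
   # setwd("C:/astro/data")     # Windows: forward slashes work
   # setwd("C:\\astro\\data")   # Windows: backslashes must be doubled
   # setwd("/home/user/astro")  # UNIX or Mac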

Reading data into R

In the course of learning a bit about how to generate data summaries in R, one will inevitably learn some useful R syntax and commands. Thus, this first part on descriptive statistics serves a dual role as a brief introduction to R.

We wish to read an ASCII data file into an R object using the read.table command or one of its variants.  Let's begin with a cleaned-up version of the Hipparcos dataset, a description of which is given at http://astrostatistics.psu.edu/datasets/HIP_star.html. There are two distinct lines below that read the dataset and create an object named hip. The first (currently commented out) may be used whenever one has access to the internet; the second assumes that the HIP_star.dat file has been saved into the current working directory.
#   hip  <-  read.table("http://astrostatistics.psu.edu/datasets/HIP_star.dat",
#      header=T,fill=T) # T is short for TRUE
   hip  <-  read.table("HIP_star.dat", header=T,fill=T)
The "<-", which is actually "less than" followed by "minus", is the R assignment operator. Admittedly, this is a bit hard to type repeatedly, so fortunately R also allows the use of a single equals sign (=) for assignment.

Note that no special character must be typed when a command is broken across lines as in the example above. Whenever a line is entered that is not yet syntactically complete, R will replace the usual prompt, ">", with a "+" sign to indicate that more input is expected. The read.table function can refer to a location on the web, though a filename (of a file in the working directory) or a pathname would have sufficed. The "header=TRUE" option is used because the first row of the file is a header containing the names of the columns. We used the "fill=TRUE" option because some of the rows have only 8 of the 9 columns filled, and "fill=TRUE" instructs R to fill in blank fields at the end of the line with missing values, denoted by the special R constant NA ("not available"). Warning: This only works in this example because all of the empty cells are in the last column of the table. (You can verify this by checking the ASCII file HIP_star.dat.) Because the read.table function uses delimiters to determine where to break between columns, any row containing only 8 values would always put the NA in the 9th column, regardless of where it was intended to be placed. As a general rule, data files with explicit delimiters are to be preferred to files that use line position to indicate column number, particularly when missing data are present. If you must use line position, R provides the read.fortran and read.fwf functions for reading fixed-width format files.

Summarizing the dataset

The following R commands list the dimensions of the dataset and print the variable names (from the single-line header).  Then we list the first row, the first 20 entries of the 7th column, and the sum of the 3rd column. 
   dim(hip)
   names(hip) 
   hip[1,]
   hip[1:20,7]
   sum(hip[,3])
Note that vectors, matrices, and arrays are indexed using the square brackets and that "1:20" is shorthand for the vector containing integers 1 through 20, inclusive. Even punctuation marks such as the colon have help entries, which may be accessed using help(":").

Next, list the maximum, minimum, median, and median absolute deviation (a robust analogue of the standard deviation) of each column.  First we do this using a for-loop, which is a slow process in R.  Inside the loop, c is a generic R function that combines its arguments into a vector and print is a generic R command that prints the contents of an object. After the inefficient but intuitively clear approach using a for-loop, we then do the same job in a more efficient fashion using the apply command.  Here the "2" refers to columns of the hip data frame; a "1" would refer to rows.
   for(i in 1:ncol(hip)) {
      print(c(max(hip[,i]), min(hip[,i]), median(hip[,i]), mad(hip[,i]))) 
   }
   apply(hip, 2, max) 
   apply(hip, 2, min) 
   apply(hip, 2, median) 
   apply(hip, 2, mad)
The curly braces {} in the for loop above are optional because there is only a single command inside. Notice that the output gives only NA for the last column's statistics. This is because a few values in this column are missing. We can tell how many are missing and which rows they come from as follows:
   sum(is.na(hip[,9]))
   which(is.na(hip[,9]))
There are a couple of ways to deal with the NA problem. One is to repeat all of the above calculations on a new R object consisting of only those rows containing no NAs:
   y  <-  na.omit(hip)
   for(i in 1:ncol(y)) {
      print(c(max(y[,i]), min(y[,i]), median(y[,i]), mad(y[,i]))) 
   }
Another possibility is to use the na.rm (remove NA) option of the summary functions. This solution gives slightly different answers from the solution above; can you see why?
   for(i in 1:ncol(hip)) {
      print(c(max(hip[,i],na.rm=T), min(hip[,i],na.rm=T), median(hip[,i],na.rm=T), mad(hip[,i],na.rm=T))) 
   }
A vector can be sorted using the sort function, which implements the Shellsort or Quicksort algorithm; rank returns the ranks of the values in a numeric vector; and order returns a vector of indices that will sort a vector. The last of these functions, order, is often the most useful of the three, because it allows one to reorder all of the rows of a matrix according to one of the columns:
   sort(hip[1:10,3])
   hip[order(hip[1:10,3]),]
Each of the above lines gives the sorted values of the first ten entries of the third column, but the second line reorders each of the ten rows in this order. Note that neither of these commands actually alters the value of hip, but we could reassign hip to equal its sorted values if desired.
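Such a reassignment might look like the following sketch, which stores a version of hip sorted by its third column under a new, arbitrarily chosen name rather than overwriting the original:
   hipsorted <- hip[order(hip[,3]),] # all rows, reordered by column 3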

More R syntax

Arithmetic in R is straightforward.  Some common operators are: + for addition, - for subtraction, * for multiplication, / for division, %/% for integer division,  %% for modular arithmetic, ^ for exponentiation.  The help page for these operators may be accessed by typing, say,
   ?'+'
Some common built-in functions are exp for the exponential function, sqrt for square root, log10 for base-10 logarithms, and cos for cosine. The syntax resembles "sqrt(z)".   Comparisons are made using < (less than), <= (less than or equal), and == (equal to), with syntax such as "a >= b".  To test whether a and b are exactly equal and return a single TRUE/FALSE value (for instance, in an "if" statement), use the command identical(a,b) rather than a==b.  Compare the following two ways of comparing the vectors a and b:
   a <- c(1,2);b <- c(1,3)
   a==b
   identical(a,b)
Also note that in the above example, 'all(a==b)' is equivalent to 'identical(a,b)'.

R also has other logical operators such as & (AND), | (OR), ! (NOT). There is also an xor (exclusive or) function. Each of these four functions performs elementwise comparisons in much the same way as arithmetic operators:
   a <- c(TRUE,TRUE,FALSE,FALSE);b <- c(TRUE,FALSE,TRUE,FALSE)
   !a
   a & b
   a | b
   xor(a,b)
However, when 'and' and 'or' are used in programming, say in 'if' statements, generally the '&&' and '||' forms are preferable. These longer forms of 'and' and 'or' evaluate left to right, examining only the first element of each vector, and evaluation terminates when a result is determined. A list of other operators can be found by typing ?Syntax.
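As a small illustration of this short-circuit behavior (using the a and b vectors defined above):
   FALSE && stop("this is never evaluated") # returns FALSE; the right-hand side is skipped
   if(a[1] && b[1]) print("the first elements of a and b are both TRUE")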

The expression "y == x^2" evaluates as TRUE or FALSE, depending upon whether y equals x squared, and performs no assignment (if either y or x does not currently exist as an R object, an error results).

Let's continue with a simple characterization of the dataset: find the row number of the object with the smallest value of the 4th column using which.min.  A longer, but instructive, way to accomplish this task creates a long vector of logical constants (tmp), mostly FALSE with one TRUE, then picks out the row marked TRUE.
   which.min(hip[,4])
   tmp <- (hip[,4]==min(hip[,4])) 
   (1:nrow(hip))[tmp] # or equivalently, 
   which(tmp)
The cut function divides the range of x into intervals and codes the values of x according to the interval into which they fall.  This is a quick way to group a vector into bins.  Use the "breaks" argument either to specify a vector of bin boundaries or to give the number of intervals into which x should be cut.  Bin string labels can be specified.  The cut function converts a numeric vector into an R object of class "factor", which can be ordered and otherwise manipulated, e.g., with the levels command.  A more flexible method for dividing a vector into groups using user-specified rules is given by split.
   table(cut(hip[,"Plx"],breaks=20:25))
The command above uses several tricks. Note that a column in a matrix may be referred to by its name (e.g., "Plx") instead of its number. The notation '20:25' is short for 'c(20,21,22,23,24,25)' and in general, 'a:b' is the vector of consecutive integers starting with a and ending with b (this also works if a is larger than b). Finally, the table command tabulates the values in a vector or factor.

Although R makes it easy for experienced users to invoke multiple functions in a single line, it may help to recognize that the previous command accomplishes the same task as the following string of commands:
   p <- hip[,"Plx"]
   cuts <- cut(p,breaks=20:25)
   table(cuts)
The only difference is that the string of three separate commands creates two additional R objects, p and cuts. The preferred method of carrying out these operations depends on whether there will later be any use for these additional objects.

Univariate plots

Recall the variable names in the Hipparcos dataset using the names function. By using attach, we can automatically create temporary variables with these names (these variables are not saved as part of the R session, and they are superseded by any other R objects of the same names).
   names(hip)
   attach(hip)
After using the attach command, we can obtain, say, individual summaries of the variables:
   summary(Vmag)
   summary(B.V)
Next, summarize some of this information graphically using a simple yet sometimes effective visualization tool called a dotplot or dotchart, which lets us view all observations of a quantitative variable simultaneously:
   dotchart(B.V)
The shape of the distribution of the B.V variable may be viewed using a traditional histogram. If we use the prob=TRUE option for the histogram so that the vertical axis is on the probability scale (i.e., the histogram has total area 1), then a so-called kernel density estimate, or histogram smoother, can be overlaid:
   hist(B.V,prob=T)
   d <- density(B.V,na.rm=T)
   lines(d,col=2,lwd=2,lty=2)
The topic of density estimation will be covered later. For now, it is important to remember that even though histograms and density estimates are drawn in two-dimensional space, they are intrinsically univariate analysis techniques here: We are only studying the single variable B.V in this example (though there are multivariate versions of these techniques as well).

Check the help file for the par function (by typing "?par") to see what the col, lwd, and lty options accomplish in the lines function above.

A simplistic histogram-like object for small datasets, which both gives the shape of a distribution and displays each observation, is called a stem-and-leaf plot. It is easy to create by hand, but R will create a text version:
   stem(sample(B.V,100))
The sample command was used above to obtain a random sample of 100, without replacement, from the B.V vector.

Finally, we consider box-and-whisker plots (or "boxplots") for the four variables Vmag, pmRA, pmDE, and B.V (the last variable used to be B-V, or B minus V, but R does not allow certain characters). These are the 2nd, 6th, 7th, and 9th columns of 'hip'.
   boxplot(hip[,c(2,6,7,9)])
Our first attempt above looks pretty bad due to the different scales of the variables, so we construct an array of four single-variable plots:
   par(mfrow=c(2,2))
   for(i in c(2,6,7,9)) 
      boxplot(hip[,i],main=names(hip)[i])
   par(mfrow=c(1,1))
The boxplot command does more than produce plots; it also returns output that can be more closely examined. Below, we produce boxplots and save the output.
   b <- boxplot(hip[,c(2,6,7,9)])
   names(b)
'b' is an object called a list. To understand its contents, read the help for boxplot. Suppose we wish to see all of the outliers in the pmRA variable, which is the second of the four variables in the current boxplot:
   b$names[2]
   b$out[b$group==2]

Scatterplots

Let us now examine the bivariate relationship between Vmag and B minus V using a scatter plot.
   plot(Vmag,B.V)
The above plot looks too busy because of the default plotting character, so let's use a different one:
   plot(Vmag,B.V,pch=".")
Let's now use exploratory scatterplots to locate the Hyades stars. This open cluster should be concentrated both in the sky coordinates RA and DE, and also in the proper motion variables pmRA and pmDE. We start by noticing a concentration of stars in the RA distribution:
   plot(RA,DE,pch=".")
See the cluster of stars with RA between 50 and 100 and with DE between 0 and 25?
   rect(50,0,100,25,border=2)
Let's construct a logical (TRUE/FALSE) variable that will select only those stars in the appropriate rectangle:
   filter1 <-  (RA>50 & RA<100 & DE>0 & DE<25)
Next, we select based on the proper motions. (As our cuts through the data are parallel to the axes, this variable-by-variable classification approach is sometimes called Classification and Regression Trees or CART, a very common multivariate classification procedure.)
   plot(pmRA[filter1],pmDE[filter1],pch=20)
   rect(0,-150,200,50,border=2)
Let's replot after zooming in on the rectangle shown in red.
   plot(pmRA[filter1],pmDE[filter1],pch=20, xlim=c(0,200),ylim=c(-150,50))
   rect(90,-60,130,-10,border=2)
   filter2 <- (pmRA>90 & pmRA<130 & pmDE>-60 & pmDE< -10) # Space in 'pmDE< -10' is necessary!
   filter  <-  filter1 & filter2
Let's have a final look at the stars we have identified using the pairs command to produce all bivariate plots for pairs of variables. We'll exclude the first and fifth columns (the HIP identifying number and the parallax, which is known to lie in a narrow band by construction).
   pairs(hip[filter,-c(1,5)],pch=20)
Notice that indexing a matrix or vector using negative integers has the effect of excluding the corresponding entries.
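A couple of quick examples of negative indexing:
   (1:10)[-c(2,4)] # drops the 2nd and 4th elements
   hip[1:3,-1]     # first three rows, all columns except the HIP identifier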

We see that there is one outlying star in the e_Plx variable, indicating that its measurements are not reliable. We exclude this point:
   filter <- filter & (e_Plx<5)
   pairs(hip[filter,-c(1,5)],pch=20)
How many stars have we identified? The filter variable, a vector of TRUE and FALSE, may be summed to reveal the number of TRUE's (summation causes R to coerce the logical values to 0's and 1's).
   sum(filter)
As a final look at these data, let's consider the HR plot of Vmag versus B.V but make the 92 Hyades stars we just identified look bigger (pch=20 instead of 46) and color them red (col=2 instead of 1). This shows the Zero Age Main Sequence, plus four red giants, with great precision.
   plot(Vmag,B.V,pch=c(46,20)[1+filter], col=1+filter,
      xlim=range(Vmag[filter]), ylim=range(B.V[filter]))

R scripts

While R is often run interactively, one often wants to carefully construct R scripts and run them later.  A file containing R code can be run using the source command. In addition, R may be run in batch mode.
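For example, assuming you have saved some commands in a file (the name myscript.R below is purely hypothetical), you can run it from within R with source, or from the operating-system command line in batch mode:
   # source("myscript.R")  # within R: run the commands in the script
   # At the shell prompt (not inside R):  R CMD BATCH myscript.R myscript.Rout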

The editor Emacs, together with "Emacs speaks statistics", provides a nice way to produce R scripts.

Part II: Hypothesis testing and estimation

This part demonstrates a few of the many statistical tests that R can perform. It is impossible to give an exhaustive list of such testing functionality, but we hope not only to provide several examples but also to elucidate some of the logic of statistical hypothesis tests with these examples.

T tests

Earlier, we used exploratory techniques to identify 92 stars from the Hipparcos data set that are associated with the Hyades. We did this based on the values of right ascension, declination, principal motion of right ascension, and principal motion of declination. We then excluded one additional star with a large error of parallax measurement:
   hip <- read.table("http://astrostatistics.psu.edu/datasets/HIP_star.dat",
      header=T, fill=T)
#   hip <- read.table("HIP_star.dat", header=T, fill=T)
   attach(hip)
   filter1 <-  (RA>50 & RA<100 & DE>0 & DE<25)
   filter2 <- (pmRA>90 & pmRA<130 & pmDE>-60 & pmDE< -10) 
   filter  <-  filter1 & filter2 & (e_Plx<5) 
   sum(filter)
In this section, we will compare these Hyades stars with the remaining stars in the Hipparcos dataset on the basis of the color (B minus V) variable. That is, we are comparing the groups in the boxplot below:
   color <- B.V
   boxplot(color~filter,notch=T)
For ease of notation, we define vectors H and nH (for "Hyades" and "not Hyades") that contain the data values for the two groups.
   H <- color[filter]
   nH <- color[!filter & !is.na(color)]
In the definition of nH above, we needed to exclude the NA values.

A two-sample t-test may now be performed with a single line:
   t.test(H,nH)
Because it is instructive and quite easy, we may obtain the same results without resorting to the t.test function. First, we calculate the variances of the sample means for each group:
   v1 <- var(H)/92
   v2 <- var(nH)/2586
   c(var(H),var(nH))
The t statistic is based on the standardized difference between the two sample means. Because the two samples are assumed independent, the variance of this difference equals the sum of the individual variances (i.e., v1+v2). Nearly always in a two-sample t-test, we wish to test the null hypothesis that the true difference in means equals zero. Thus, standardizing the difference in means involves subtracting zero and then dividing by the square root of the variance:
   tstat <- (mean(H)-mean(nH))/sqrt(v1+v2)
   tstat
To test the null hypothesis, this t statistic is compared to a t distribution. In a Welch test, we assume that the variances of the two populations are not necessarily equal, and the degrees of freedom of the t distribution are computed using the so-called Satterthwaite approximation:
   (v1 + v2)^2 / (v1^2/91 + v2^2/2585)
The two-sided p-value may now be determined by using the cumulative distribution function of the t distribution, which is given by the pt function.
   2*pt(tstat,97.534)
Incidentally, one of the assumptions of the t-test, namely that each of the two underlying populations is normally distributed, is almost certainly not true in this example. However, because of the central limit theorem, the t-test is robust against violations of this assumption; even if the populations are not roughly normally distributed, the sample means are.

In this particular example, the Welch test is probably not necessary, since the sample variances are so close that an assumption of equal variances is warranted. Thus, we might conduct a slightly more restrictive t-test that assumes equal population variances. Without going into the details here, we merely present the R output:
   t.test(H,nH,var.equal=T)

Kernel density estimation redux

Let's take another look at the color (B minus V) variable:
   hist(color,prob=T)
   d <- density(color,na.rm=T)
   lines(d,col=2,lwd=2,lty=2)
Now we'll use a contributed CRAN package to add confidence limits to the density estimate. The package is called "sm":
   install.packages ("sm")  # R will ask you to choose a mirror site
   library(sm)
We will use the "sm.density" function:
   ?sm.density
   bandwidth <- bw.nrd0(color[!is.na(color)])
   d <- sm.density(color[!is.na(color)],
      h=bandwidth, add=TRUE) # Adds to current plot
   lines(d$eval.points, d$upper, col=3, lwd=2)
   lines(d$eval.points, d$lower, col=3, lwd=2)

Chi-squared tests for categorical data

We begin by creating, albeit artificially, a second categorical variable ("filter", the Hyades indicator, is the first), which we will base on the color (B minus V) variable:
   bvcat <- cut(color, breaks=c(-Inf,.5,.75,1,Inf))
The cut values for bvcat are based roughly on the quartiles of the B minus V variable. Here is a summary of the dataset based only on these two categorical variables:
   table(bvcat,filter)
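The cut points above were chosen by hand; as a quick side check (not part of the original analysis), the sample quartiles of the color variable can be inspected directly:
   quantile(color, probs=c(.25,.5,.75), na.rm=TRUE)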
To perform a chi-squared test of the null hypothesis that the true population proportions falling in the four categories are the same for both the Hyades and non-Hyades stars, use the chisq.test function:
   chisq.test(bvcat,filter)
Since we already know these two groups differ with respect to the B.V variable, the result of this test is not too surprising. But it does give a qualitatively different way to compare these two distributions than simply comparing their means.

The p-value produced above is based on the fact that the chi-squared statistic is approximately distributed like a true chi-squared distribution (on 3 degrees of freedom, in this case) if the null hypothesis is true. However, it is possible to obtain exact p-values, if one wishes to calculate the chi-squared statistic for all possible tables of counts with the same row and column sums as the given table. Since this is rarely practical computationally, the exact p-value may be approximated using a Monte Carlo method. Such a method is implemented in the chisq.test function:
   chisq.test(bvcat,filter,sim=T,B=50000)
The two different p-values we just generated are numerically similar but based on entirely different mathematics. The difference may be summed up as follows: The first method produces the exact value of an approximate p-value, whereas the second method produces an approximation to the exact p-value!

The test above is usually called a chi-squared test of homogeneity. If we observe only one sample, but we wish to test whether the categories occur in some pre-specified proportions, a similar test (and the same R function) may be applied. In this case, the test is usually called a chi-squared test of goodness-of-fit.
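For instance, a goodness-of-fit test of the (surely false) hypothesis that the four color categories are equally likely could be sketched as follows, with the hypothesized proportions chosen purely for illustration:
   chisq.test(table(bvcat), p=rep(0.25,4))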

Maximum likelihood estimation in a simple case

Let's now switch to a new dataset, one that comes from NASA's Swift satellite. The statistical problem at hand is modeling the X-ray afterglow of gamma-ray bursts. First, read in the dataset:
   grb <- read.table ("http://astrostatistics.psu.edu/datasets/GRB_afterglow.dat", 
      header=T, skip=1)
#   grb <- read.table ("GRB_afterglow.dat", header=T, skip=1)
The skip=1 option in the previous statement tells R to ignore the first row in the data file. You will see why this is necessary if you look at the file.

For now, we just consider the second column, which contains X-ray fluxes:
   flux <- grb[,2]
   hist(flux)
The histogram suggests that the univariate distribution has roughly the shape of an exponential distribution (we'll speak more about what this means later). Let us now replot these data in a different, and particularly common, manner that is also suggestive of an exponential distribution.

As a first step, let us calculate something akin to the (x,y) coordinates of the empirical distribution function -- the function that has a jump of size 1/n at every one of the sorted data points.
   n <- length(flux)
   xx <- sort(flux)
   yy <- (1:n)/n
We could now obtain the empirical cdf by connecting the (xx,yy) points using a stairstep pattern. However, we'll look at these points slightly differently.

The exponential distribution has a distribution function given by F(x) = 1-exp(-x/mu) for positive x, where mu>0 is a scalar parameter equal to the mean of the distribution. This implies among other things that log(1-F(x)) = -x/mu is a linear function of x in which the slope is the negative reciprocal of the mean. Let us then look for the characteristic linear pattern if we plot log(1-F(x)) against x using the empirical distribution function for F (subtracting a small amount from yy avoids taking the log of zero):
   plot(xx, log(1-yy+1/n), xlab="flux", 
      ylab="log(1-F(flux))")
The plot certainly looks linear, so let us proceed on the assumption that the flux data are a sample from an exponential distribution with unknown parameter mu.

The overriding question of this section is this: How shall we estimate mu?

As mentioned above, mu is equal to the mean of this population. For a quick refresher on some probability theory, let us recall why this is so: The first step in going from the distribution function F(x) = 1 - exp(-x/mu) to the mean, or expectation, is to obtain the density function by differentiating: f(x) = exp(-x/mu)/mu. Notice that we typically use F(x) to denote the distribution function and f(x) to denote the density function. Next, we integrate x*f(x) over the interval 0 to infinity, which gives the mean, mu.
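If you would like a numerical sanity check of that integral, R's integrate function can evaluate it for a particular value of mu (mu = 2 below is chosen arbitrarily):
   mu <- 2 # an arbitrary value, for illustration only
   integrate(function(x) x * exp(-x/mu)/mu, lower=0, upper=Inf) # should return approximately mu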

Since mu is the population mean, it is intuitively appealing to simply estimate mu using the sample mean. This method, in which we match the population moments to the sample moments and then solve for the parameter estimators, is called the method of moments. Though it is a well-known procedure, we focus instead on a much more widely used method (for good reason) called maximum likelihood estimation.

The first step in maximum likelihood estimation is to write down the likelihood function, which is nothing but the joint density of the dataset viewed as a function of the parameters. Next, we typically take the log, giving what is commonly called the loglikelihood function. Remember that all logs are natural logs unless specified otherwise.

The loglikelihood function in this case is (with apologies for the awkward notation)
l(mu) = -n log(mu) - x1/mu - ... - xn/mu
A bit of calculus reveals that l(mu) is therefore maximized at the sample mean. Thus, the sample mean is not only the method of moments estimator in this case but the maximum likelihood estimate as well.
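If you prefer to avoid the calculus, you can check this numerically by maximizing the loglikelihood with R's optimize function (a sketch; the search interval is simply the range of the data, which necessarily contains the sample mean):
   loglik <- function(mu) -n*log(mu) - sum(flux)/mu
   optimize(loglik, interval=range(flux), maximum=TRUE)
   mean(flux) # should agree closely with the maximizer above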

In practice, however, it is sometimes the case that the linear-looking plot produced earlier is used to estimate mu. As we remarked, the negative reciprocal of the slope should give mu, so there is a temptation to fit a straight line using, say, least-squares regression, then use the resulting slope to estimate mu.
   mean (flux) # This is the MLE
   m1 <- lm(log(1-yy+1/n) ~ xx)
   m1
   -1/m1$coef[2] # An alternative estimator
There is a possible third method that I am told is sometimes used for some kinds of distributions. We start with a histogram, which may be viewed as a rough approximation of the density:
   h <- hist(flux)
All of the information used to produce the histogram is now stored in the h object, including the midpoints of the bins and their heights on a density scale (i.e., a scale such that the total area of the histogram equals one).
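You can list the components of h to see exactly what is stored there:
   names(h) # includes breaks, counts, density, and mids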

To see how to use this information, note that the logarithm of the density function is log f(x) = -log(mu) - x/mu, which is a linear function of x. Thus, plotting the logarithm of the density against x might be expected to give a line.
   counts <- h$counts
   dens <- h$density[counts>0]
   midpts <- h$mids[counts>0]
   plot(midpts, log(dens))
When using linear regression to estimate the slope of the linear pattern just produced, I am told that it is standard to weight each point by the number of observations it represents, which is proportional to the reciprocal of the variance of the estimated proportion of points in that bin. We can obtain both the weighted and unweighted versions here. We can then obtain an estimate of mu using either the intercept, which is -log(mu), or the slope, which is -1/mu:
   m1 <- lm(log(dens) ~ midpts)
   m2 <- lm(log(dens) ~ midpts,
      weights=counts[counts>0])
   exp(-m1$coef[1]) # This is one estimate
   -1/m1$coef[2] # This is another
   exp(-m2$coef[1]) # Yet another
   -1/m2$coef[2] # And another
We have thus produced no fewer than six different estimators of mu (actually seven, except that the MLE and the method of moments estimator are the same in this case). How should we choose one of them?

There are a couple of ways to answer this question. One is to appeal to statistical theory. The method of maximum likelihood estimation is backed by a vast statistical literature that shows it has certain properties that may be considered optimal. The method of moments is also a well-established method, but arguably with less general theory behind it than the method of maximum likelihood. The regression-based methods, on the other hand, are all essentially ad hoc.

A second way to choose among estimators is to run a simulation study in which we repeatedly simulate datasets (whose parameters are then known to us) and test the estimators to see which seems to perform best. In order to do this, we will need to be able to generate random numbers, which is a topic covered below.

Standard error of the MLE

Based on asymptotic theory (i.e., the mathematics of statistical estimators as the sample size tends to infinity), we know that for many problems, the sampling distribution of the maximum likelihood estimator (for theta) is roughly normally distributed with mean theta and variance equal to the inverse of the Fisher information of the dataset. For an i.i.d. sample of size n, the Fisher information is defined to be n times the mean of the square of the first derivative of the log-density function. Equivalently, it is -n times the mean of the second derivative of the log-density function. In many cases, it is easier to use the second derivative than the square of the first derivative.

In the previous example, the log-density function is -log(mu) - x/mu. The second derivative with respect to the parameter is 1/mu^2 - 2x/mu^3. To find the Fisher information, consider x to be a random variable; we know its expectation is mu, so the expectation of the second derivative equals -1/mu^2. We conclude that the Fisher information equals n/mu^2. Intuitively, this means that we get more "information" about the true value of mu when this value is close to zero than when it is far from zero. For the exponential distribution with mean mu, this makes sense.
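As an aside, the differentiation above can be double-checked with R's symbolic derivative function D:
   D(expression(-log(mu) - x/mu), "mu")          # first derivative of the log-density
   D(D(expression(-log(mu) - x/mu), "mu"), "mu") # second derivative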

Earlier we calculated the MLE. Let's give it a name:
   mu.hat <- mean (flux) # The MLE found earlier
The standard error of mu.hat, based on the Fisher information, is (the square root of) the inverse of the F.I. evaluated at mu.hat:
   sqrt(mu.hat^2/n) # SE based on (expected) information
   mu.hat + 1.96 * c(-1,1) * sqrt(mu.hat^2/n) # approx. 95% CI
The F.I. calculated above is sometimes called the expected information because it involves an expectation. As an alternative, we can use what is called the observed information, which is the negative second derivative of the log-likelihood function. In this example, the loglikelihood function evaluated at the estimate mu.hat is equal to
   -n * log(mu.hat) - sum(flux) / mu.hat
and the negative second derivative of this function, evaluated at mu.hat, is
   -n / mu.hat^2 + 2*sum(flux) / mu.hat^3 # observed information at MLE
Notice in this case (though not in every model) that the observed information evaluated at the MLE is equal to the expected information evaluated at the MLE:
   n / mu.hat^2 # expected information at MLE
Do you see why?

Part III: Regression

This part demonstrates some of the capabilities of R for exploring relationships among two (or more) quantitative variables.

Linear and polynomial regression

Earlier, we used exploratory techniques to identify 92 stars from the Hipparcos data set that are associated with the Hyades. We did this based on the values of right ascension, declination, principal motion of right ascension, and principal motion of declination. We then excluded one additional star with a large error of parallax measurement:
   hip <- read.table("http://astrostatistics.psu.edu/datasets/HIP_star.dat",
      header=T, fill=T)
#   hip <- read.table("HIP_star.dat", header=T, fill=T)
   attach(hip)
   filter1 <-  (RA>50 & RA<100 & DE>0 & DE<25)
   filter2 <- (pmRA>90 & pmRA<130 & pmDE>-60 & pmDE< -10) 
   filter  <-  filter1 & filter2 & (e_Plx<5) 
   sum(filter)
Here is a quick example of linear regression relating B.V to logL, where logL is the log luminosity, defined to be (15 - Vmag - 5 log10(Plx)) / 2.5. However, we'll use only the main-sequence Hyades to fit this model:
   mainseqhyades <- filter & (Vmag>4 | B.V<0.2)
   logL <- (15 - Vmag - 5 * log10(Plx)) / 2.5
   x <- logL[mainseqhyades]
   y <- B.V[mainseqhyades]
   plot(x, y)
   regline <- lm(y~x)   
   abline(regline, lwd=2, col=2)
   summary(regline)
Note that the regression line passes exactly through the point (xbar, ybar):
   points(mean(x), mean(y), col=3, pch=20, cex=3)
Here is a regression of y on exp(-x/4):
   newx  <-  exp(-x/4)
   regline2 <- lm(y~newx)
   xseq  <-  seq(min(x), max(x), len=250)
   lines(xseq, regline2$coef %*% rbind(1, exp(-xseq/4)), lwd=2, col=3)
Let's reconsider the gamma ray burst dataset.
   grb <- read.table(
      "http://astrostatistics.psu.edu/datasets/GRB_afterglow.dat",
      header=T, skip=1)
#   grb <- read.table("GRB_afterglow.dat", header=T, skip=1)
We will focus on the first two columns, which are times and X-ray fluxes:
   plot(grb[,1:2],xlab="time",ylab="flux")
This plot is very hard to interpret because of the scales, so let's take the log of each variable:
   x <- log(grb[,1])
   y <- log(grb[,2])
   plot(x,y,xlab="log time",ylab="log flux")
The relationship looks roughly linear, so let's try a linear model (lm in R):
   model1 <- lm(y~x)
   abline(model1, col=2, lwd=2)
The "response ~ predictor(s)" format seen above is used for model formulas in functions like lm .

The model1 object just created is an object of class "lm". The class of an object in R can help to determine how it is treated by functions such as print and summary.
   model1 # same as print(model1)
   summary(model1)
Notice the sigma-hat, the R-squared and adjusted R-squared, and the standard errors of the beta-hats in the output from the summary function.

There is a lot of information contained in model1 that is not displayed by print or summary:
   names(model1)
For instance, we will use the model1$fitted.values and model1$residuals information later when we look at some residuals plots.

Notice that the coefficient estimates are listed in a regression table, which is standard regression output for any software package. This table gives not only the estimates but their standard errors as well, which enables us to determine whether the estimates are very different from zero. It is possible to give individual confidence intervals for both the intercept parameter and the slope parameter based on this information, but remember that a line really requires both a slope and an intercept. Since our goal is really to estimate a line here, maybe it would be better if we could somehow obtain a confidence "interval" for the lines themselves.
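As an aside, the individual intervals just mentioned are available directly from the confint function, which reports them one parameter at a time rather than for the line as a whole:
   confint(model1) # default 95% confidence level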

This may in fact be accomplished. By viewing a line as a single two-dimensional point in (intercept, slope) space, we set up a one-to-one correspondence between all (nonvertical) lines and all points in two-dimensional space. It is possible to obtain a two-dimensional confidence ellipse for the (intercept,slope) points, which may then be mapped back into the set of lines to see what it looks like.

Performing all the calculations necessary to do this is somewhat tedious, but fortunately, someone else has already done it and made it available to all R users through CRAN, the Comprehensive R Archive Network. The necessary functions are part of the "car" (companion to applied regression) package. There are several ways to install the car package, but perhaps the most straightforward is by using the install.packages function. Once the car package is installed, its contents can be loaded into the current R session using the library function:
   install.packages("car")
   library(car)
If all has gone well, there is now a new set of functions, along with relevant documentation. Here is a 95% confidence ellipse for the (intercept,slope) pairs:
   confidence.ellipse(model1)
Remember that each point on the boundary or in the interior of this ellipse represents a line. If we were to plot all of these lines on the original scatterplot, the region they described would be a 95% confidence band for the true regression line. A graduate student, Derek Young, and I wrote a simple function to draw the borders of this band on a scatterplot. You can see this function at www.stat.psu.edu/~dhunter/R/confidence.band.r; to read it into R, use the source function:
   source("http://www.stat.psu.edu/~dhunter/R/confidence.band.r")
   confidence.band(model1)
In this dataset, the confidence band is so narrow that it's hard to see. However, the borders of the band are not straight. You can see the curvature much better when there are fewer points or more variation, as in:
   tmpx <- 1:10
   tmpy <- 1:10+rnorm(10) # Add random Gaussian noise
   confidence.band(lm(tmpy~tmpx))
Also note that increasing the sample size increases the precision of the estimated line, thus narrowing the confidence band. Compare the previous plot with the one obtained by replicating tmpx and tmpy 25 times each:
   tmpx25 <- rep(tmpx,25)
   tmpy25 <- rep(tmpy,25)
   confidence.band(lm(tmpy25~tmpx25))
A related phenomenon is illustrated if we are given a value of the predictor and asked to predict the response. Two types of intervals are commonly reported in this case: A prediction interval for an individual observation with that predictor value, and a confidence interval for the mean of all individuals with that predictor value. The former is always wider than the latter because it accounts for not only the uncertainty in estimating the true line but also the individual variation around the true line. This phenomenon may be illustrated as follows. Again, we use a toy data set here because the effect is harder to observe on our astronomical dataset. As usual, 95% is the default confidence level.
   confidence.band(lm(tmpy~tmpx))
   predict(lm(tmpy~tmpx), data.frame(tmpx=7), interval="prediction")
   text(c(7,7,7), .Last.value, "P",col=4)
   predict(lm(tmpy~tmpx), data.frame(tmpx=7), interval="conf")
   text(c(7,7,7), .Last.value, "C",col=5)

Polynomial curve-fitting: Still linear regression!

Because there appears to be a bit of a bend in the scatterplot, let's try fitting a quadratic curve instead of a linear curve. Note: Fitting a quadratic curve is still considered linear regression. This may seem strange, but the reason is that the quadratic regression model assumes that the response y is a linear combination of 1, x, and x^2. Notice the special form of the lm command when we implement quadratic regression. The I function means "as is" and it resolves any ambiguity in the model formula:
   plot(x,y,xlab="log time",ylab="log flux")
   model2 <- lm(y~x+I(x^2))
   summary(model2)
Here is how to find the estimates of beta using the closed-form solution:
   X <- cbind(1, x, x^2) # Create nx3 X matrix
   solve(t(X) %*% X) %*% t(X) %*% y # Compare to the coefficients above
Plotting the quadratic curve is not a simple matter of using the abline function. To obtain the plot, we'll first create a sequence of x values, then apply the linear combination implied by the regression model using matrix multiplication:
   abline(model1, col=2, lwd=2)
   xx <- seq(min(x),max(x),len=200)
   yy <- model2$coef %*% rbind(1,xx,xx^2)
   lines(xx,yy,lwd=2,col=3)

Model selection using AIC and BIC

Let's compare the AIC and BIC values for the linear and the quadratic fit. Without getting too deeply into details, the idea behind these criteria is that we know the model with more parameters (the quadratic model) should achieve a higher maximized log-likelihood than the model with fewer parameters (the linear model). However, it may be that the additional increase in the log-likelihood achieved with more parameters is not worth the additional parameters. We may test whether it is by penalizing the log-likelihood: for technical reasons, in practice we take -2 times the log-likelihood, add a positive multiple of the number of parameters, and look for the smallest resulting value. For AIC, the positive multiple is 2; for BIC, it is the natural log of n, the number of observations. We can obtain both the AIC and BIC results using the AIC function. Remember that R is case-sensitive, so "AIC" must be all capital letters.
   AIC(model1)
   AIC(model2)
The value of AIC for model2 is smaller than that for model1, which indicates that model2 provides a better fit that is worth the additional parameters. However, AIC is known to overfit at times, meaning that it occasionally favors models with more parameters than they should have. The BIC uses a larger penalty than AIC, and it often seems to do a slightly better job; in this case, however, the two criteria lead to the same conclusion:
   n <- length(x)
   AIC(model1, k=log(n))
   AIC(model2, k=log(n))

Other methods of curve-fitting

Let's try a nonparametric fit, given by loess or lowess. First we plot the linear (red) and quadratic (green) fits, then we overlay the lowess fit in blue:
   plot(x,y,xlab="log time",ylab="log flux")
   abline(model1, lwd=2, col=2)
   lines(xx, yy, lwd=3, col=3)
   npmodel1 <- lowess(y~x)
   lines(npmodel1, col=4, lwd=2)
It is hard to see the pattern of the lowess curve in the plot. Let's replot it with no other distractions. Notice that the type="n" option to the plot function causes the axes to be plotted but not the points.
   plot(x,y,xlab="log time",ylab="log flux", type="n")
   lines(npmodel1, col=4, lwd=2)
This appears to be a piecewise linear curve. An analysis that assumes a piecewise linear curve will be carried out on these data later in the week.

In the case of non-polynomial (but still parametric) curve-fitting, we can use nls. If we replace the response y by the original (nonlogged) flux values, we might posit a parametric model of the form flux = exp(a+b*x), where x=log(time) as before. Compare a nonlinear approach (in red) with a nonparametric approach (in green) for this case:
   flux <- grb[,2]
   nlsmodel1 <- nls(flux ~ exp(a+b*x), start=list(a=0,b=0))
   print(s <- summary(nlsmodel1))
   npmodel2 <- lowess(flux~x)
   plot(x, flux, xlab="log time", ylab="flux")
   lines(xx, exp(s$coef[1]+s$coef[2]*xx), col=2, lwd=2)
   lines(npmodel2, col=3, lwd=2)
Interestingly, the coefficients of the nonlinear least squares fit are different than the coefficients of the original linear model fit on the logged data, even though these coefficients have exactly the same interpretation: If flux = exp(a + b*x), then shouldn't log(flux) = a + b*x? The difference arises because these two fitting methods calculate (and subsequently minimize) the residuals on different scales. Try plotting exp(a + b*xx) on the scatterplot of x vs. flux for both (a,b) solutions to see what happens. Next, try plotting a + b*xx on the scatterplot of x vs. log(flux) to see what happens.
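Here is a sketch of the first suggested comparison, overlaying both fitted curves on the flux scale (the colors are chosen arbitrarily):
   plot(x, flux, xlab="log time", ylab="flux")
   lines(xx, exp(s$coef[1] + s$coef[2]*xx), col=2, lwd=2)           # nls fit
   lines(xx, exp(model1$coef[1] + model1$coef[2]*xx), col=4, lwd=2) # back-transformed lm fit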

If outliers appear to have too large an influence over the least-squares solution, we can also try resistant regression, using the lqs function in the MASS package. The basic idea behind lqs is that the largest residuals (presumably corresponding to "bad" outliers) are ignored. The results for our log(flux) vs. log(time) example look terrible but are very revealing. Can you understand why the output from lqs looks so very different from the least-squares output?
   library(MASS)
   lqsmodel1 <- lqs(y~x, method="lts")
   plot(x,y,xlab="log time",ylab="log flux")
   abline(model1,col=2)
   abline(lqsmodel1,col=3)
Let us now consider least absolute deviation regression, which may be considered a milder form of resistant regression than lqs. In least absolute deviation regression, even large residuals have an influence on the regression line (unlike in lqs), but this influence is less than in least squares regression. To implement it, we'll use a function called rq (regression quantiles) in the "quantreg" package. Like the "car" package, this package is not part of the standard distribution of R, so we'll need to download and install it using the install.packages function.
   install.packages("quantreg") 
   library(quantreg)
Assuming the quantreg package is loaded, we may now compare the least-squares fit (red) with the least absolute deviations fit (green). In this example, the two fits are nearly identical:
   rqmodel1 <- rq(y~x)
   plot(x,y,xlab="log time",ylab="log flux")
   abline(model1,col=2)
   abline(rqmodel1,col=3)
We conclude by mentioning some well-studied and increasingly popular methods of penalized least squares. These include ridge regression (in which the penalty is proportional to the L2 norm of the parameter vector) and LASSO (which uses an L1 norm instead of an L2 norm). There are also other penalized least squares methods, perhaps most notably the SCAD (smoothly clipped absolute deviation) penalty. For an implementation of ridge regression, see the lm.ridge function in the MASS package. To use this function, you must first type library(MASS). For a package that implements LASSO, check out, e.g., the lasso2 package on CRAN.
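As a minimal sketch of ridge regression on the current example (the penalty lambda=1 is an arbitrary choice for illustration; in practice it would be selected more carefully):
   library(MASS)
   lm.ridge(y ~ x, lambda=1)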



Part IV: Techniques based on randomization

This part deals with randomization and some techniques based on randomization. These include a simulation study whose goal is to evaluate several competing estimators of the same quantity and a method of approximating error distributions called bootstrapping.

Generating random numbers in R

First, some semantics: "Random numbers" does not refer solely to uniform numbers between 0 and 1, though this is what "random numbers" means in some contexts. We are mostly interested in generating non-uniform random numbers here. If you are interested in the details of uniform (pseudo-)random number generation in R, check out
   ?.Random.seed
R handles many common distributions easily. To see a list, type
   help.search("distribution", package="stats")
Let's consider the well-known normal distribution as an example:
   ?Normal
The four functions 'rnorm', 'dnorm', 'pnorm', and 'qnorm' give random normals, the normal density (sometimes called the differential distribution function), the normal cumulative distribution function (CDF), and the inverse of the normal CDF (also called the quantile function), respectively. Almost all of the other distributions have similar sets of four functions. The 'r' versions are rbeta, rbinom, rcauchy, rchisq, rexp, rf, rgamma, rgeom, rhyper, rlogis, rlnorm, rmultinom, rnbinom, rnorm, rpois, rsignrank, rt, runif, rweibull, and rwilcox (there is no rtukey because generally only ptukey and qtukey are needed).

As an example, suppose we wish to simulate a vector of 10 independent, standard (i.e., mean 0 and standard deviation 1) normal random variables. We use the rnorm function for this purpose, and its defaults are mean=0 and standard deviation=1. Thus, we may simply type
   rnorm(10)
Suppose we wish to simulate a large number of normal random variables with mean 10 and standard deviation 3, then check a histogram against two normal density functions, one based on the true parameters and one based on estimates, to see how it looks. We'll use 'col=2, lty=2, lwd=3' to make the curve based on the true parameters red (color=2), dashed (line type=2), and wider than normal (line width=3). Also note that we are requesting 100 bins in the histogram (nclass=100) and putting it on the same vertical scale as the density functions (freq=FALSE).
   z <- rnorm(200000, mean=10, sd=3)
   hist(z,freq=FALSE,nclass=100)
   x <- seq(min(z),max(z),len=200)
   lines(x,dnorm(x, mean=10, sd=3),col=2, lty=2, lwd=3) 
   lines(x,dnorm(x,mean=mean(z),sd=sqrt(var(z))))
We can find out what proportion of the deviates lie outside 3 standard deviations from the true mean, a common cutoff used by physical scientists. We can also see the true theoretical proportion:
   sum(abs((z-10)/3)>3)/length(z)
   2*pnorm(-3)
In the first line above, we are using sum to count the number of TRUE's in the logical vector (abs((z-10)/3)>3). This works because logical values are coerced to 0's and 1's when necessary.

The function dnorm has a closed form: With mean=0 and sd=1, dnorm(x) equals exp(-x^2/2)/sqrt(2*pi). By contrast, the CDF, given by pnorm, has no closed form and must be numerically approximated. By definition, pnorm(x) equals the integral of dnorm(t) as t ranges from minus infinity to x. To find a p-value (i.e., the probability of observing a statistic more extreme than the one actually observed), we use pnorm; to construct a confidence interval (i.e., a range of reasonable values for the true parameter), we use the inverse, qnorm.
   pnorm(1:3)-pnorm(-(1:3))
   qnorm(c(.05,.95))
The first line above summarizes the well-known 68, 95, 99.7 rule for normal distributions (these are the approximate proportions lying within 1, 2, and 3 standard deviations from the mean). The second line gives the critical values used to construct a 90% confidence interval for a parameter when its estimator is approximately normally distributed.

Let us now briefly consider an example of a discrete distribution, which means a distribution on a finite or countably infinite set (as opposed to a continuous distribution like the normal). The Poisson distribution, which has a single real-valued parameter lambda, puts all of its probability mass on the nonnegative integers. A Poisson distribution, often used to model data consisting of counts, has mean and variance both equal to lambda.
   k <- 0:10
   dpois(k,lambda=2.5) # or equivalently,
   exp(-2.5)*2.5^k/factorial(k)
Next, simulate some Poisson variables:
   x <- rpois(10000,lambda=2.5)
   table(x)
   mean(x)
   var(x)

The power-law or Pareto distribution

A commonly used distribution in astrophysics is the power-law distribution, more commonly known in the statistics literature as the Pareto distribution. There are no built-in R functions for dealing with this distribution, but because it is an extremely simple distribution it is easy to write such functions.

The density function for the Pareto is f(x) = a*b^a / x^(a+1) for x > b. Here, a and b are fixed positive parameters, where b is the minimum possible value. As an example, consider the log N = -1.5 * log S relationship, where S is the apparent brightness and N is the number of standard candles randomly located in transparent space. Thus, a Pareto distribution with (a+1) = 1.5 is a reasonable, if simplistic, model for the brightness of observed standard candles in space. The b parameter merely reflects the choice of units of measurement.

As another example, consider the Salpeter function, the simple but widely known expression of the initial mass function (IMF), in which the mass of a randomly selected newly formed star has a Pareto distribution with parameter a=1.35.

It turns out that a Pareto random variable is simply b*exp(X), where X is an exponential random variable with rate=a (i.e., with mean=1/a). However, rather than exploiting this simple relationship, we wish to build functions for the Pareto distribution from scratch. Our default values, which may be changed by the user, will be a=0.5 and b=1.
   dpareto <- function(x, a=0.5, b=1) a*b^a/x^(a+1)
Next, we integrate the density function to obtain the distribution function, which is F(x) = 1 - (b/x)^a for x >= b (and naturally F(x) = 0 for x < b):
   ppareto <- function(x, a=0.5, b=1) (x > b)*(1-(b/x)^a)
Note that (x > b) in the above function is coerced to numeric, either 0 or 1.

Inverting the distribution function gives the quantile function. The following simplistic function is wrong unless 0<u<1, so a better-designed function should do some error-checking.
   qpareto <- function(u, a=0.5, b=1) b/(1-u)^(1/a)
Finally, to simulate random Pareto random variables, we use the fact that whenever the quantile function is applied to a uniform random variable, the result is a random variable with the desired distribution:
   rpareto <- function(n, a=0.5, b=1) qpareto(runif(n),a,b)
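For comparison, the relationship with the exponential distribution mentioned earlier would give an equivalent one-line generator (the name rpareto2 is ours, used only to avoid overwriting rpareto):
   rpareto2 <- function(n, a=0.5, b=1) b*exp(rexp(n, rate=a)) # b*exp(X) with X exponential, rate=a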
Creating functions in R, as illustrated above, is a common procedure. Note that each of the arguments of a function may be given a default value, which is used whenever the user calls the function without specifying the value of this parameter. Also note that each of the above functions consists of only a single line; however, longer functions may be created by enclosing them inside curly braces { }.

A few simple plots

The commands below create plots related to the four functions just created.
   par(mfrow=c(2,2))
   x <- seq(1,50,len=200)
   plot(x,dpareto(x),type="l")
   plot(x,ppareto(x),type="l",lty=2)
   u <- seq(.005,.9,len=200)
   plot(u,qpareto(u),type="l",col=3)
   z <- rpareto(200)
   dotchart(log10(z), main="200 random logged Pareto deviates",
      cex.main=.7)
   par(mfrow=c(1,1))
The above commands illustrate some of the many plotting capabilities of R. The par function sets many graphical parameters, for instance, 'mfrow=c(2,2)', which divides the plotting window into a matrix of plots, set here to two rows and two columns. In the plot commands, 'type' is set here to "l" for a line plot; other common options are "p" for points (the default), "b" for connected dots, and "n" for nothing (to create axes only). Other options used: 'lty' sets the line type (1=solid, 2=dashed, etc.), 'col' sets color (1=black, 2=red, 3=green, etc.), 'main' puts a string into the plot title, and 'cex.main' sets the text magnification.
Type '? par' to see a list of the many plotting parameters and options.

A simulation study

Let us posit that the random variable X has a Pareto distribution with parameters a=1.35 and b=1. We will simulate multiple datasets with this property and then apply several different estimation methods to each. To simplify matters, we will assume that the value of b is known, so that the goal is estimation of a.

To evaluate the estimators, we will look at their mean squared error (MSE), which is just what it sounds like: The average of the squared distances from the estimates to the true parameter value of a=1.35.

To illustrate the estimators we'll evaluate, let's start by simulating a single dataset of size 100:
   d <- rpareto(100, a=1.35)
Here are the estimators we'll consider:
  1. The maximum likelihood estimator. Since the density with b=1 is given by f(x) = a/x^(a+1), the loglikelihood function is
    l(a) = n log(a) - (a+1)(log x1 + log x2 + ... + log xn).
    The maximizer may be found using calculus to equal n/(log x1 + ... + log xn). For our dataset, this may be found as follows:
       1/mean(log(d))
    
     We used the sum of logarithms above where we could have used the mathematically equivalent log of the product. For very large samples the former is more numerically stable, since the product itself can overflow, though in this case "100/log(prod(d))" gives exactly the same answer. (A small illustration of this point appears after this list of estimators.)

  2. The method of moments estimator. By integrating, we find that the mean of the Pareto distribution with b=1 is equal to a/(a-1). (This fact requires that a be greater than 1.) Setting a/(a-1) equal to the sample mean and solving for a gives 1/(1-1/samplemean) as the estimator.
       1/(1-1/mean(d))
    
  3. What we'll call the EDF (empirical distribution function) estimator. Since log(1-F(x)) equals -a log(x) when b=1, by plotting the sorted values of log(d) against log(n/n), log((n-1)/n), ..., log(1/n), we should observe roughly a straight line. We may then use least-squares regression to find the slope of the line, which is our estimate of -a:
       lsd <- log(sort(d))
       lseq <- log((100:1)/100)
       plot(lsd, lseq)
       tmp <- lm(lseq ~ lsd)
       abline(tmp,col=2)
       -tmp$coef[2]
    
  4. What we'll call the unweighted histogram estimator. Since log f(x) equals log(a) - (a+1) log(x) when b=1, if we plot the values of log(d) against histogram-based estimates of the log-density function, we should observe roughly a straight line with slope -(a+1) and intercept log(a). Let's use only the slope, since that is the feature that is most often the focus of a plot that is supposed to illustrate a power-law relationship.
       hd <- hist(d,nclass=20,plot=F)
       counts <- hd$counts
       ldens <- log(hd$density[counts>0])
       lmidpts <- log(hd$mids[counts>0])
       plot(lmidpts, ldens)
       tmp <- lm(ldens~lmidpts)
       abline(tmp,col=2)
       -1-as.numeric(tmp$coef[2])
    
  5. What we'll call the weighted histogram estimator. Exactly the same as the unweighted histogram estimator, but we'll estimate the slope using weighted least squares instead of ordinary least squares. The weights should be proportional to the bin counts.
       plot(lmidpts, ldens)
       tmp <- lm(ldens~lmidpts, 
          weights=counts[counts>0])
       abline(tmp,col=2)
       -1-as.numeric(tmp$coef[2])
    
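As promised above, here is a tiny illustration (using an artificial sample) of why summing logarithms is numerically safer than taking the logarithm of a product when the sample is large:
   big <- rep(2, 2000)
   prod(big)        # overflows to Inf
   log(prod(big))   # hence Inf as well
   sum(log(big))    # finite: 2000*log(2)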
Now let's write a single function that will take a vector of data as its argument, then return all five of these estimators.
   five <- function(d) {
      lsd <- log(sort(d))
      n <- length(d)
      lseq <- log((n:1)/n)
      m1 <- lm(lseq ~ lsd)$coef
      hd <- hist(d,nclass=n/5,plot=F)
      counts <- hd$counts
      ldens <- log(hd$density[counts>0])
      lmidpts <- log(hd$mids[counts>0])
      m2 <- lm(ldens~lmidpts)$coef
      m3 <- lm(ldens~lmidpts,
         weights=counts[counts>0])$coef
      out <- c(max.lik=1/mean(log(d)), 
         meth.mom=1/(1-1/mean(d)),
         EDF=-as.numeric(m1[2]), 
         unwt.hist=-1-as.numeric(m2[2]),
         wt.hist=-1-as.numeric(m3[2]))
     return(out)
   }
The last line of the function, "return(out)", gives the value that will be returned. (Since a function returns the value of its last evaluated expression, we could also have written simply "out" as the final line.) Let's test this function on our dataset:
   five(d)
There is no good way to compare these estimators based on a single sample like this. We now need to simulate multiple samples. Let's begin by taking n=100.
   n.100 <- NULL
   for(i in 1:250) { 
      dd <- rpareto(100, a=1.35)
      n.100 <- rbind(n.100, five(dd))
   }
Now we can get estimates of the biases of the estimators (their expectations minus the true parameter) and their variances. Note that we'll use the biased formula for the variance (i.e., the one that uses n instead of n-1 in the denominator) for a technical reason explained below.
   bias.100 <- apply(n.100,2,mean) - 1.35
   var.100 <- apply(n.100,2,var) * (249/250)
It is a mathematical identity that the mean squared error (MSE) equals the square of the bias plus the variance, as we may check numerically for (say) the first column of n.100. However, the identity only works if we use the biased formula for the variance, which is why we used the multiplier (249/250) above.
   mean((n.100[,1]-1.35)^2)
   bias.100[1]^2 + var.100[1]
Thus, we can construct the MSEs and view the results as follows:
   mse.100 <- bias.100^2 + var.100
   rbind(bias.100, var.100, mse.100)
Finally, let's repeat the whole experiment using samples of size 200.
   n.200 <- NULL
   for(i in 1:250) { 
      dd <- rpareto(200, a=1.35)
      n.200 <- rbind(n.200, five(dd))
   }
   bias.200 <- apply(n.200,2,mean) - 1.35
   var.200 <- apply(n.200,2,var) * (249/250)
   mse.200 <- bias.200^2 + var.200
   rbind(bias.200, var.200, mse.200)

Nonparametric bootstrapping of regression standard errors

This part deals with another randomization-based technique called bootstrapping. The term "bootstrapping" means something like improving oneself by one's own unaided efforts, and the statistical technique bearing the name has a bit of this flavor.

Earlier, we used exploratory techniques to identify 92 stars from the Hipparcos data set that are associated with the Hyades. We then looked at a regression relationship between color and log-luminosity for the 88 main sequence stars:
   hip <- read.table("http://astrostatistics.psu.edu/datasets/HIP_star.dat",
      header=T, fill=T)
#   hip <- read.table("HIP_star.dat", header=T, fill=T)
   attach(hip)
   filter1 <-  (RA>50 & RA<100 & DE>0 & DE<25)
   filter2 <- (pmRA>90 & pmRA<130 & pmDE>-60 & pmDE< -10) 
   filter  <-  filter1 & filter2 & (e_Plx<5) 
   sum(filter)
   mainseqhyades <- filter & (Vmag>4 | B.V<0.2)
   logL <- (15 - Vmag - 5 * log10(Plx)) / 2.5
   x <- logL[mainseqhyades]
   y <- B.V[mainseqhyades]
   plot(x,y,pch=20)
   model1 <- lm(y ~ x)
   abline(model1,lwd=2,col=2)
The red line on the plot is the usual least-squares line, for which estimation is easy and asymptotic theory gives easy-to-calculate standard errors for the coefficients:
   summary(model1)$coef
However, suppose we wish to use a resistant regression method such as lqs.
   library(MASS)
   model2 <- lqs(y ~ x)
   abline(model2,lwd=2,col=3)
   model2
In this case, it is not so easy to obtain standard errors for the coefficients, so we turn to bootstrapping. In a standard, or nonparametric, bootstrap, we repeatedly draw samples, with replacement, of the same size as the original sample from the empirical distribution of the data, which in this case consists of the (x, y) pairs of log-luminosity and color for the main sequence stars. We use lqs to fit a line to each sample, then compute the sample covariance of the resulting coefficient vectors. The procedure works like this:
   model2B <- matrix(0,200,2)
   for (i in 1:200) {
      s <- sample(length(x), replace=T)
      model2B[i,] <- lqs(y[s]~x[s])$coef
   }
We may now find the sample covariance matrix for model2B. The (marginal) standard errors of the coefficients are obtained as the square roots of the diagonal entries of this matrix:
   cov(model2B)
   se <- sqrt(diag(cov(model2B)))
   se
The logic of the bootstrap procedure is that we are estimating an approximation of the true standard errors. The approximation involves replacing the true distribution of the data (unknown) with the empirical distribution of the data. This approximation may be estimated with arbitrary accuracy by a Monte Carlo approach, since the empirical distribution of the data is known and in principle we may sample from it as many times as we wish. In other words, as the number of bootstrap replications increases, we get a better estimate of the true value of the approximation. On the other hand, the quality of this approximation depends on the original sample size (88 main sequence stars, in our example), and there is nothing we can do to change it.

An alternative way to generate a bootstrap sample in this example is by generating a new value of each response variable (y) by adding the predicted value from the original lqs model to a randomly selected residual from the original set of residuals. Thus, we resample not the entire bivariate structure but merely the residuals. As an exercise, you might try implementing this approach in R. Note that this approach is not a good idea if you have reason to believe that the distribution of residuals is not the same for all points. For instance, if there is heteroscedasticity or if the residuals contain structure not explained by the model, this residual resampling approach is not warranted.
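For reference, one possible sketch of this residual-resampling bootstrap (assuming exchangeable residuals; the object names and the 200 replications are arbitrary choices) is:
   res2 <- residuals(model2)    # residuals from the original lqs fit
   fit2 <- fitted(model2)       # fitted values from the original lqs fit
   model2C <- matrix(0,200,2)
   for (i in 1:200) {
      ystar <- fit2 + sample(res2, replace=T)   # new responses from resampled residuals
      model2C[i,] <- lqs(ystar ~ x)$coef
   }
   sqrt(diag(cov(model2C)))     # compare with the standard errors found above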

Using the boot package in R

There is a boot package, included with the standard R distribution, that contains many functions relevant to bootstrapping. As a quick example, we show here how to obtain the same kind of bootstrap estimates computed above.
   library(boot)
   mystat <- function(a,b) lqs(a[b,2]~a[b,1])$coef
   model2B.2 <- boot(cbind(x,y), mystat, 200)
   names(model2B.2)
As explained in the help file, the boot function requires as input a function that accepts as arguments the whole dataset and a vector of indices defining the bootstrap sample. This is why we defined the mystat function above. To see output similar to that obtained earlier for the model2B object, look at model2B.2$t:
   cov(model2B.2$t)
   sqrt(diag(cov(model2B.2$t)))
Compare with the output provided by print.boot and the plot produced by plot.boot:
   model2B.2
   plot(model2B.2)
Another related function, for producing bootstrap confidence intervals, is boot.ci.
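For instance, a rough bootstrap percentile interval for the slope (the second coefficient) may be obtained as follows (a minimal illustration; see the boot.ci help file for the other types of intervals it can produce):
   boot.ci(model2B.2, type="perc", index=2)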

Parametric bootstrapping of regression standard errors

We now return to the regression problem studied earlier.

Sometimes, resampling is done from a theoretical distribution rather than from the original sample. For instance, for the simple linear regression of y (color) on x (log-luminosity), we obtain a parametric estimate of the distribution of the residuals, namely, normal with mean zero and standard deviation estimated from the regression:
   summary(model1)
Remember that model1 was defined above as lm(y~x). We observe a residual standard error of 0.0649.

A parametric bootstrap scheme proceeds by simulating a new set of response (y) values using the fitted model
   ystar <- 0.747 - 0.407*x + 0.0649*rnorm(length(x))
Then, we refit a linear model using ystar as the new response, obtaining slightly different values of the regression coefficients. If this is repeated, we obtain an approximation of the joint distribution of the regression coefficients for this model.
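As a minimal sketch of this idea (the object name pboot and the 200 replications are arbitrary; the numerical coefficients are those used in the simulation line above):
   pboot <- matrix(0,200,2)
   for (i in 1:200) {
      ystar <- 0.747 - 0.407*x + 0.0649*rnorm(length(x))
      pboot[i,] <- lm(ystar ~ x)$coef
   }
   cov(pboot)               # approximate joint covariance of the coefficients
   sqrt(diag(cov(pboot)))   # compare with summary(model1)$coef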

Naturally, the same approach could be tried with other regression methods such as those discussed above, but careful thought should be given to the parametric model used to generate the new residuals. In the normal case discussed here, the parametric bootstrap is simple, but it is really not necessary, because standard linear regression theory already gives a very good approximation to the joint distribution of the regression coefficients when the errors are homoscedastic and normal. One possible use of this method is in a model that assumes the absolute residuals are exponentially distributed, in which case least absolute deviation regression as discussed earlier can be justified. The reader is encouraged to implement a parametric bootstrap using the rq function found in the "quantreg" package.
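One possible sketch of that exercise, assuming Laplace (double-exponential) errors whose scale is estimated by the mean absolute residual (the object names and the 200 replications are arbitrary choices), is:
   library(quantreg)
   model3 <- rq(y ~ x)                     # least absolute deviation (median) regression
   b.hat <- mean(abs(residuals(model3)))   # scale estimate for the Laplace errors
   yhat <- y - residuals(model3)           # fitted values
   model3B <- matrix(0,200,2)
   for (i in 1:200) {
      # Laplace errors: a random sign times an exponential variate
      e <- sample(c(-1,1), length(x), replace=T) * rexp(length(x), rate=1/b.hat)
      model3B[i,] <- rq((yhat + e) ~ x)$coef
   }
   sqrt(diag(cov(model3B)))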

Part V: EM algorithms

The class of algorithms called EM algorithms is enormously important in statistics. There are many, many different specific algorithms that can be called EM algorithms, but they have this in common: They seek to iteratively maximize a likelihood function in a situation in which the data may be thought of as incompletely observed.

The name "EM algorithm" has its genesis in a seminal 1977 paper by Dempster, Laird, and Rubin in the Journal of the Royal Statistical Society, Series B. Many distinct algorithms published prior to 1977 were examples of EM, including the Lucy-Richardson algorithm for image deconvolution that is apparently quite well known in astronomy. The major contribution of Dempster et al. was to unify these algorithms and prove certain facts about them. Interesting historical note: They even "proved" an untrue fact that was refuted six years (!) later (even thirty years ago, publications in statistics churned through the pipeline at a snail's pace).

An EM algorithm for right-censored measurements

We'll derive a simple EM algorithm on a toy example: We'll pretend that some of the gamma ray burst flux measurements were right-censored, as follows:
   grb <- read.table ("http://astrostatistics.psu.edu/datasets/GRB_afterglow.dat", 
      header=T, skip=1)
#   grb <- read.table ("GRB_afterglow.dat", header=T, skip=1)
   flux <- grb[,2]
   cflux <- flux
   cflux[flux>=60] <- 60
   n <- length(cflux)
   yy <- (1:n)/n
   plot(sort(cflux),log(1-yy+1/n))
The situation may be viewed like this: The complete dataset is a set of n observations from an exponential distribution with unknown mean mu, say, X1, ..., Xn. What we observe, however, is Z1, ..., Zn, where Zi is defined as min(Xi, 60). Each uncensored observation contributes -log(mu) - Zi/mu to the observed-data loglikelihood, while each censored observation contributes log P(Xi >= 60) = -60/mu; letting m denote the number of censored values and s the sum of the Zi, the loglikelihood is therefore -(n-m)*log(mu) - s/mu:
   m <- sum(flux>=60)
   s <- sum(cflux)
   loglik <- function(mu)
     -(n-m)*log(mu)-s/mu
As it turns out, this loglikelihood function can be maximized explicitly:
   mle <- s/(n-m)
However, we will construct an EM algorithm anyway for two reasons: First, it is instructive to see how the EM operates. Second, not all censoring problems admit a closed-form solution like this one does!

We start by writing down the complete-data loglikelihood for the sample. This is straightforward because the complete data are simply a random sample from an exponential distribution with mean mu. Next, we pick a starting value of mu, say mu0. Then comes the tricky part: We take the conditional expectation of the complete data loglikelihood, conditional on the observed data and assuming that mu0 is the correct parameter. (This will have to happen on the chalkboard!) The result is a function of both mu and mu0, and construction of this function is called the E (expectation) step. Next, we maximize this function over mu. The result of the maximization becomes our next iterate, mu1, and the process repeats.

Let's start with mu0=20. By the memoryless property of the exponential distribution, the conditional expectation of a censored Xi (given that it exceeds 60) is 60 + mu0, so the E step replaces each censored value by 60 + mu0 and the M step gives the update mu1 = (s + m*mu0)/n = s/n + m*mu0/n. This yields the following iterative scheme:
   mu <- 20
   loglik(mu)
   mu <- s/n + m*mu/n; loglik(mu)
   mu <- s/n + m*mu/n; loglik(mu)
   # repeat the last line a few times
Notice that the value of the (observed data) loglikelihood increases at each iteration. This is the fundamental property of any EM algorithm! In fact, it is very helpful when debugging computer code, since there must be a bug somewhere whenever the loglikelihood is observed to decrease. Notice also that the value of mu converges to the true maximum likelihood estimator after a few iterations.
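For convenience, here is a small sketch that wraps this iteration in a loop, stopping when successive values of mu change by a negligible amount (the function name, starting value, and tolerance are arbitrary; s, n, and m are as defined above):
   em.exp <- function(mu=20, tol=1e-8) {
      repeat {
         mu.new <- s/n + m*mu/n    # combined E and M step
         if (abs(mu.new - mu) < tol) break
         mu <- mu.new
      }
      mu
   }
   em.exp()   # should agree with mle = s/(n-m)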

An EM algorithm for a mixture: A simple case

Let's try a two-component mixture of normals model on a quasar absorption line dataset. We only want to read 104 lines of the text file, starting with line 2, and we are only interested in the second of the two columns:
   qso <- scan("http://www.astrostatistics.psu.edu/datasets/QSO_absorb.txt", 
      skip=1, nlines=104)[2*(1:104)]
#   qso <- scan("QSO_absorb.txt", skip=1, nlines=104)[2*(1:104)]
   hist(qso) # Get a look at the dataset.
Here is a function to implement an EM algorithm for a two-component mixture of normals, where the variances of the two components are assumed to be the same. The function may be modified to allow for different variances by using the commented-out lines rather than the ones preceding them:
twocomponentnormalmixtureEM <- function(mu1, mu2, sigsqrd, lambda, x) 
#twocomponentnormalmixtureEM <- function(mu1, mu2, sigsqrd1, sigsqrd2, lambda, x) 
{
  mx <- mean(x)
  nover2 <- length(x)/2
  dl <- 1          # change in loglikelihood between successive iterations
  l <- -Inf        # current loglikelihood (up to an additive constant)
  iterations <- 0
  # b1 and b2 are proportional to lambda and (1-lambda) times the component densities:
  a1 <- (x-mu1)^2; b1 <- lambda*exp(-a1/2/sigsqrd)
#  a1 <- (x-mu1)^2; b1 <- lambda/sqrt(sigsqrd1)*exp(-a1/2/sigsqrd1)
  a2 <- (x-mu2)^2; b2 <- (1-lambda)*exp(-a2/2/sigsqrd)
#  a2 <- (x-mu2)^2; b2 <- (1-lambda)/sqrt(sigsqrd2)*exp(-a2/2/sigsqrd2)

  while (dl > 1e-4) {
    iterations <- iterations+1
    postprobs <- b1/(b1+b2)      # E step: posterior probabilities of component 1
    # M step: update the mixing proportion, the means, and the (common) variance
    lambda <- mean(postprobs)
    mu1 <- mean(postprobs*x)/lambda
    mu2 <- (mx-lambda*mu1)/(1-lambda)
    sigsqrd <- mean(postprobs*a1+(1-postprobs)*a2)
#    sigsqrd1 <- mean(postprobs*a1)/lambda; sigsqrd2 <- mean((1-postprobs)*a2)/(1-lambda)

    oldl <- l
    a1 <- (x-mu1)^2; b1 <- lambda*exp(-a1/2/sigsqrd)
#    a1 <- (x-mu1)^2; b1 <- lambda/sqrt(sigsqrd1)*exp(-a1/2/sigsqrd1)
    a2 <- (x-mu2)^2; b2 <- (1-lambda)*exp(-a2/2/sigsqrd)
#    a2 <- (x-mu2)^2; b2 <- (1-lambda)/sqrt(sigsqrd2)*exp(-a2/2/sigsqrd2)
    l <- sum(log(b1+b2)) - nover2*log(sigsqrd)   # observed-data loglikelihood, up to a constant
#    l <- sum(log(b1+b2))
    dl <- l-oldl
  }
  list(mu=c(mu1,mu2), variance=sigsqrd,
#  list(mu=c(mu1,mu2), variance=c(sigsqrd1, sigsqrd2),
     lambda=lambda, loglik=l, iterations=iterations)
}
Now let's try it:
   hist(qso, nclass=20)
   twocomponentnormalmixtureEM (.6, 1, .1, .1, qso)
This seems to converge to a sensible solution. However, try some other starting values and see what happens. If you find a different solution (i.e., a different local maximum), how does its loglikelihood value compare to that of the first solution?
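For example, one might try modest variations on the starting values, such as the following (chosen arbitrarily), and compare the loglik components of the results:
   twocomponentnormalmixtureEM(1, .6, .1, .5, qso)
   twocomponentnormalmixtureEM(.5, 1.5, .2, .3, qso)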