Exploratory
Data Analysis and Regression
This module demonstrates some of the capabilities of R for exploring
univariate properties of quantitative variables and relationships among
two or more such variables.
Univariate
explorations
We begin by examining the Hipparcos dataset found at
http://astrostatistics.psu.edu/datasets/HIP_star.html.
Type
hip = read.table("http://astrostatistics.psu.edu/datasets/HIP_star.dat",
header=T,fill=T)
Recall the variable names in this dataset using the
names
function. By using
attach,
we can automatically create temporary variables with these names (these variables
are not saved as part of the R session, and they are superseded by any other R objects
of the same names).
names(hip)
attach(hip)
After using the attach command, we can obtain, say, individual
summaries
of the variables:
summary(Vmag)
summary(B.V)
We can also use
summary
on the entire 'hip' matrix.
summary(hip)
Next, summarize some of this information graphically using a
box-and-whisker plot
showing median, quartile and
outliers for the four variables Vmag, pmRA, pmDE, and B.V (the last variable
used to be B-V, or B minus V, but R does not allow certain characters). These
are the 2nd, 6th, 7th, and 9th columns of 'hip'.
boxplot(hip[,c(2,6,7,9)])
Our first attempt looks bad due to different scales of
the variables, so we construct an array of four single-variable plots:
par(mfrow=c(2,2))
for(i in c(2,6,7,9))
boxplot(hip[,i],main=names(hip)[i])
par(mfrow=c(1,1))
The
boxplot
command does more than produce plots; it also returns output that can be
more closely examined. Below, we use the same command as earlier, but we produce the
plot and save the output.
b=boxplot(hip[,c(2,6,7,9)])
names(b)
'b' is an object called a list. To understand its contents, read the help for
boxplot. Suppose we wish to see
all of the outliers in the pmRA variable, which is the second of the four variables
in the current boxplot:
b$names[2]
b$out[b$group==2]
Next, we'll make a more elaborate boxplot. Suppose we wish to examine the values of Vmag, with objects
broken into categories according to the B.V variable:
boxplot(Vmag~cut(B.V,breaks=(-1:6)/2),
notch=T, varwidth=T, las=1, tcl=.5,
xlab=expression("B minus V"),
ylab=expression("V magnitude"),
main="Can you find the red giants?",
cex=1, cex.lab=1.4, cex.axis=.8, cex.main=1)
axis(2, labels=F, at=0:12, tcl=-.25)
axis(4, at=3*(0:4))
The notches in the boxes, produced using "notch=T", can be used to test for differences in
the medians (see
boxplot.stats
for details). With "varwidth=T", the box widths are proportional to the square roots of
the sample sizes. The "cex" options all give scaling factors, relative to default:
"cex" is for plotting text and symbols, "cex.axis" is for axis annotation,
"cex.lab" is for the x and y labels, and
"cex.main" is for main titles.
The two
axis
commands are used to add an axis to the current plot. The first such command
above adds smaller tick marks at all integers, whereas the second one adds the
axis on the right.
Bivariate
plots
The
boxplot
in the plot above is telling us something about the bivariate relationship
between the two variables. Yet it is probably easier to grasp this relationship
by producing a
scatter plot.
plot(Vmag,B.V)
The above plot is a bit busy because of the default plotting character,
set let's use a different one:
plot(Vmag,B.V,pch=".")
Let's now produce the same graph but selecting just for Hyades stars. This open cluster
should be concentrated both in the sky coordinates RA and DE, and also in the proper
motion variables pm_RA and pm_DE. We start by noticing a concentration of stars in the
RA distribution:
plot(Vmag,RA,pch=".")
x1= (RA>50 & RA<100)
Next, we select in the proper motions. (As our cuts through the data
are parallel to the axes, this variable-by-variable classification approach
is sometimes called Classification and Regression
Trees or CART, a very common multivariate classification procedure.)
plot(pmRA[x1],pmDE[x1],pch=20)
plot(pmRA[x1],pmDE[x1],pch=20,
xlim=c(0,200),ylim=c(-200,100))
x2=(pmRA>90 & pmRA<130)
x3=(pmDE>-60 & pmDE< -10) # Space in '< -' is necessary!
x4=x1&x2&x3
After our selection,
we replot the HR diagram of Vmag vs. B.V. This shows the Zero Age Main Sequence, plus four
red giants, with great precision. Outliers in the original Hipparcos dataset have
been effectively removed. However, let's have a final look at the stars we have
identified using the
pairs
command to produce all bivariate plots for pairs of variables.
We'll exclude the first and fifth columns (the HIP identifying number
and the parallax, which is already known to lie in a narrow band).
pairs(hip[x4,-c(1,5)],pch=20)
Note that indexing a matrix or vector using negative integers has the effect of
excluding the corresponding entries.
We see that there is one outlying star in the DE variable, indicating that this
star is not actually one of the Hyades, and one outlying star in the e_Plx variable,
indicating that its measurements are not reliable. We exclude both points:
x5=x4 & (DE>0) & (e_Plx<5)
pairs(hip[x5,-c(1,5)],pch=20)
Note that the x5 variable, a vector of TRUE and FALSE, may be summed to reveal the
number of TRUE's.
sum(x5)
As a final look at these data, let's consider the original plot of Vmag versus B.V
but make the 92 stars we just identified look bigger (pch=20 instead of 46) and
color them red (col=2 instead of 1):
plot(Vmag,B.V,pch=c(46,20)[1+x5],
col=1+x5)
Linear
least-squares regression
Consider the relationship between DE and pmDE among the 92 stars identified above:
plot(DE[x5],pmDE[x5],pch=20)
If we wish to fit a linear regression (least squares) line to this plot, we may do
so using the
lm
(linear model) function. Note the "response ~ predictor(s)" format used
in formulas for functions like
lm .
m=lm(pmDE[x5] ~ DE[x5])
abline(m,lwd=2,col=2)
The m object just created is an object of class lm. The class of
an object in R can help to determine how it is treated by functions such as
print
and
summary.
m # same as print(m)
summary(m)
There is a lot of information contained in m that is not displayed by
print
or
summary:
names(m)
plot(m$fit,m$resid)
The residual plot produced above reveals no irregularities. Note that when referring
to the items in a list by name, as in m$fit or m$resid above, it is only necessary
to type enough of the name to uniquely identify it.
We now look at a different dataset, the SDSS quasar dataset. Download this dataset from
http://www.astrostatistics.psu.edu/datasets/SDSS_quasar.html.
quas=read.table(
"http://astrostatistics.psu.edu/datasets/SDSS_quasar.dat",
header=T)
dim(quas)
names(quas)
We won't need all of these columns. Let's keep only a subset.
quas=quas[,c("z","r_mag","i_mag",
"sig_i","Radio","X.ray","M_i")]
names(quas)
Creating all bivariate plots using
pairs
would take a bit of time with this size dataset; let's
do some exploration using a subset of 2000 rows, which we may
obtain using the
sample
function:
s=sample(nrow(quas),2000)
pairs(quas[s,-c(1,2)],pch=20)
There appear to be some strange outliers in columns like Radio and X.ray. It seems
that values of 0, -1, and -9 are intended to signify missing data.
Let's remove all 0's and -1's and -9's
by changing them to NA, then use
attach
to create new temporary variables with the same names as the columns and redo
the bivariate scatterplots. Remember,
any R objects with the same names as one of the columns will override these
temporary variables.
quas[quas==0 | quas==-1 | quas==-9]=NA
attach(quas)
pairs(quas[s,],pch=20)
Several of these plots look interesting. Let's take a look at the
relationship between z and M_i (using all 46420 points):
plot(z,M_i,pch=".")
abline(lm(M_i~z),col=2)
Clearly a straight line does not describe this relationship well. How about
a quadratic relationship? Note: fitting a quadratic curve to data is still
linear regression. This is because we are modelling the response (M_i) using a linear
combination of two variables (z and z2).
z2=z^2
m2=lm(M_i~z+z2)
x=seq(min(z),max(z),len=200)
y=predict(m2,data.frame(z=x,z2=x^2))
lines(x,y,lwd=2,col=3)
Note that we used the
seq
function to create a sequence of x-axis (predictor) values to plot.
We also used the
predict
function to obtain the y-axis values predicted by model m2 for the
point in the sequence x.
Finally, we used the
lines
function to add line segments to the current plot (in order, in
connect-the-dots fashion).
Other types of
regression
Let's try a non-linear fit, given by
loess or
lowess:
m3=lowess(z,M_i)
lines(m3,col=4,lwd=2)
Let's consider another bivariate relationship:
plot(X.ray,i_mag,pch=46)
abline(lm(i_mag~X.ray),col=2)
It is possible to apply weighted least squares. For instance, if the variances
of the response variables are known, a typical weight is the reciprocal of variance.
in this example, the difference is not large:
m4=lm(i_mag~X.ray,weights=1/sig_i^2)
abline(m4,col=3)
We can also try resistant regression, using the
lqs
function in the MASS package.
library(MASS)
m5=lqs(i_mag~X.ray)
abline(m5,col=4)
Finally, let's use least absolute deviation regression. To do this, we'll
use a function called
rq
(regression quantiles)
in the "quantreg" package. This package is
not part of the standard distribution of R, so we'll need to download it.
In order to do this, we must tell R where to store the installed library
using the
install.packages
function.
install.packages("quantreg",lib="v:/",
CRAN="http://lib.stat.cmu.edu/R/CRAN")
library(quantreg, lib.loc="v:/")
Assuming the quantreg package is loaded, we may now compare the least-squares fit (red)
with the least absolute deviations fit (green). In this example, the two
fits are nearly identical:
m6=lm(i_mag~X.ray)
m7=rq(i_mag~X.ray)
names(m7)
plot(X.ray,i_mag,pch=46)
abline(m6,col=2)
abline(m7,col=3)