# STAT 501: Applied Regression

Text's Data Sets (ASCII files)  (Also see discussion in Week 1 on how to access the data in a PC Lab on campus.)

Other Resources:
Regression Demo
Correlation Demo and more

## Final Exam:  Wed., May 6, 10:10-Noon,  117  THOMAS

Note:  I will be in the office after 2:00 on Monday, May 4, if you have any questions about the material for the final.  You can bring a single sheet of notes to the final.  The final is cumulative.  Logistic regression will not be on the final.

### Instructor:  Tom Hettmansperger

317 Thomas Bldg.
Phone:  865-2211
email:  tph@stat.psu.edu
Office Hours:  3:30-4:30 M, W

418 Thomas Bldg.
Phone:  865-3230
Office Hours:  10:10-11:10 T, Th

### Text:  Applied Linear Statistical Models, 4th ed.,  by Neter, Kutner, Nachtsheim, and Wasserman

Note:  You can also use Applied Linear Regression Models, 3rd ed., by Neter, Kutner, Nachtsheim, and Wasserman.  This is the first 15 chapters of Applied Linear Statistical Models, 4th ed.  In fact, if you are only planning to take Stat 501 and not Stat 502, then Applied Linear Regression Models is a lot easier to carry around.  The 4th edition is on reserve in the Mathematics Library (in McAllister Bldg.) so you can cross-check earlier editions if necessary.

We will cover most of the material in the first 11 chapters (essentially parts 1 and 2 of the text), and selected material from later chapters if time permits.  The emphasis will be on the analysis of data and not on theory.  However, you will need to know some matrix manipulations, and I will discuss this topic in class.  I will also take the prerequisite (6 credits in statistics or Stat 451) seriously.  See Appendix A in the text for a review of the prerequisite material, especially sections A.6 and A.7.

There will be 2 exams and a comprehensive final.  The exams will be worth 100 points each and the final will be worth 200 points.  Homework will be collected and graded but will only count if you are on the borderline.

The first exam will be after chapter 3 or 4.  The second exam will be after chapter 7.  This schedule is tentative and may be changed depending on the pace of the class.

The computer will be an important part of the course.  I will discuss and use Minitab.  You may use any good computer package (such as SAS) that implements the methods discussed in the course.  If you have no experience with a computer, see me at once.

If time permits, I will assign an optional project which will be due sometime near the end of the semester.

I strongly suggest that you check the Stat 501 web page each week.  In addition, I will put comments on computing along with Minitab programs on the page.

Week 1.  Assignment:  Read sections 1.1-1.5.  Optional exercises:  p. 36: 1.6, 1.7, 1.11.

Flash:  There is an error in the formula for b1 in Wednesday's lecture.  I will correct it on Friday.

Here is the data for the muscle mass exercise 1.27, p. 40.  Do this problem and hand it in on Monday, Jan. 19.

age:  64 43 67 56 73 68 56 76 65 45 58 45 53 49 78 71
mass: 91 100 68 87 73 78 80 65 84 116 76 97 100 105 77 82
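Problem 1.27 asks for the calculations by calculator.  If you want to cross-check your arithmetic afterwards and Minitab is not handy, here is a minimal Python sketch of the same Chapter 1 formulas (Python is not part of the course; this is only a check):

```python
# Least squares "by calculator": b1 = Sxy/Sxx and b0 = ybar - b1*xbar,
# applied to the muscle mass data above.
age = [64, 43, 67, 56, 73, 68, 56, 76, 65, 45, 58, 45, 53, 49, 78, 71]
mass = [91, 100, 68, 87, 73, 78, 80, 65, 84, 116, 76, 97, 100, 105, 77, 82]

n = len(age)
xbar = sum(age) / n
ybar = sum(mass) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(age, mass))
sxx = sum((x - xbar) ** 2 for x in age)
b1 = sxy / sxx        # roughly -1.02: mass drops about one unit per year of age
b0 = ybar - b1 * xbar # roughly 148
print(b0, b1)
```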

Also, read sections 1.6-1.8 in the text.

Flash:  For those of you who do not want to use the pc labs for Minitab, I have included a link to all the data sets in the text at the top of this page.  You can copy and paste the data into a Minitab worksheet and save it if you wish.  This will be helpful if you have Minitab at home or in a lab and do not have the disk that comes with the book.

For those of you who will use Minitab in one of the pc labs:

1.  Under Program Groups:  click Spreadsheets and Statistics
2.  Double click Minitab 11
3.  In Minitab:  File>open worksheet
4.  Drives:  select  i:\hammond\instruct
5.  Under Stat directory:  462
6.  Double click the data set you want and it should be loaded into the Minitab worksheet.

Week 2.  Assignment:  By now you have hopefully read sections 1.6-1.8.  I won't talk about maximum likelihood.  You should be aware that the least squares methods are also maximum likelihood methods under the normal model.  Due Monday, Jan. 19:  Problem 1.27.  Do these computations using a calculator.  I will talk about using Minitab on Monday.
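For the curious, the least squares/maximum likelihood connection is essentially one line.  Under the normal error model, the log-likelihood is

```latex
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad
\varepsilon_i \overset{iid}{\sim} N(0,\sigma^2), \qquad
\ell(\beta_0,\beta_1,\sigma^2)
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(Y_i - \beta_0 - \beta_1 X_i\bigr)^2 .
```

For any fixed sigma-squared, maximizing the log-likelihood over (beta0, beta1) is the same as minimizing the sum of squares in the last term, which is exactly the least squares criterion, so the two sets of estimates coincide.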

After discussing Minitab, I will begin Chapter 2.  If you want to read ahead, look at section 2.1.

We will look at the bootstrap for simple regression.  In particular, on Wednesday we carried out one cycle of the bootstrap by hand.  You should do this for yourself.  Use the Mercedes data and resample the residuals (using slips of paper) and compute b1*.  It would be a good idea to do this a couple of times to get a feeling for how the bootstrap works.  On Friday I will give you a general purpose Minitab bootstrap program and illustrate it.
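If you want to rehearse the hand exercise on a machine, here is a small Python sketch of one cycle of the residual bootstrap.  The Mercedes data from class is not reproduced on this page, so the numbers below are made up purely for illustration:

```python
import random

# Illustrative data only -- NOT the Mercedes data from class.
x = [1, 2, 3, 4, 5, 6]
y = [24, 20, 19, 15, 13, 12]

def ls_fit(x, y):
    """Simple least squares: return (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

b0, b1 = ls_fit(x, y)
fits = [b0 + b1 * xi for xi in x]
resids = [yi - fi for yi, fi in zip(y, fits)]

# One bootstrap cycle: resample residuals with replacement (the "slips of
# paper"), attach them to the fitted values, and refit to get b1*.
estar = [random.choice(resids) for _ in x]
ystar = [fi + ei for fi, ei in zip(fits, estar)]
_, b1star = ls_fit(x, ystar)
print(b1star)
```

Running this a few times, as suggested above, shows how b1* varies from cycle to cycle.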

Week 3.  Here is an exercise that will be due Wed. Jan. 28.  Suppose we want to investigate the relationship between  driver test scores and the number of beers that the driver has drunk.  Let Y=test score and X=number of beers.  Data:

X:  0  1  2  3  4
Y: 80 84 76 70 72

The intercept in the model is beta0 = the expected score with no beers; this is a baseline score.  Let M3 = beta0 + 3*beta1; this is the expected score after 3 beers.  Using the data:
a.  Estimate M3
b.  Estimate the standard deviation of M3
(Hint:  Change the bootstrap program to compute M3 and run it 500 times.)
You might assign a score of M3/beta0.  Then people with different initial skills (no beer) would be more comparable.
c.  Estimate R=M3/beta0
d.  Estimate the standard deviation of R.
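If you would rather modify a program than run Minitab, here is one way the whole exercise can be sketched in Python.  The point estimates in parts a and c are exact; the bootstrap standard deviations in parts b and d follow the residual-resampling scheme from class (500 cycles, as the hint suggests):

```python
import random

x = [0, 1, 2, 3, 4]
y = [80, 84, 76, 70, 72]

def ls_fit(x, y):
    """Simple least squares: return (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

b0, b1 = ls_fit(x, y)        # b0 = 82.4, b1 = -3 exactly for these data
m3 = b0 + 3 * b1             # part a: expected score after 3 beers = 73.4
r = m3 / b0                  # part c: ratio to the no-beer baseline

# parts b and d: bootstrap the residuals 500 times and take the sample
# standard deviation of the M3* and R* values.
fits = [b0 + b1 * xi for xi in x]
resids = [yi - fi for yi, fi in zip(y, fits)]
m3s, rs = [], []
for _ in range(500):
    ystar = [fi + random.choice(resids) for fi in fits]
    bb0, bb1 = ls_fit(x, ystar)
    m3s.append(bb0 + 3 * bb1)
    rs.append((bb0 + 3 * bb1) / bb0)

def sd(v):
    m = sum(v) / len(v)
    return (sum((vi - m) ** 2 for vi in v) / (len(v) - 1)) ** 0.5

print(m3, r, sd(m3s), sd(rs))
```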

Week 4.  Assignment Due Wed., Feb. 4:  Exercises 2.27 and 2.28 p90 in the text.  You should do these with a calculator and then check them on the computer using Minitab or whatever program you are using.  In addition, Due Fri., Feb. 6:  In the muscle mass problem, let P = the percent reduction in the expected muscle mass over a 5 year period beginning with the average age in the study.  Estimate P and estimate the standard deviation of P.  Suppose you, as the researcher, believe that P will exceed 3%.  Does your analysis support this or not?

Read the rest of Chapter 2.  I will discuss various topics this week.  After prediction intervals, we will look at analysis of variance, the general linear test, and r-squared in detail.

Week 5.  If you did not hand in the problem concerning percent reduction on Friday, it is due Monday.  We will consider r-squared first and then move to Chapter 3, diagnostics and remedial measures.  This is one of the most important parts of the course.  A lot of the difficulties with the sensitivity of least squares can be demonstrated using the link at the top of this page under Other Resources:  Regression Demos.  I will illustrate it in class. Due Wed., Feb. 11:  Exercise 2.29 in the book.

Week 6.  Chapter 3, diagnostics and remedial measures this week.  Note that leverage is not taken up in the text until later, after the introduction of multiple regression.  We, however, will look at it closely for simple regression.  I will concentrate lectures on the following sections and material:  Sections 3.1, 3.2, 3.3 (plot of standardized resids vs x, boxplot of standardized residuals, and boxplot of x values).  Further, in section 3.3 look at: Nonlinearity of Regression Function, Nonconstancy of Error Variance, Presence of Outliers.  Then section 3.7 on lack of fit.  A little of section 3.9 on transformations, skipping the Box-Cox transformations.  I may briefly discuss some of the material in section 3.10 on regression smoothing.
I anticipate assigning the following problems as we cover the relevant topics:  p. 146: 3.7 (a, b, c only), 3.9, 3.13, 3.16 (a, c, d, e, and f only). Due Friday:  3.7, 3.9, and Levene's test on the muscle mass data.
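If you want to preview the Levene computation outside Minitab, here is one way the modified (Brown-Forsythe) version can be set up in Python with scipy on the muscle mass data from Week 1.  Splitting the residuals at the median age is my choice for illustration; the grouping used in class may differ:

```python
import numpy as np
from scipy import stats

age = np.array([64, 43, 67, 56, 73, 68, 56, 76, 65, 45, 58, 45, 53, 49, 78, 71])
mass = np.array([91, 100, 68, 87, 73, 78, 80, 65, 84, 116, 76, 97, 100, 105, 77, 82])

# Fit the simple regression and take residuals.
b1, b0 = np.polyfit(age, mass, 1)
resid = mass - (b0 + b1 * age)

# Modified Levene: split the residuals into low-x and high-x groups and
# compare spread via absolute deviations from each group's median residual.
low = resid[age <= np.median(age)]
high = resid[age > np.median(age)]
stat, pval = stats.levene(low, high, center='median')
print(stat, pval)
```

A small p-value would suggest the error variance is not constant across the age range.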

Week 7.  We'll finish chapter 3 this week.  Due Friday, Feb. 27: Exercises 3.13, 3.15, 3.16 (a, c, d, e, and f).  On Friday if there is time, I will talk about simultaneous inference.  Read pp152-159.  Next week, after the exam, we will begin matrix notation and some manipulations in preparation for multiple regression.

Week 8.  Exam week.  We will cover briefly the multiplicity problem.  You should read Sections 4.1, 4.2, and 4.3 in Chapter 4.  The other topics in Chapter 4 are of specialized interest and individuals may wish to read them.  If you do and want to discuss some aspect of the material please make an appointment to see me.  I strongly recommend that you read Section 4.7.  This material is a brief introduction to issues in designing an experiment and choosing the X values judiciously.

After break we will begin multiple regression.  First we must introduce some material from matrix theory.  Matrices will be mainly used for formulating the multiple regression problem.  I suggest reading Sections 5.1-5.7 and doing exercises 5.1 and 5.2 over the break.  These exercises will familiarize you with the basic matrix manipulations.

Week 9.  We begin Chapter 6 this week.  Read Section 6.1 carefully.  I will illustrate many of the ideas on the following grandfather clock data.  You may wish to work with the data yourself on some of the examples from class.  If you wish to copy and paste it into a Minitab worksheet, the columns are price, age, bidders, age-squared, bidders-squared, and age times bidders:

1055 108 14 11664 196 1512
729 108 6 11664 36 648
1175 111 15 12321 225 1665
785 111 7 12321 49 777
946 113 9 12769 81 1017
1080 115 12 13225 144 1380
744 115 7 13225 49 805
1024 117 11 13689 121 1287
1152 117 13 13689 169 1521
1336 126 10 15876 100 1260
1235 127 13 16129 169 1651
845 127 7 16129 49 889
1253 132 10 17424 100 1320
1297 137 9 18769 81 1233
1713 137 15 18769 225 2055
1147 137 8 18769 64 1096
854 143 6 20449 36 858
1522 150 9 22500 81 1350
1092 153 6 23409 36 918
1047 156 6 24336 36 936
1822 156 12 24336 144 1872
1483 159 9 25281 81 1431
1884 162 11 26244 121 1782
1262 168 7 28224 49 1176
2131 170 14 28900 196 2380
1545 175 8 30625 64 1400
1792 179 9 32041 81 1611
1979 182 11 33124 121 2002
1550 182 8 33124 64 1456
2041 184 10 33856 100 1840
1593 187 8 34969 64 1496
1356 194 5 37636 25 970
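If you would like to experiment with this data outside Minitab, the matrix formulation of Section 6.1 can be tried directly.  Here is a Python sketch (Python is not part of the course) regressing price on age and bidders, with b = (X'X)^{-1} X'y computed via least squares:

```python
import numpy as np

# price, age, bidders for the 32 grandfather clocks above
rows = [
    (1055, 108, 14), (729, 108, 6),  (1175, 111, 15), (785, 111, 7),
    (946, 113, 9),   (1080, 115, 12), (744, 115, 7),  (1024, 117, 11),
    (1152, 117, 13), (1336, 126, 10), (1235, 127, 13), (845, 127, 7),
    (1253, 132, 10), (1297, 137, 9),  (1713, 137, 15), (1147, 137, 8),
    (854, 143, 6),   (1522, 150, 9),  (1092, 153, 6),  (1047, 156, 6),
    (1822, 156, 12), (1483, 159, 9),  (1884, 162, 11), (1262, 168, 7),
    (2131, 170, 14), (1545, 175, 8),  (1792, 179, 9),  (1979, 182, 11),
    (1550, 182, 8),  (2041, 184, 10), (1593, 187, 8),  (1356, 194, 5),
]
y = np.array([r[0] for r in rows], dtype=float)
X = np.column_stack([np.ones(len(rows)),
                     [r[1] for r in rows],    # age
                     [r[2] for r in rows]])   # bidders
b, *_ = np.linalg.lstsq(X, y, rcond=None)     # solves the normal equations
fitted = X @ b
r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
print(b, r2)
```

Both slopes come out positive, as you would expect: older clocks and more bidders both push the auction price up.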

Week 10.  In addition to the exercise given in class that is due Monday, March 23, you should read Sections 6.1 through 6.5.  On  Monday I will assign Exercise 6.15 parts a-f and it will be due Wednesday or Friday depending on how far I get on Monday.

Read Sections 6.6 and 6.7 for a discussion of inference on the regression coefficients, expected values, and predicted new values.  Exercises 6.16 and 6.17 will be due sometime next week, probably Wednesday, April 1.  You should also read 6.8 for a brief discussion of diagnostics.  Finally, 6.9 may be helpful since it is an extended worked out example.

Week 11.  Exercises 6.16 and 6.17 are due Wednesday, April 1.  In the meantime we will begin with the extra sum of squares principle.  Read Sections 7.1 and 7.2.  Due Friday, April 3:  Exercises 7.5 and 7.6 p317 (Patient Satisfaction data) on extra sums of squares methods.

Week 12.  Exam this week.  Begin discussion of variance inflation factors.  The discussion of multicollinearity is in Sections 7.5, 7.6, and 9.5.  Exercises due Friday, April 17:  7.14, 7.18, and 9.17.
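As a preview of the variance inflation factor discussion, here is a Python sketch of the definition VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors.  The example with x and x-squared mimics the collinearity we will see in the polynomial models:

```python
import numpy as np

def vif(X):
    """Variance inflation factors for the columns of a predictor matrix X
    (no intercept column).  VIF_j = 1 / (1 - R_j^2), with R_j^2 from the
    regression of column j on the other columns (plus an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        fit = others @ b
        r2 = 1 - np.sum((X[:, j] - fit) ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return out

# illustrative: a predictor and its square are highly collinear
x = np.arange(1.0, 21.0)
print(vif(np.column_stack([x, x ** 2])))
```

For uncentered x and x-squared both factors come out well above 10, which is one common rule-of-thumb alarm level; centering x first shrinks them dramatically.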

Week 13.  We will finish up the discussion of variance inflation factors and extend the diagnostic to several predictors.  We will apply this to polynomial models (especially the clock data).  Read sections 7.7 and 7.8 in the text.  Much of this will be familiar and I will not repeat all this material.

Week 14.  End of week 13:  begin selection of variables.  Read sections 8.1 and 8.2; they provide a good overview of the problem of selection of variables and also provide an example.  Then read section 8.3, especially pages 336-345.  Then read pages 346 and 347 on Best Subsets.  Be sure you understand R-squared and adj R-squared along with Cp.  Due Wednesday, April 21:  Look at exercises 8.9 and 8.11.  In these exercises use Minitab's best subsets regression to try to identify a good model.  You do not have to do all the other parts of the exercises.  You might put in some of the quadratic terms if the LOF test suggests that you have curvature; then apply best subsets regression to the full set of terms.  Don't forget to center a variable if you put in quadratic terms in that variable.
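Minitab's best subsets command does the search for you.  For the curious, here is a small Python sketch of what it computes: R-squared, adjusted R-squared, and Mallows' Cp for every subset of the candidate predictors.  The data below is made up for illustration, not the exercise data:

```python
import itertools
import numpy as np

def fit_sse(X, y):
    # least squares fit; return the error sum of squares
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ b) ** 2))

def best_subsets(X, y, names):
    """R^2, adjusted R^2, and Mallows' Cp for every subset of the predictor
    columns of X (X carries no intercept column)."""
    n, k = X.shape
    ones = np.ones((n, 1))
    ssto = float(np.sum((y - y.mean()) ** 2))
    mse_full = fit_sse(np.hstack([ones, X]), y) / (n - k - 1)
    out = []
    for p in range(1, k + 1):
        for cols in itertools.combinations(range(k), p):
            sse = fit_sse(np.hstack([ones, X[:, list(cols)]]), y)
            r2 = 1 - sse / ssto
            adj = 1 - (n - 1) / (n - p - 1) * sse / ssto
            cp = sse / mse_full - (n - 2 * (p + 1))   # p+1 parameters incl. intercept
            out.append(([names[j] for j in cols], r2, adj, cp))
    return out

# made-up data: y really depends on x1 and x2 only
rng = np.random.default_rng(0)
x1, x2, x3 = rng.normal(size=30), rng.normal(size=30), rng.normal(size=30)
y = 2 + 3 * x1 - x2 + rng.normal(size=30)
results = best_subsets(np.column_stack([x1, x2, x3]), y, ["x1", "x2", "x3"])
for names_, r2, adj, cp in results:
    print(names_, round(r2, 3), round(adj, 3), round(cp, 1))
```

Note that the full model always has Cp equal to its own number of parameters, so the interesting comparisons are among the smaller subsets, where you look for Cp near p.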

Week 15.  Note:  For the exercise due Monday, April 27, in the second set of 10 rows, the first 60 (Humidity) should be 40 in three places.  We will continue the discussion of weighted least squares.  In preparation for robust regression, read about studentized residuals and studentized deleted residuals, pp372-375.  Then we will need DFFITS, discussed on pp379-380.  Finally we will be ready for Section 10.3, pp416-424.

Exercise for Friday, May 1:  I will discuss this data and answer questions on Friday but you do not have to hand it in.  Data from a study of shelf life of packaged food:  y = moisture content of cereal and x = days on the shelf.  The idea is that the soggier the cereal the more unappetizing it is.
Here is the data:  row, y, x (cut and paste this into a Minitab worksheet)

1   2.8    0
2   3.0    3
3   3.1    6
4   3.2    8
5   3.4   10
6   3.4   13
7   3.5   16
8   3.1   20
9   3.8   24
10   4.0   27
11   4.1   30
12   4.3   34
13   4.4   37
14   5.9   41

Plot y vs x.  Get the regular regression equation.  Find leverage, deleted t residuals, and DFFITS.  Construct the weights and get the robust regression.  Compare the regular regression equation to the robust regression equation.  Plot t resids vs fits.
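Here is one way the regular-versus-robust comparison can be sketched in Python rather than Minitab.  The weight construction below uses Huber weights inside iteratively reweighted least squares, which is one standard choice; the weights we build in class (via the deleted residuals and DFFITS) may differ, so treat this only as an illustration:

```python
import numpy as np

# cereal shelf-life data from above: y = moisture content, x = days on shelf
y = np.array([2.8, 3.0, 3.1, 3.2, 3.4, 3.4, 3.5, 3.1, 3.8, 4.0,
              4.1, 4.3, 4.4, 5.9])
x = np.array([0, 3, 6, 8, 10, 13, 16, 20, 24, 27, 30, 34, 37, 41], dtype=float)
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares for comparison.
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Iteratively reweighted least squares with Huber weights: points with large
# standardized residuals (like the last observation) get downweighted.
b = b_ols.copy()
for _ in range(20):
    r = y - X @ b
    s = np.median(np.abs(r - np.median(r))) / 0.6745      # robust scale (MAD)
    absu = np.abs(r / s)
    w = np.where(absu <= 1.345, 1.0, 1.345 / np.maximum(absu, 1e-12))
    sw = np.sqrt(w)
    b, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

print("OLS:   ", b_ols)
print("robust:", b)
```

The high-leverage point at x = 41 pulls the ordinary least squares slope upward; the robust fit downweights it, so the robust slope comes out smaller.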