Lesson #1: Simple Linear Regression

Drawing conclusions about β0 and β1

Recall that we are ultimately always interested in drawing conclusions about the population, not the particular sample we observed. In the simple regression setting, we are often interested in learning about the population intercept β0 and the population slope β1. As you know, confidence intervals and hypothesis tests are two related, but different, ways of learning about the values of population parameters. Here, we will learn how to calculate confidence intervals and conduct hypothesis tests for both β0 and β1.

Let's revisit the example concerning the relationship between skin cancer mortality and state latitude. The response variable y is the mortality rate (number of deaths per 10 million people) of white males due to malignant skin melanoma from 1950-1959. The predictor variable x is the latitude (degrees North) at the center of each of 49 states in the United States. A subset of the data look like:

#     State        Latitude   Mortality
1     Alabama      33.0       219
2     Arizona      34.5       160
3     Arkansas     35.0       170
4     California   37.5       182
5     Colorado     39.0       149
---   ---          ---        ---
49    Wyoming      43.0       134

and a plot of the data with the estimated regression equation looks like:

[Figure: scatter plot of mortality versus latitude with the estimated regression line]

Is there a relationship between state latitude and skin cancer mortality? Certainly, since the estimated slope of the line, b1, is -5.98, not 0, there is a relationship between state latitude and skin cancer mortality in the sample of 49 data points. But we want to know whether there is a relationship in the population between latitude and skin cancer mortality. That is, we want to know whether the population slope β1 differs from 0.

(1-α)100% t-interval for slope parameter β1

The formula for the confidence interval for β1, in words, is:

Sample estimate ± (t-multiplier × standard error)

and, in notation, is:

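$$b_1 \pm t_{(\alpha/2,\, n-2)} \times \operatorname{se}(b_1), \qquad \operatorname{se}(b_1) = \frac{\sqrt{MSE}}{\sqrt{\sum (x_i - \bar{x})^2}}$$

where MSE is the mean square error and the sum runs over the n sample values of the predictor.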

The resulting confidence interval not only gives us a range of values that is likely to contain the true unknown value β1, but also allows us to answer the research question "is the predictor x related to the response y?" If the confidence interval for β1 contains 0, then we conclude that there is no evidence of a relationship between the predictor x and the response y in the population. On the other hand, if the confidence interval for β1 does not contain 0, then we conclude that there is evidence of a relationship between the predictor x and the response y in the population.

An α-level hypothesis test for slope parameter β1

We follow standard hypothesis test procedures in conducting a hypothesis test for the slope β1. First, we specify the null and alternative hypotheses:

Null hypothesis H0 : β1 = some number β
Alternative hypothesis HA : β1 ≠ some number β

The phrase "some number β" means that you can test whether or not the population slope takes on any value. Most often, however, we are interested in testing whether β1 is 0. By default, Minitab conducts the hypothesis test for testing whether or not β1 is 0. But, the alternative hypothesis can also state that β1 is less than (<) some number β or greater than (>) some number β.

Second, we calculate the value of the test statistic using the following formula:

$$t^* = \frac{b_1 - \beta}{\operatorname{se}(b_1)} = \frac{b_1 - \beta}{\sqrt{MSE} / \sqrt{\sum (x_i - \bar{x})^2}}$$

Third, we use the resulting test statistic to calculate the P-value. As always, the P-value is the answer to the question "how likely is it that we’d get a test statistic t* as extreme as we did if the null hypothesis were true?" The P-value is determined by referring to a t-distribution with n-2 degrees of freedom.

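Finally, we make a decision. If the P-value is smaller than the significance level α, we reject the null hypothesis in favor of the alternative. If we conduct a "two-tailed, not-equal-to-0" test, we conclude "there is sufficient evidence at the α level to conclude that there is a linear relationship in the population between the predictor x and the response y." If the P-value is larger than the significance level α, we fail to reject the null hypothesis.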

Drawing conclusions about slope parameter β1 using Minitab

Let's see how we can use Minitab to calculate confidence intervals and conduct hypothesis tests for the slope β1. Minitab's regression analysis output for our skin cancer mortality and latitude example appears below.

The line pertaining to the latitude predictor, Lat, in the summary table of predictors has been bolded. It tells us that the estimated slope coefficient b1, under the column labeled Coef, is -5.9776. The estimated standard error of b1, denoted se(b1), in the column labeled SE Coef for "standard error of the coefficient," is 0.5984.

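[Minitab output; the summary table of predictors, with the values quoted in this section:]

Predictor   Coef      SE Coef   T       P
Constant    389.19    23.81     16.34   0.000
Lat         -5.9776   0.5984    -9.99   0.000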

By default, the test statistic is calculated assuming the user wants to test that the slope is 0. Dividing the estimated coefficient -5.9776 by the estimated standard error 0.5984, Minitab reports that the test statistic T is -9.99.

By default, the P-value is calculated assuming the alternative hypothesis is a "two-tailed, not-equal-to" hypothesis. Upon calculating the probability that a t-random variable with n-2 = 47 degrees of freedom would be larger than 9.99, and multiplying the probability by 2, Minitab reports that P is 0.000 (to three decimal places). That is, the P-value is less than 0.001.

Because the P-value is so small (less than 0.001), we can reject the null hypothesis and conclude that β1 does not equal 0. There is sufficient evidence, at the α = 0.05 level, to conclude that there is a relationship in the population between skin cancer mortality and latitude.
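For readers working outside Minitab, the test statistic and two-tailed P-value can be approximated with a short script. This is only a sketch: the helper names `t_density` and `upper_tail` are made up for illustration, and the tail probability is obtained by simple numerical integration of the t density rather than by whatever method Minitab uses internally.

```python
import math

def t_density(x, df):
    """Density of a t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail(t, df, steps=20000):
    """Approximate P(T > t) for t >= 0 with the midpoint rule.

    The substitution x = t + s / (1 - s) maps s in [0, 1) onto [t, infinity),
    so the tail area is integrated directly.
    """
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) / steps
        x = t + s / (1.0 - s)
        total += t_density(x, df) / (1.0 - s) ** 2
    return total / steps

# Values quoted in the Minitab output for the latitude predictor.
b1, se_b1, n = -5.9776, 0.5984, 49

t_star = b1 / se_b1                             # test statistic for H0: beta1 = 0
p_value = 2 * upper_tail(abs(t_star), n - 2)    # two-tailed, n - 2 = 47 df

print(f"t* = {t_star:.2f}, two-tailed P-value = {p_value:.3g}")
print("Reject H0 at alpha = 0.05:", p_value < 0.05)
```

The P-value computed this way is far below 0.001, which agrees with the 0.000 that Minitab displays to three decimal places.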

Minitab Note. The P-value in Minitab's regression analysis output is always calculated assuming the alternative hypothesis is testing the two-tailed β1 ≠ 0. If your alternative hypothesis is the one-tailed β1 < 0 or β1 > 0, you have to divide the P-value that Minitab reports in the summary table of predictors by 2. Note, though, that this "trick" of dividing the P-value by 2 is only appropriate if the sign of your estimated slope b1 agrees with the direction specified in the alternative hypothesis.
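The conversion described in the note can be written as a small helper. This is an illustrative sketch, not a Minitab feature; the function name and the "less"/"greater" labels are assumptions. When the sign of b1 disagrees with the alternative, the one-tailed P-value is the complement, 1 minus half the two-tailed value, which is why halving alone would be wrong in that case.

```python
def one_tailed_p_value(two_tailed_p, b1, alternative):
    """Convert a two-tailed P-value for H0: beta1 = 0 into a one-tailed one.

    alternative is "less" for HA: beta1 < 0 or "greater" for HA: beta1 > 0.
    """
    if alternative not in ("less", "greater"):
        raise ValueError('alternative must be "less" or "greater"')
    # Does the estimated slope point in the direction the alternative claims?
    sign_agrees = (b1 < 0) if alternative == "less" else (b1 > 0)
    half = two_tailed_p / 2
    return half if sign_agrees else 1 - half

# Hypothetical two-tailed P-value of 0.04 with a negative estimated slope.
print(one_tailed_p_value(0.04, -5.9776, "less"))
print(one_tailed_p_value(0.04, -5.9776, "greater"))
```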

Unfortunately, Minitab does not calculate a 95% confidence interval for β1 for you. It's easy to calculate one though using the information in the output. You just need to use Minitab to find the t-multiplier for you. It is t(0.025, 47) = 2.0117. Then, the 95% confidence interval for β1 is -5.9776 ± 2.0117(0.5984) or (-7.2, -4.8).

We can be 95% confident that the population slope is between -7.2 and -4.8. That is, we can be 95% confident that for every additional one-degree increase in latitude, the mean skin cancer mortality rate decreases between 4.8 and 7.2 deaths per 10 million people.
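The interval arithmetic is easy to check by hand or with a few lines of code. The sketch below uses a hypothetical helper, `t_interval`, and takes the t-multiplier 2.0117 as given in the text rather than computing it.

```python
def t_interval(estimate, std_error, t_multiplier):
    """Return (lower, upper) for estimate +/- t_multiplier * std_error."""
    margin = t_multiplier * std_error
    return estimate - margin, estimate + margin

# Slope estimate, its standard error, and t(0.025, 47) from the text.
lower, upper = t_interval(-5.9776, 0.5984, 2.0117)
contains_zero = lower <= 0.0 <= upper

print(f"95% CI for the slope: ({lower:.1f}, {upper:.1f})")
print("Interval contains 0:", contains_zero)
```

Because the interval excludes 0, it leads to the same conclusion as the two-tailed test at the α = 0.05 level.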

Factors affecting the width of a confidence interval for β1

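Recall that, in general, we want our confidence intervals to be as narrow as possible. If we know what factors affect the width of a confidence interval for the slope β1, we can control them to ensure that we obtain a narrow interval. The factors can be easily determined by studying the formula for the confidence interval:

$$b_1 \pm t_{(\alpha/2,\, n-2)} \times \left( \frac{\sqrt{MSE}}{\sqrt{\sum (x_i - \bar{x})^2}} \right)$$

First, subtracting the lower endpoint of the interval from the upper endpoint of the interval, we determine that the width of the interval is:

$$\text{Width} = 2 \times t_{(\alpha/2,\, n-2)} \times \left( \frac{\sqrt{MSE}}{\sqrt{\sum (x_i - \bar{x})^2}} \right)$$

So, how can we affect the width of our resulting interval for β1?

    1. As the confidence level decreases, the t-multiplier decreases, and the interval narrows.
    2. As MSE decreases, the interval narrows.
    3. As the predictor values become more spread out, the sum of squared deviations of the x values from their mean increases, and the interval narrows.
    4. As the sample size n increases, the t-multiplier decreases and the sum of squared deviations typically grows, so the interval narrows.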
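A small numerical sketch can make the role of each term in the width concrete. Everything below is hypothetical: the predictor values and MSE values are invented for illustration, and the t-multipliers 3.182 and 2.353 are the standard table values t(0.025, 3) and t(0.05, 3) for a sample of size 5.

```python
import math

def slope_interval_width(t_multiplier, mse, x_values):
    """Width of the t-interval for the slope: 2 * t * sqrt(MSE) / sqrt(Sxx)."""
    x_bar = sum(x_values) / len(x_values)
    sxx = sum((x - x_bar) ** 2 for x in x_values)  # spread of the predictor
    return 2 * t_multiplier * math.sqrt(mse) / math.sqrt(sxx)

# Hypothetical predictor values: same sample size, different spread.
clustered = [9.0, 9.5, 10.0, 10.5, 11.0]
spread_out = [0.0, 5.0, 10.0, 15.0, 20.0]

w_clustered = slope_interval_width(3.182, mse=4.0, x_values=clustered)
w_spread = slope_interval_width(3.182, mse=4.0, x_values=spread_out)
w_smaller_mse = slope_interval_width(3.182, mse=1.0, x_values=clustered)
w_90_percent = slope_interval_width(2.353, mse=4.0, x_values=clustered)

print(f"baseline (95%, clustered x): {w_clustered:.2f}")
print(f"more spread-out x values:    {w_spread:.2f}")
print(f"smaller MSE:                 {w_smaller_mse:.2f}")
print(f"90% instead of 95%:          {w_90_percent:.2f}")
```

Each change shrinks the width relative to the baseline, although in practice only the spread of the x values and the sample size are usually under the investigator's control.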

Six possible outcomes concerning slope β1

There are six possible outcomes whenever we test whether there is a linear relationship between the predictor x and the response y, that is, whenever we test the null hypothesis H0 : β1 = 0 against the alternative hypothesis HA : β1 ≠ 0.

When we don't reject the null hypothesis H0 : β1 = 0, any of the following three realities are possible:

    1. We committed a Type II error. That is, in reality β1 ≠ 0 and our sample data just didn't provide enough evidence to conclude that β1 ≠ 0.
    2. There really is not much of a linear relationship between x and y.
    3. There is a relationship between x and y — it is just not linear.

When we do reject the null hypothesis H0 : β1 = 0 in favor of the alternative hypothesis HA : β1 ≠ 0, any of the following three realities are possible:

    1. We committed a Type I error. That is, in reality β1 = 0, but we have an unusual sample that suggests that β1 ≠ 0.
    2. The relationship between x and y is indeed linear.
    3. A linear function fits the data okay, but a curved ("curvilinear") function would fit the data even better.
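The first outcome in each list above can be illustrated by simulation. The sketch below builds a hypothetical population of 1000 points, repeatedly draws samples of size 10, and records how often H0 : β1 = 0 is rejected at the α = 0.05 level. The x range (0 to 20), the error standard deviation (10), the seed, and the number of repetitions are all assumptions chosen for illustration; they are not taken from any course data file. The cutoff 2.306 is the standard table value t(0.025, 8).

```python
import math
import random

T_CRITICAL_DF8 = 2.306  # t(0.025, 8): two-tailed 5% cutoff when n = 10

def slope_t_statistic(xs, ys):
    """t statistic for testing H0: beta1 = 0 in simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    mse = sse / (n - 2)
    return b1 / math.sqrt(mse / sxx)

def rejection_rate(true_slope, n_samples=500, sample_size=10, seed=1):
    """Fraction of random samples in which H0: beta1 = 0 is rejected."""
    rng = random.Random(seed)
    # Hypothetical population: y = 5 + true_slope * x + normal error.
    population = []
    for _ in range(1000):
        x = rng.uniform(0, 20)
        population.append((x, 5 + true_slope * x + rng.gauss(0, 10)))
    rejections = 0
    for _ in range(n_samples):
        sample = rng.sample(population, sample_size)  # without replacement
        xs = [point[0] for point in sample]
        ys = [point[1] for point in sample]
        if abs(slope_t_statistic(xs, ys)) > T_CRITICAL_DF8:
            rejections += 1
    return rejections / n_samples

# Failing to reject when the true slope is 0.5 is a Type II error.
type_ii_rate = 1 - rejection_rate(true_slope=0.5)
# Rejecting when the true slope is 0 is a Type I error.
type_i_rate = rejection_rate(true_slope=0.0)

print(f"Estimated Type II error rate (slope 0.5): {type_ii_rate:.2f}")
print(f"Estimated Type I error rate (slope 0):    {type_i_rate:.2f}")
```

With this much error and only 10 points per sample, most samples fail to detect the true slope of 0.5, while the Type I error rate stays near the chosen α.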

PRACTICE PROBLEMS: Inference for β1

The six exercises in this section are designed to illustrate the six possible outcomes that can happen whenever you conduct a hypothesis test about the slope parameter β1.

Directions. Type up your answers to each of the following questions in a Word file named practice01_yourPSUid.doc. Once you have completed all of the practice problems in this lesson, upload the file to the "Lesson #1 Practice Problems" dropbox.


1.5. A Type II error? In reality, when we analyze "real-world data," we can never really know for sure whether or not we committed a Type II error. What we can do in a lab setting though is to create a population of data in which we know there is a linear trend between x and y. If we take a small sample from that population, our sample just might not provide enough evidence to conclude β1 ≠ 0. I’ve created a "population" of 1000 (x, y) data points in typeII.txt in which I know there is a linear trend between x and y, as described by y = 5 + 0.5x, but there is also (lots of) error. Each student is asked to sample from this population and to perform a regression analysis to see if we can get at least one student who commits a Type II error upon testing H0 : β1 = 0.

  1. Create a scatter plot of the whole "population" of 1000 (x, y) data points to convince yourself the linear trend between x and y can be described (roughly) by y = 5+0.5x, but there is also (lots of) error. (See Minitab Help: Creating a basic scatter plot.)
  2. Randomly sample 10 (x, y) data points from the population, and store the sample in two unused columns in your worksheet. (See Minitab Help: Randomly sampling (x,y) data with replacement from two columns.)
  3. Create a fitted line plot so you can get a visual feel for whether or not your sample suggests a linear trend between x and y. (See Minitab Help: Creating a fitted line plot.)
  4. Conduct a standard regression analysis in Minitab, so that you can test H0: β1 = 0 versus HA: β1 ≠ 0 at the α = 0.05 level. (See Minitab Help: Performing a basic regression analysis.)
  5. Since we know that β1 = 0.5 for this population, you committed a Type II error if you failed to reject the null hypothesis, right? Did you commit a Type II error?

1.6. Really not much of a linear relationship between x and y. A random sample of 35 students in an introductory undergraduate statistics class was selected. The height (height, in inches) of each selected student was measured, and the student’s overall grade point average (gpa) recorded. The data set heightgpa.txt contains the resulting data. Is there a linear relationship between height (x) and gpa (y)?

  1. Create a fitted line plot so you can get a visual feel for whether or not the sample suggests a linear trend between height and gpa. (See Minitab Help: Creating a fitted line plot.)
  2. Conduct a standard regression analysis in Minitab, so that you can test H0 : β1 = 0 versus HA : β1 ≠ 0. (See Minitab Help: Performing a basic regression analysis.)
  3. Based on your plot and the results of your hypothesis test, do you feel comfortable concluding that there is no relationship between height and gpa? Briefly explain.

1.7. A relationship between x and y, it's just not linear. Margolin (1988) reports data from an experiment in which a drop of urine at a certain concentration is dropped on a petri dish. Some time later, the number of visible bacterial colonies growing on the plate is counted. The data set urine.txt contains the results of the experiment — conc, the concentration of the urine (mL/petri) and colonies, the number of grown bacterial colonies. Is there a linear relationship between conc (x) and colonies (y)?

  1. Create a fitted line plot so you can get a visual feel for whether or not the sample suggests a linear trend between conc and colonies. (See Minitab Help: Creating a fitted line plot.)
  2. Conduct a standard regression analysis in Minitab, so that you can test H0 : β1 = 0 versus HA : β1 ≠ 0. (See Minitab Help: Performing a basic regression analysis.)
  3. Based on your plot and the results of your hypothesis test, do you feel comfortable concluding that there is no relationship between conc and colonies? Briefly explain.

1.8. A Type I error? As is true for Type II errors, when we analyze "real-world data," we can never really know for sure whether or not we committed a Type I error. What we can do in a lab setting though is to create a population of data in which we know there is no linear trend between x and y. If we take a small sample from that population, our sample might be unusual enough to suggest that β1 ≠ 0. I’ve created a "population" of 1000 (x, y) data points in typeI.txt in which I know there is no linear trend between x and y. Each student is asked to sample from this population and to perform a regression analysis to see if we can get at least one student who commits a Type I error upon testing H0 : β1 = 0.

  1. Create a scatter plot of the whole "population" of 1000 (x, y) data points to convince yourself that there is no linear trend between x and y.
  2. Randomly sample 20 (x, y) data points from the population, and store the sample in two unused columns in your worksheet. (See Minitab Help: Randomly sampling (x,y) data from two columns.)
  3. Create a fitted line plot so you can get a visual feel for whether or not your sample suggests a linear trend between x and y. (See Minitab Help: Creating a fitted line plot.)
  4. Conduct a standard regression analysis in Minitab, so that you can test H0 : β1 = 0 versus HA : β1 ≠ 0. (See Minitab Help: Performing a basic regression analysis.)
    1. Conduct the test at the α = 0.10 level. Since we know that β1 = 0 for this population, you committed a Type I error if you rejected the null, right? Did you commit a Type I error?
    2. Conduct the test again, but now at the α = 0.05 level. Again, since we know that β1 = 0 for this population, you committed a Type I error if you rejected the null hypothesis, right? Did you commit a Type I error this time?

1.9. Relationship between x and y is indeed linear. The oldfaithful.txt data set contains data on 21 consecutive eruptions of Old Faithful geyser in Yellowstone National Park. It is believed that one can predict the time until the next eruption (next), given the length of time of the last eruption (duration). Is there a linear relationship between duration (x) and next (y)?

  1. Create a fitted line plot so you can get a visual feel for whether or not the sample of 21 eruptions suggests a linear trend between next and duration. (See Minitab Help: Creating a fitted line plot.)
  2. Conduct a standard regression analysis in Minitab, so that you can test H0 : β1 = 0 versus HA : β1 ≠ 0 at the α = 0.05 level. (See Minitab Help: Performing a basic regression analysis.)
  3. Use Minitab’s standard regression analysis output to calculate a 95% confidence interval for β1. (See Minitab Help: Finding a t-multiplier.)
    1. Write one sentence that summarizes what the interval tells us.
    2. Do the 95% confidence interval and the two-tailed hypothesis test at the α = 0.05 level yield similar conclusions? Why or why not? Will this always be the case?
    3. What does the 95% confidence interval tell us that the hypothesis test doesn’t?
  4. Based on your plot and the results of your hypothesis test, do you feel comfortable concluding that there is a linear relationship between next and duration?

1.10. Linear function does okay, but curvilinear function might do better. A laboratory tested tires for tread wear by running the following experiment. Tires of a certain brand were mounted on a car. The tires were rotated from position to position every 1000 miles, and the groove depth was measured in mils (0.001 inches) initially and after every 4000 miles. Measurements were made at six equiangular positions on each of the six grooves around the circumference of the tire, and averaged to provide a measure of tread wear. The data set treadwear.txt gives the resulting mileage (in 1000 miles) and groove depth (in mils) of the resulting measurements.

  1. Create a fitted line plot so you can get a visual feel for whether or not the sample of 9 measurements suggests a linear trend between mileage and groove. What type of relationship does your fitted line plot suggest between mileage and groove? (See Minitab Help: Creating a fitted line plot.)
  2. Conduct a standard regression analysis in Minitab, so that you can test H0 : β1 = 0 versus HA : β1 ≠ 0. What does the result of your hypothesis test suggest about the relationship between mileage and groove? (See Minitab Help: Performing a basic regression analysis.)
  3. Based on your plot and the results of your hypothesis test, do you feel comfortable concluding that a linear model is the best way of describing the relationship between mileage and groove? Or might a "curvilinear" (curved) model do better?

(1-α)100% t-interval for intercept parameter β0

Calculating confidence intervals and conducting hypothesis tests for the intercept parameter β0 is not done as often as it is for the slope parameter β1. The reason for this becomes clear upon reviewing the meaning of β0. The intercept parameter β0 is the mean of the responses at x = 0. If x = 0 is meaningless, as it would be, for example, if your predictor variable was height, then β0 is not meaningful. For the sake of completeness, we present the methods here for those situations in which β0 is meaningful.

The formula for the confidence interval for β0, in words, is:

Sample estimate ± (t-multiplier × standard error)

and, in notation, is:

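$$b_0 \pm t_{(\alpha/2,\, n-2)} \times \operatorname{se}(b_0), \qquad \operatorname{se}(b_0) = \sqrt{MSE} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}$$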

The resulting confidence interval gives us a range of values that is likely to contain the true unknown value β0. The factors affecting the length of a confidence interval for β0 are identical to the factors affecting the length of a confidence interval for β1.
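To make the standard error formulas concrete, the sketch below computes b0, b1, se(b0), and se(b1) from raw data. As a deliberate simplification it uses only the six states displayed in the data table earlier in this lesson, so its results will not match the Minitab output, which is based on all 49 states. The function name `fit_simple_regression` is hypothetical.

```python
import math

# The six rows displayed earlier in this lesson (not the full 49-state data set).
latitude = [33.0, 34.5, 35.0, 37.5, 39.0, 43.0]
mortality = [219, 160, 170, 182, 149, 134]

def fit_simple_regression(xs, ys):
    """Least squares estimates and their standard errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx                      # estimated slope
    b0 = y_bar - b1 * x_bar             # estimated intercept
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    mse = sse / (n - 2)                 # mean square error on n - 2 df
    se_b1 = math.sqrt(mse / sxx)
    se_b0 = math.sqrt(mse * (1 / n + x_bar ** 2 / sxx))
    return b0, b1, se_b0, se_b1

b0, b1, se_b0, se_b1 = fit_simple_regression(latitude, mortality)
print(f"b0 = {b0:.2f}, se(b0) = {se_b0:.2f}")
print(f"b1 = {b1:.2f}, se(b1) = {se_b1:.2f}")
```

Note how much larger se(b0) is than se(b1) here: the mean latitude is far from 0, so the term involving the squared mean of x dominates the intercept's standard error.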

An α-level hypothesis test for intercept parameter β0

Again, we follow standard hypothesis test procedures. First, we specify the null and alternative hypotheses:

Null hypothesis H0 : β0 = some number β
Alternative hypothesis HA: β0 ≠ some number β

The phrase "some number β" means that you can test whether or not the population intercept takes on any value. By default, Minitab conducts the hypothesis test for testing whether or not β0 is 0. But, the alternative hypothesis can also state that β0 is less than (<) some number β or greater than (>) some number β.

Second, we calculate the value of the test statistic using the following formula:

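$$t^* = \frac{b_0 - \beta}{\operatorname{se}(b_0)} = \frac{b_0 - \beta}{\sqrt{MSE} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}}$$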

Third, we use the resulting test statistic to calculate the P-value. Again, the P-value is the answer to the question "how likely is it that we’d get a test statistic t* as extreme as we did if the null hypothesis were true?" The P-value is determined by referring to a t-distribution with n-2 degrees of freedom.

Finally, we make a decision. If the P-value is smaller than the significance level α, we reject the null hypothesis in favor of the alternative. If we conduct a "two-tailed, not-equal-to-0" test, we conclude "there is sufficient evidence at the α level to conclude that the mean of the responses is not 0 when x = 0." If the P-value is larger than the significance level α, we fail to reject the null hypothesis.

Drawing conclusions about intercept parameter β0 using Minitab

Let's see how we can use Minitab to calculate confidence intervals and conduct hypothesis tests for the intercept β0. Minitab's regression analysis output for our skin cancer mortality and latitude example appears below. The work involved is very similar to that for the slope β1.

The line pertaining to the intercept, which Minitab always refers to as Constant, in the summary table of predictors has been bolded. It tells us that the estimated intercept coefficient b0, under the column labeled Coef, is 389.19. The estimated standard error of b0, denoted se(b0), in the column labeled SE Coef is 23.81.

[Figure: Minitab regression output with the Constant row bolded]

By default, the test statistic is calculated assuming the user wants to test that the mean response is 0 when x = 0. Note that this is an ill-advised test here, because the predictor values in the sample do not include a latitude of 0. That is, such a test involves extrapolating outside the scope of the model. Nonetheless, for the sake of illustration, let's proceed assuming that it is an okay thing to do.

Dividing the estimated coefficient 389.19 by the estimated standard error 23.81, Minitab reports that the test statistic T is 16.34. (The rounded values shown here give a ratio of about 16.35; the small difference presumably reflects Minitab's use of unrounded values.) By default, the P-value is calculated assuming the alternative hypothesis is a "two-tailed, not-equal-to-0" hypothesis. Upon calculating the probability that a t random variable with n-2 = 47 degrees of freedom would be larger than 16.34, and multiplying the probability by 2, Minitab reports that P is 0.000 (to three decimal places). That is, the P-value is less than 0.001.

Because the P-value is so small (less than 0.001), we can reject the null hypothesis and conclude that β0 does not equal 0. There is sufficient evidence, at the α = 0.05 level, to conclude that the mean mortality rate at a latitude of 0 degrees North is not 0. (Again, note that we extrapolated in order to arrive at this conclusion.)

Minitab does not calculate a 95% confidence interval for β0 either. Proceed similarly. Use Minitab to find the t-multiplier for you. Again, it is t(0.025, 47) = 2.0117. Then, the 95% confidence interval for β0 is 389.19 ± 2.0117(23.81) = (341.3, 437.1). We can be 95% confident that the population intercept is between 341.3 and 437.1. That is, we can be 95% confident that the mean mortality rate at a latitude of 0 degrees North is between 341.3 and 437.1 deaths per 10 million people. (Again, it is probably not a good idea to make this claim because of the severe extrapolation involved.)
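Written out step by step, the arithmetic behind this interval is as follows, using only the values quoted above and rounding intermediate results to two decimal places:

```latex
\begin{aligned}
t_{(0.025,\,47)} \times \operatorname{se}(b_0) &= 2.0117 \times 23.81 \approx 47.90 \\
b_0 - 47.90 &= 389.19 - 47.90 = 341.29 \approx 341.3 \\
b_0 + 47.90 &= 389.19 + 47.90 = 437.09 \approx 437.1
\end{aligned}
```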

What conditions?

We've made no mention of the conditions that must be true in order for it to be okay to use the above confidence interval formulas and hypothesis testing procedures for β0 and β1. In short, the "LINE" assumptions — linearity, independence, normality and equal variance — must hold. It is not a big deal if the error terms (and thus responses) are only approximately normal. If you have a large sample, then the error terms can even deviate far from normality.


© 2004 The Pennsylvania State University. All rights reserved.
Materials developed by Dr. Laura J. Simon (Lecturer, Penn State Department of Statistics).