Lesson #1: Simple Linear Regression
Drawing conclusions about β_{0} and β_{1}
Recall that we are ultimately always interested in drawing conclusions about the population, not the particular sample we observed. In the simple regression setting, we are often interested in learning about the population intercept β_{0} and the population slope β_{1}. As you know, confidence intervals and hypothesis tests are two related, but different, ways of learning about the values of population parameters. Here, we will learn how to calculate confidence intervals and conduct hypothesis tests for both β_{0} and β_{1}.
Let's revisit the example concerning the relationship between skin cancer mortality and state latitude. The response variable y is the mortality rate (number of deaths per 10 million people) of white males due to malignant skin melanoma from 1950 to 1959. The predictor variable x is the latitude (degrees North) at the center of each of 49 states in the United States. A subset of the data looks like:
#     State         Latitude    Mortality
1     Alabama       33.0        219
2     Arizona       34.5        160
3     Arkansas      35.0        170
4     California    37.5        182
5     Colorado      39.0        149
...   ...           ...         ...
49    Wyoming       43.0        134
and a plot of the data with the estimated regression equation looks like:

[Scatterplot of mortality (y) versus latitude (x), with the estimated regression line superimposed.]
Is there a relationship between state latitude and skin cancer mortality? Certainly, since the estimated slope of the line, b_{1}, is −5.98, not 0, there is a relationship between state latitude and skin cancer mortality in the sample of 49 data points. But we want to know whether there is a relationship between the population of all of the latitudes and skin cancer mortality rates. That is, we want to know whether the population slope β_{1} differs from 0.
(1 − α)100% t-interval for slope parameter β_{1}
The formula for the confidence interval for β_{1}, in words, is:
Sample estimate ± (t-multiplier × standard error)

and, in notation, is:

b_{1} ± t_{(α/2, n−2)} × se(b_{1})

where se(b_{1}) = sqrt(MSE / Σ(x_{i} − x̄)²).
The resulting confidence interval not only gives us a range of values that is likely to contain the true unknown value β_{1}. It also allows us to answer the research question "is the predictor x related to the response y?" If the confidence interval for β_{1} contains 0, then we conclude that there is no evidence of a relationship between the predictor x and the response y in the population. On the other hand, if the confidence interval for β_{1} does not contain 0, then we conclude that there is evidence of a relationship between the predictor x and the response y in the population.
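As a sketch of how such an interval can be computed outside Minitab, the Python snippet below uses SciPy; the data values are hypothetical, not the full mortality data set:

```python
# Illustrative sketch (not Minitab): a 95% t-interval for the slope b1
# from a fitted simple linear regression.
import numpy as np
from scipy import stats

# Hypothetical (x, y) data, for illustration only
x = np.array([33.0, 34.5, 35.0, 37.5, 39.0, 43.0])
y = np.array([219.0, 160.0, 170.0, 182.0, 149.0, 134.0])

res = stats.linregress(x, y)           # b1 = res.slope, se(b1) = res.stderr
n = len(x)
tmult = stats.t.ppf(0.975, df=n - 2)   # t-multiplier for a 95% interval

lower = res.slope - tmult * res.stderr
upper = res.slope + tmult * res.stderr
# If the interval (lower, upper) excludes 0, there is evidence of a
# relationship between x and y in the population.
```

If the computed interval excludes 0, we reach the same conclusion as the hypothesis test below.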
An α-level hypothesis test for slope parameter β_{1}
We follow standard hypothesis test procedures in conducting a hypothesis test for the slope β_{1}. First, we specify the null and alternative hypotheses:
Null hypothesis: H_{0} : β_{1} = some number β
Alternative hypothesis: H_{A} : β_{1} ≠ some number β
The phrase "some number β" means that you can test whether or not the population slope takes on any value. Most often, however, we are interested in testing whether β_{1} is 0. By default, Minitab conducts the hypothesis test for testing whether or not β_{1} is 0. But, the alternative hypothesis can also state that β_{1} is less than (<) some number β or greater than (>) some number β.
Second, we calculate the value of the test statistic using the following formula:

t* = (b_{1} − β) / se(b_{1})
Third, we use the resulting test statistic to calculate the P-value. As always, the P-value is the answer to the question "how likely is it that we'd get a test statistic t* as extreme as we did if the null hypothesis were true?" The P-value is determined by referring to a t-distribution with n − 2 degrees of freedom.
Finally, we make a decision:
- If the P-value is smaller than the significance level α, we reject the null hypothesis in favor of the alternative. We conclude "there is sufficient evidence at the α level to conclude that there is a relationship in the population between the predictor x and response y."
- If the P-value is larger than the significance level α, we fail to reject the null hypothesis. We conclude "there is not enough evidence at the α level to conclude that there is a relationship in the population between the predictor x and response y."
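The steps above can be sketched in Python; the helper below is illustrative, plugging in the estimated slope and standard error from the mortality example:

```python
# Illustrative sketch (not Minitab): the t-test of H0: beta1 = beta
# against HA: beta1 != beta.
from scipy import stats

def slope_t_test(b1, se_b1, n, beta_null=0.0):
    """Return the test statistic t* = (b1 - beta) / se(b1) and the
    two-tailed P-value from a t-distribution with n - 2 df."""
    t_star = (b1 - beta_null) / se_b1
    p_value = 2 * stats.t.sf(abs(t_star), df=n - 2)
    return t_star, p_value

# Values from the skin cancer mortality example: b1 = -5.9776,
# se(b1) = 0.5984, n = 49. Reject H0 at level alpha if p < alpha.
t_star, p = slope_t_test(b1=-5.9776, se_b1=0.5984, n=49)
```

With these inputs, the computed t* and P-value agree with the Minitab output discussed in the next section.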
Drawing conclusions about slope parameter β_{1} using Minitab
Let's see how we can use Minitab to calculate confidence intervals and conduct hypothesis tests for the slope β_{1}. Minitab's regression analysis output for our skin cancer mortality and latitude example appears below.
The line pertaining to the latitude predictor, Lat, in the summary table of predictors has been bolded. It tells us that the estimated slope coefficient b_{1}, under the column labeled Coef, is −5.9776. The estimated standard error of b_{1}, denoted se(b_{1}), in the column labeled SE Coef (for "standard error of the coefficient"), is 0.5984.
By default, the test statistic is calculated assuming the user wants to test that the slope is 0. Dividing the estimated coefficient −5.9776 by the estimated standard error 0.5984, Minitab reports that the test statistic T is −9.99.
By default, the P-value is calculated assuming the alternative hypothesis is a "two-tailed, not-equal-to" hypothesis. Upon calculating the probability that a t random variable with n − 2 = 47 degrees of freedom is more extreme than −9.99 (in either direction), Minitab reports that the P-value is 0.000, that is, less than 0.001.

Because the P-value is so small (less than 0.001), we can reject the null hypothesis and conclude that β_{1} does not equal 0. There is sufficient evidence, at the α = 0.05 level, to conclude that there is a relationship in the population between skin cancer mortality and latitude.
Minitab Note. The P-value in Minitab's regression analysis output is always calculated assuming the alternative hypothesis is the two-tailed β_{1} ≠ 0. If your alternative hypothesis is one-tailed (β_{1} < some number β, or β_{1} > some number β), divide the reported two-tailed P-value by 2 to obtain the appropriate one-tailed P-value.
Unfortunately, Minitab does not calculate a 95% confidence interval for β_{1} for you. It's easy to calculate one, though, using the information in the output. You just need to use Minitab to find the t-multiplier for you. It is t_{(0.025, 47)} = 2.0117. Then, the 95% confidence interval for β_{1} is −5.9776 ± 2.0117(0.5984), or (−7.2, −4.8).
We can be 95% confident that the population slope is between −7.2 and −4.8. That is, we can be 95% confident that, for every additional one-degree increase in latitude, the mean skin cancer mortality rate decreases by between 4.8 and 7.2 deaths per 10 million people.
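This interval can be reproduced directly from the reported values b_{1} = −5.9776, se(b_{1}) = 0.5984, and n = 49; the Python snippet below is an illustrative check:

```python
# A quick check of the 95% confidence interval for beta1, using the
# values reported in the Minitab output for the mortality example.
from scipy import stats

tmult = stats.t.ppf(0.975, df=49 - 2)   # t-multiplier, about 2.0117
lower = -5.9776 - tmult * 0.5984
upper = -5.9776 + tmult * 0.5984
# (lower, upper) is approximately (-7.18, -4.77), i.e. roughly (-7.2, -4.8)
```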
Factors affecting the width of a confidence interval for β_{1}
Recall that, in general, we want our confidence intervals to be as narrow as possible. If we know what factors affect the length of a confidence interval for the slope β_{1}, we can control them to ensure that we obtain a narrow interval. The factors can be easily determined by studying the formula for the confidence interval:

b_{1} ± t_{(α/2, n−2)} × sqrt(MSE / Σ(x_{i} − x̄)²)

First, subtracting the lower endpoint of the interval from the upper endpoint, we determine that the width of the interval is:

2 × t_{(α/2, n−2)} × sqrt(MSE / Σ(x_{i} − x̄)²)
So, how can we affect the width of our resulting interval for β_{1}?
- As the confidence level decreases, the width of the interval decreases. As the confidence level decreases, the t-multiplier decreases. Therefore, if we decrease our confidence level, we decrease the width of our interval. Clearly, we don't want to decrease the confidence level too much; typically, confidence levels are never set below 90%.
- As MSE decreases, the width of the interval decreases. The value of MSE depends on only two factors: how much the responses vary naturally around the estimated regression line, and how well your regression function (line) fits the data. You can't control the first factor all that much, other than to ensure that you are not adding any unnecessary error in your measurement process. Throughout this course, we'll learn ways to make sure that the regression function fits the data as well as it can.
- The more spread out the predictor x values, the narrower the interval. The quantity Σ(x_{i} − x̄)² in the denominator summarizes the spread of the predictor x values. The more spread out the predictor values, the larger the denominator, and hence the narrower the interval. Therefore, we can decrease the width of our interval by ensuring that our predictor values are sufficiently spread out.
- As the sample size increases, the width of the interval decreases. The sample size plays a role in two ways. First, recall that the t-multiplier depends on the sample size through the degrees of freedom n − 2. As the sample size increases, the t-multiplier decreases, and hence the length of the interval decreases. Second, the denominator Σ(x_{i} − x̄)² also depends on n: the larger the sample size, the more terms you add to this sum, the larger the denominator, and the narrower the interval. Therefore, in general, you can ensure that your interval is narrow by having a large enough sample.
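These factors can be seen numerically. The sketch below, with made-up design points, computes the interval width from the formula 2 × t-multiplier × sqrt(MSE / Σ(x_{i} − x̄)²):

```python
# Illustrative demo of the factors affecting the width of the
# confidence interval for the slope beta1.
import numpy as np
from scipy import stats

def slope_ci_width(x, mse, conf=0.95):
    """Width of the interval: 2 * t-multiplier * sqrt(MSE / Sxx),
    where Sxx = sum((x - xbar)^2)."""
    n = len(x)
    tmult = stats.t.ppf(1 - (1 - conf) / 2, df=n - 2)
    se_b1 = np.sqrt(mse / np.sum((x - np.mean(x)) ** 2))
    return 2 * tmult * se_b1

# Two hypothetical designs with the same sample size and same MSE:
x_narrow = np.array([30.0, 31.0, 32.0, 33.0, 34.0])   # bunched-up x values
x_spread = np.array([25.0, 30.0, 35.0, 40.0, 45.0])   # spread-out x values

# More spread in x gives a narrower interval ...
assert slope_ci_width(x_spread, mse=50) < slope_ci_width(x_narrow, mse=50)
# ... and a higher confidence level gives a wider interval.
assert slope_ci_width(x_narrow, 50, conf=0.99) > slope_ci_width(x_narrow, 50, conf=0.95)
```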
Six possible outcomes concerning slope β_{1}
There are six possible outcomes whenever we test whether there is a linear relationship between the predictor x and the response y, that is, whenever we test the null hypothesis H_{0} : β_{1} = 0 against the alternative hypothesis H_{A} : β_{1} ≠ 0.
When we don't reject the null hypothesis H_{0} : β_{1} = 0, any of the following three realities are possible:
- We committed a Type II error. That is, in reality β_{1} ≠ 0 and our sample data just didn't provide enough evidence to conclude that β_{1} ≠ 0.
- There really is not much of a linear relationship between x and y.
- There is a relationship between x and y; it is just not linear.
When we do reject the null hypothesis H_{0} : β_{1} = 0 in favor of the alternative hypothesis H_{A} : β_{1} ≠ 0, any of the following three realities are possible:
- We committed a Type I error. That is, in reality β_{1} = 0, but we have an unusual sample that suggests that β_{1} ≠ 0.
- The relationship between x and y is indeed linear.
- A linear function fits the data okay, but a curved ("curvilinear") function would fit the data even better.
PRACTICE PROBLEMS: Inference for β_{1}

The six exercises in this section are designed to illustrate the six possible outcomes that can happen whenever you conduct a hypothesis test about the slope parameter β_{1}.

Directions. Type up your answers to each of the following questions in a Word file named practice01_yourPSUid.doc. Once you have completed all of the practice problems in this lesson, upload the file to the "Lesson #1 Practice Problems" dropbox.
1.5. A Type II error? In reality, when we analyze "real-world data," we can never really know for sure whether or not we committed a Type II error. What we can do in a lab setting, though, is to create a population of data in which we know there is a linear trend between x and y. If we take a small sample from that population, our sample just might not provide enough evidence to conclude β_{1} ≠ 0. I've created a "population" of 1000 (x, y) data points in typeII.txt in which I know there is a linear trend between x and y, as described by y = 5 + 0.5x, but there is also (lots of) error. Each student is asked to sample from this population and to perform a regression analysis to see if we can get at least one student who commits a Type II error upon testing H_{0} : β_{1} = 0.

1.6. Really not much of a linear relationship between x and y. A random sample of 35 students in an introductory undergraduate statistics class was selected. The height (height, in inches) of each selected student was measured, and the student's overall grade point average (gpa) recorded. The data set heightgpa.txt contains the resulting data. Is there a linear relationship between height (x) and gpa (y)?

1.7. A relationship between x and y, it's just not linear. Margolin (1988) reports data from an experiment in which a drop of urine at a certain concentration is dropped on a petri dish. Some time later, the number of visible bacterial colonies growing on the plate is counted. The data set urine.txt contains the results of the experiment — conc, the concentration of the urine (mL/petri) and colonies, the number of grown bacterial colonies. Is there a linear relationship between conc (x) and colonies (y)?

1.8. A Type I error? As is true for Type II errors, when we analyze "real-world data," we can never really know for sure whether or not we committed a Type I error. What we can do in a lab setting, though, is to create a population of data in which we know there is no linear trend between x and y. If we take a small sample from that population, our sample might be unusual enough to suggest that β_{1} ≠ 0. I've created a "population" of 1000 (x, y) data points in typeI.txt in which I know there is no linear trend between x and y. Each student is asked to sample from this population and to perform a regression analysis to see if we can get at least one student who commits a Type I error upon testing H_{0} : β_{1} = 0.

1.9. Relationship between x and y is indeed linear. The oldfaithful.txt data set contains data on 21 consecutive eruptions of Old Faithful geyser in Yellowstone National Park. It is believed that one can predict the time until the next eruption (next), given the length of time of the last eruption (duration). Is there a linear relationship between duration (x) and next (y)?

1.10. Linear function does okay, but curvilinear function might do better. A laboratory tested tires for tread wear by running the following experiment. Tires of a certain brand were mounted on a car. The tires were rotated from position to position every 1000 miles, and the groove depth was measured in mils (0.001 inches) initially and after every 4000 miles. Measurements were made at six equiangular positions on each of the six grooves around the circumference of the tire, and averaged to provide a measure of tread wear. The data set treadwear.txt gives the resulting mileage (in 1000 miles) and groove depth (in mils) of the resulting measurements.

(1 − α)100% t-interval for intercept parameter β_{0}
Calculating confidence intervals and conducting hypothesis tests for the intercept parameter β_{0} is not done as often as it is for the slope parameter β_{1}. The reason for this becomes clear upon reviewing the meaning of β_{0}. The intercept parameter β_{0} is the mean of the responses at x = 0. If x = 0 is meaningless, as it would be, for example, if your predictor variable was height, then β_{0} is not meaningful. For the sake of completeness, we present the methods here for those situations in which β_{0} is meaningful.
The formula for the confidence interval for β_{0}, in words, is:
Sample estimate ± (t-multiplier × standard error)

and, in notation, is:

b_{0} ± t_{(α/2, n−2)} × se(b_{0})
The resulting confidence interval gives us a range of values that is likely to contain the true unknown value β_{0}. The factors affecting the length of a confidence interval for β_{0} are identical to the factors affecting the length of a confidence interval for β_{1}.
An αlevel hypothesis test for intercept parameter β_{0}
Again, we follow standard hypothesis test procedures. First, we specify the null and alternative hypotheses:
Null hypothesis: H_{0} : β_{0} = some number β
Alternative hypothesis: H_{A} : β_{0} ≠ some number β
The phrase "some number β" means that you can test whether or not the population intercept takes on any value. By default, Minitab conducts the hypothesis test for testing whether or not β_{0} is 0. But, the alternative hypothesis can also state that β_{0} is less than (<) some number β or greater than (>) some number β.
Second, we calculate the value of the test statistic using the following formula:

t* = (b_{0} − β) / se(b_{0})
Third, we use the resulting test statistic to calculate the P-value. Again, the P-value is the answer to the question "how likely is it that we'd get a test statistic t* as extreme as we did if the null hypothesis were true?" The P-value is determined by referring to a t-distribution with n − 2 degrees of freedom.
Finally, we make a decision. If the P-value is smaller than the significance level α, we reject the null hypothesis in favor of the alternative. If we conduct a "two-tailed, not-equal-to-0" test, we conclude "there is sufficient evidence at the α level to conclude that the mean of the responses is not 0 when x = 0." If the P-value is larger than the significance level α, we fail to reject the null hypothesis.
Drawing conclusions about intercept parameter β_{0} using Minitab
Let's see how we can use Minitab to calculate confidence intervals and conduct hypothesis tests for the intercept β_{0}. Minitab's regression analysis output for our skin cancer mortality and latitude example appears below. The work involved is very similar to that for the slope β_{1}.
The line pertaining to the intercept, which Minitab always refers to as Constant, in the summary table of predictors has been bolded. It tells us that the estimated intercept coefficient b_{0}, under the column labeled Coef, is 389.19. The estimated standard error of b_{0}, denoted se(b_{0}), in the column labeled SE Coef, is 23.81.
By default, the test statistic is calculated assuming the user wants to test that the mean response is 0 when x = 0. Note that this is an ill-advised test here, because the predictor values in the sample do not include a latitude of 0. That is, such a test involves extrapolating outside the scope of the model. Nonetheless, for the sake of illustration, let's proceed assuming that it is an okay thing to do.
Dividing the estimated coefficient 389.19 by the estimated standard error 23.81, Minitab reports that the test statistic T is 16.34. By default, the P-value is calculated assuming the alternative hypothesis is a "two-tailed, not-equal-to-0" hypothesis. Upon calculating the probability that a t random variable with n − 2 = 47 degrees of freedom is more extreme than 16.34 (in either direction), Minitab reports that the P-value is 0.000, that is, less than 0.001.

Because the P-value is so small (less than 0.001), we can reject the null hypothesis and conclude that β_{0} does not equal 0 when x = 0. There is sufficient evidence, at the α = 0.05 level, to conclude that the mean of the responses is not 0 when x = 0.
Minitab does not calculate a 95% confidence interval for β_{0} either. Proceed similarly. Use Minitab to find the t-multiplier for you. Again, it is t_{(0.025, 47)} = 2.0117. Then, the 95% confidence interval for β_{0} is 389.19 ± 2.0117(23.81), or (341.3, 437.1).
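As an illustrative check of this arithmetic, the snippet below reproduces the intercept's test statistic and confidence interval from the reported values b_{0} = 389.19, se(b_{0}) = 23.81, and n = 49:

```python
# A quick check of Minitab's arithmetic for the intercept beta0
# in the mortality example.
from scipy import stats

t_star = 389.19 / 23.81                  # test statistic, about 16.34
tmult = stats.t.ppf(0.975, df=49 - 2)    # t-multiplier, about 2.0117
lower = 389.19 - tmult * 23.81
upper = 389.19 + tmult * 23.81
# 95% interval for beta0: approximately (341.3, 437.1)
```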
What conditions?
We've made no mention of the conditions that must be true in order for it to be okay to use the above confidence interval formulas and hypothesis testing procedures for β_{0} and β_{1}. In short, the "LINE" assumptions — linearity, independence, normality, and equal variance — must hold. It is not a big deal if the error terms (and thus the responses) are only approximately normal. If you have a large sample, then the error terms can even deviate far from normality.