Introduction
Let's get started! Here is what you will learn in this lesson.
Learning objectives for this lesson
Upon completion of this lesson, you should be able to do the following:
 Understand the relationship between the slope of the regression line and correlation,
 Comprehend the meaning of the Coefficient of Determination, R^{2},
 Know how to determine which variable is the response and which is the explanatory variable in a regression equation,
 Understand that correlation measures the strength of a linear relationship between two variables,
 Realize how outliers can influence a regression equation, and
 Determine if variables are categorical or quantitative.
Examining Relationships Between Two Variables
Previously we considered the distribution of a single quantitative variable. Now we will study the relationship between two variables, where each variable may be either qualitative (i.e. categorical) or quantitative. When we consider the relationship between two variables, there are three possibilities:
 Both variables are categorical. We analyze an association through a comparison of conditional probabilities and display the data in two-way (contingency) tables. Examples of categorical variables are gender and class standing.
 Both variables are quantitative. To analyze this situation we consider how one variable, called a response variable, changes in relation to changes in the other variable called an explanatory variable. Graphically we use scatterplots to display two quantitative variables. Examples are age, height, weight (i.e. things that are measured).
 One variable is categorical and the other is quantitative, for instance height and gender. These are best compared by using side-by-side boxplots to display any differences or similarities in the center and variability of the quantitative variable (e.g. height) across the categories (e.g. Male and Female).
Comparing Two Categorical Variables
Some categorical variables exist naturally (e.g. a person’s race, political party affiliation, or class standing), while others are created by grouping a quantitative variable (e.g. taking height and creating the groups Short, Medium, and Tall). We analyze categorical data by recording counts or percents of cases occurring in each category. Although you can compare several categorical variables, we are only going to consider the relationship between two such variables.
Example
The Class Survey data set (CLASS_SURVEY.MTW or CLASS_SURVEY.XLS) consists of student responses to a survey given last semester in a Stat200 course. We can construct a two-way table showing the relationship between Smoke Cigarettes (row variable) and Gender (column variable) using either Minitab or SPSS.
To create a two-way table in Minitab:
 Open the Class Survey data set.
 From the menu bar select Stat > Tables > Cross Tabulation and Chi-Square
 In the text box For Rows enter the variable Smoke Cigarettes and in the text box For Columns enter the variable Gender
 Under Display be sure the box is checked for Counts (should be already checked as this is the default display in Minitab).
 Click OK
To create a two-way table in SPSS:
 Import the data set
 From the menu bar select Analyze > Descriptive Statistics > Crosstabs
 Click on variable Smoke Cigarettes and enter this in the Rows box.
 Click on variable Gender and enter this in the Columns box.
 Click OK
This should result in the following two-way table:
The marginal distribution along the bottom (the bottom row All) gives the distribution of Gender alone (disregarding Smoke Cigarettes). The marginal distribution on the right (the values under the column All) is for Smoke Cigarettes alone (disregarding Gender). Since more females (127) than males (99) participated in the survey, we should report percentages instead of counts in order to compare the cigarette smoking behavior of females and males. This gives the conditional distribution of Smoke Cigarettes given Gender, suggesting we are treating gender as an explanatory variable (i.e. a variable used to explain what is happening with another variable). These conditional percentages are calculated by taking the count for each level of Smoke Cigarettes (No, Yes) within each level of Gender (Female, Male) and dividing by that gender's total. For example, the conditional percentage of No given Female is found by 120/127 = 94.5%.
We can calculate these conditional (column) percents using either Minitab or SPSS.
To calculate the column percents in Minitab:
 Open the Class Survey data set.
 From the menu bar select Stat > Tables > Cross Tabulation and Chi-Square
 In the text box For Rows enter the variable Smoke Cigarettes and in the text box For Columns enter the variable Gender
 Under Display be sure the box is checked for Counts and also check the box for Column Percents.
 Click OK
To calculate the column percents in SPSS:
 Import the data set
 From the menu bar select Analyze > Descriptive Statistics > Crosstabs
 Click on variable Smoke Cigarettes and enter this in the Rows box.
 Click on variable Gender and enter this in the Columns box.
 Click the tab labeled Cells and select column under Percentages.
 Click Continue
 Click OK
This should result in the following two-way table with column percents:
Although you do not need the counts, having them visible helps show how the conditional probabilities of smoking behavior within gender are calculated. We can see from this display that the 94.49% conditional probability of No Smoking given the Gender is Female is found by dividing the count of No and Female (120) by the number of Females (127). The data under Cell Contents tells you what is displayed in each cell: the top value is Count and the bottom value is Percent of Column. Alternatively, we could compute the conditional probabilities of Gender given Smoking by calculating the Row Percents; for example, 120 divided by 209 gives 57.42%. This would be interpreted as follows: of those who say they do not smoke, 57.42% are female, meaning that 42.58% are male (found by 100% − 57.42%).
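As a sketch outside the lesson's Minitab/SPSS workflow, the same table and conditional percents can be reconstructed in Python with pandas. The female counts (120 No, 7 Yes) come from the text; the male counts (89 No, 10 Yes) are inferred from the stated totals (209 No overall, 99 males, 226 respondents).

```python
import pandas as pd

# Two-way table: rows = Smoke Cigarettes, columns = Gender.
# Male counts are inferred from the totals given in the text.
counts = pd.DataFrame({"Female": [120, 7], "Male": [89, 10]},
                      index=["No", "Yes"])

# Column percents: conditional distribution of smoking given gender.
col_pct = counts.div(counts.sum(axis=0), axis=1) * 100
# Row percents: conditional distribution of gender given smoking.
row_pct = counts.div(counts.sum(axis=1), axis=0) * 100

print(round(col_pct.loc["No", "Female"], 2))  # 94.49 = 120/127
print(round(row_pct.loc["No", "Female"], 2))  # 57.42 = 120/209
```

Dividing by the column totals reproduces Minitab's Percent of Column display; dividing by the row totals reproduces the Row Percents discussed above.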
Simpson’s Paradox
Hypothetically, suppose observational studies of sugar intake and hyperactivity have been conducted, first separately for boys and girls, and then with the data combined. The following tables list these hypothetical results:
Notice how the rates for Boys (67%) and Girls (25%) are the same regardless of sugar intake. These percentages are exactly what we would expect if no relationship existed between sugar intake and activity level. However, when the two groups are combined, the hyperactivity rates do differ: 43% for Low Sugar and 59% for High Sugar. This difference appears large enough to suggest that a relationship does exist between sugar intake and activity level. This phenomenon is known as Simpson’s Paradox: an apparent change in a relationship in a two-way table when groups are combined. In this hypothetical example, boys tended to consume more sugar than girls, and also tended to be more hyperactive than girls, which produces the apparent relationship in the combined table. The confounding variable, gender, should be controlled for by studying boys and girls separately rather than ignored by combining them.

By definition, a confounding variable is a variable whose effect on the response cannot be separated from the effect of the explanatory variable under study; analyzing the groups combined therefore gives mixed results compared to analyzing each separately. By contrast, a lurking variable is a variable not included in the study that has the potential to confound. In the previous example, if a researcher analyzed only the combined data and had never measured gender, then gender would be a lurking variable.
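The paradox is easy to reproduce numerically. The counts below are assumptions chosen to roughly match the hypothetical rates in the text (about 67% for boys and 25% for girls at either sugar level, with boys consuming more sugar); they are not taken from the lesson's tables.

```python
# (hyperactive, total) counts per group -- illustrative values only.
counts = {("Boys",  "Low"):  (20, 30), ("Boys",  "High"): (44, 66),
          ("Girls", "Low"):  (10, 40), ("Girls", "High"): (4, 16)}

def rate(*groups):
    """Hyperactivity rate for the pooled groups."""
    hyper = sum(counts[g][0] for g in groups)
    total = sum(counts[g][1] for g in groups)
    return hyper / total

# Within each gender, sugar level makes no difference:
print(round(rate(("Boys", "Low")), 2), round(rate(("Boys", "High")), 2))    # 0.67 0.67
print(round(rate(("Girls", "Low")), 2), round(rate(("Girls", "High")), 2))  # 0.25 0.25

# Combined across genders, an apparent sugar effect emerges:
print(round(rate(("Boys", "Low"), ("Girls", "Low")), 2))    # 0.43
print(round(rate(("Boys", "High"), ("Girls", "High")), 2))  # 0.59
```

Because boys dominate the High Sugar group and girls the Low Sugar group, pooling mixes the two base rates unevenly, manufacturing the apparent association.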
Comparing Two Quantitative Variables
As we did when considering only one variable, we begin with a graphical display. A scatterplot is the most useful display technique for comparing two quantitative variables. We plot the response variable on the y-axis and the explanatory, or predictor, variable on the x-axis.
How do we determine which variable is which? In general, the explanatory variable attempts to explain, or predict, the observed outcome. The response variable measures the outcome of a study. One may even consider exploring whether one variable causes the variation in another variable – for example, a popular research study is that taller people are more likely to receive higher salaries. In this case, Height would be the explanatory variable used to explain the variation in the response variable Salaries.
In summarizing the relationship between two quantitative variables, we need to consider:
 Association/Direction (i.e. positive or negative)
 Form (i.e. linear or nonlinear)
 Strength (weak, moderate, strong)
Example
We will refer to the Exam Data set (Final.MTW or Final.XLS), which consists of a random sample of 50 students who took Stat200 last semester. The data consist of their semester average on mastery quizzes and their score on the final exam. We construct a scatterplot showing the relationship between Quiz Average (explanatory or predictor variable) and Final (response variable). Thus, we are studying whether student performance on the mastery quizzes explains the variation in their final exam score. That is, can mastery quiz performance be considered a predictor of final exam score? We can create this graph using either Minitab or SPSS.
To create a scatterplot in Minitab:
 Open the Exam Data set.
 From the menu bar select Graph > Scatterplot > Simple
 In the text box under Y Variables enter Final and under X Variables enter Quiz Average
 Click OK
To create a scatterplot in SPSS:
 Import the data set
 From the menu bar select Graphs > Legacy Dialogs > Scatter/Dot
 Select the square Simple Scatter and then click Define.
 Click on variable Final and enter this in the Y_Axis box.
 Click the variable Quiz Average and enter this in the X_Axis box.
 Click OK
This should result in the following scatterplot:
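For readers working outside Minitab and SPSS, here is a minimal matplotlib sketch of the same kind of plot. The five (quiz, final) pairs are made up for illustration (only the 84.44/90 pair appears later in the lesson); the real plot uses all 50 students from the Exam Data set.

```python
import matplotlib
matplotlib.use("Agg")            # render without a display
import matplotlib.pyplot as plt

# Illustrative (quiz average, final) pairs -- not the actual Exam Data.
quiz  = [84.44, 61.0, 72.5, 90.0, 55.0]
final = [90, 60, 70, 88, 52]

fig, ax = plt.subplots()
ax.scatter(quiz, final)
ax.set_xlabel("Quiz Average")    # explanatory variable on the x-axis
ax.set_ylabel("Final")           # response variable on the y-axis
fig.savefig("quiz_vs_final.png")
```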
Association/Direction and Form
We can interpret from either graph that there is a positive association between Quiz Average and Final: lower quiz averages are accompanied by lower final scores, and higher quiz averages by higher final scores. If this relationship were reversed (high quiz averages paired with low finals), the graph would have displayed a negative association. That is, the points in the graph would have decreased going from left to right.
The scatterplot can also be used to describe the form of the relationship. From this example we can see that the relationship is linear; that is, the points follow a straight-line pattern with no change in direction.
Strength
In order to measure the strength of a linear relationship between two quantitative variables we use correlation. Correlation is the measure of the strength of a linear relationship. To calculate correlation in Minitab (using the Exam Data):
 From the menu bar select Stat > Basic Statistics > Correlation
 In the window box under Variables enter Final and Quiz Average
 Click OK (for now we will disregard the p-value in the output)
The output gives us a Pearson Correlation of 0.609
Correlation Properties (NOTE: the symbol for correlation is r)
 Correlation is unit free. If we changed the final exam scores from percents to decimals the correlation would remain the same.
 Correlation, r, is limited to −1 ≤ r ≤ 1.
 For a positive association, r > 0; for a negative association r < 0.
 Correlation, r, measures the linear association between two quantitative variables.
 Correlation measures the strength of a linear relationship only. (See the following Scatterplot for display where the correlation is 0 but the two variables are obviously related.)
 The closer r is to 0 the weaker the relationship; the closer to 1 or −1 the stronger the relationship. The sign of the correlation provides direction only.
 Correlation can be affected by outliers.
Equations of Straight Lines: Review
The equation of a straight line is given by y = a + bx. When x = 0, y = a, the intercept of the line; b is the slope of the line: it measures the change in y per unit change in x.
Two examples:
Data 1          Data 2
x     y         x     y
0     3         0    13
1     5         1    11
2     7         2     9
3     9         3     7
4    11         4     5
5    13         5     3
For 'Data 1' the equation is y = 3 + 2x; the intercept is 3 and the slope is 2. The line slopes upward, indicating a positive relationship between x and y.
For 'Data 2' the equation is y = 13 − 2x; the intercept is 13 and the slope is −2. The line slopes downward, indicating a negative relationship between x and y.


The relationship between x and y is 'perfect' for these two examples: the points fall exactly on a straight line, or equivalently, the value of y is determined exactly by the value of x. Our interest will be in relationships between two variables which are not perfect. The correlation between x and y is r = 1.00 for the values on the left (Data 1) and r = −1.00 for the values on the right (Data 2).
Regression analysis is concerned with finding the 'best' fitting line for predicting the average value of a response variable y using a predictor variable x.
APPLET Here is an applet developed by the folks at Rice University called "Regression by Eye". The object here is to give you a chance to draw what you think is the 'best fitting line'.
Click the Begin button and draw your best regression line through the data. You may repeat this procedure several times. As you draw these lines, how do you decide which line is better? Click the Draw Regression line box and the correct regression line is plotted for you. How would you quantify how close your line is to the correct answer? 
Least Squares Regression
The best description of many relationships between two quantitative variables can be achieved using a straight line. In statistics, this line is referred to as a regression line. Historically, this term is associated with Sir Francis Galton who in the mid 1800’s studied the phenomenon that children of tall parents tended to “regress” toward mediocrity.
Adjusting the algebraic line expression, the regression line is written as:

ŷ = b_{0} + b_{1}x

Here, b_{0} is the y-intercept, b_{1} is the slope of the regression line, and ŷ ("y-hat") denotes the predicted value of y.
Some questions to consider are:
 Is there only one “best” line?
 If so, how is this line found?
 Assuming we have properly fitted a line to the data, what does this line tell us?
By answering the third question we should gain insight into the first two questions.
We use the regression line to predict a value of ŷ for any given value of x. The “best” line would make the best predictions: the observed y-values should stray as little as possible from the line. The vertical distances from the observed values to their predicted counterparts on the line are called residuals, and these residuals are referred to as the errors in predicting y. As in any prediction or estimation process, you want these errors to be as small as possible. To accomplish this goal of minimum error, we use the method of least squares: that is, we minimize the sum of the squared residuals. Mathematically, the residuals and the sum of squared residuals appear as follows:
Residuals: e_{i} = y_{i} − ŷ_{i}

Sum of squared residuals: SSE = Σ(y_{i} − ŷ_{i})^{2}
A unique solution is provided through calculus (not shown!), assuring us that there is in fact one best line. The calculus solution results in the following calculations for b_{0} and b_{1}:

b_{1} = r(S_{y}/S_{x}) and b_{0} = ȳ − b_{1}x̄

Another way of looking at the least squares regression line is that when x takes its mean value, y should also take its mean value; that is, the regression line always passes through the point (x̄, ȳ). As to the other expressions in the slope equation, S_{y} refers to the square root of the sum of squared deviations between the observed values of y and the mean of y; similarly, S_{x} refers to the square root of the sum of squared deviations between the observed values of x and the mean of x.
Example: Exam Data set (Final.MTW or Final.XLS)
To perform a regression on the Exam Data we can use either Minitab or SPSS.
To perform the regression in Minitab:
 From the menu bar select Stat > Regression > Regression
 In the window box by Response enter the variable Final
 In the window box by Predictors enter the variable Quiz Average
 Click the Storage button and select Residuals and Fits (you do not have to do this in order to calculate the line in Minitab, but we are doing this here for further explanation)
 Click OK and OK again.
The following are the first five rows of the data in the worksheet, including the stored fits and residuals:
To perform a regression analysis in SPSS:
 Import the data set
 From the menu bar select Analyze > Regression > Linear
 Click on variable Final and enter this in the Dependent box.
 Click the variable Quiz Average and enter this in the Independent box.
 Click OK
This should result in the following regression output:
WOW! This is quite a bit of output. We will take this output apart and you will see that these results are not too complicated. Also, if you hover your mouse over various parts of the output in Minitab, pop-ups will appear with explanations.
The Output
From the output we see:
 Fitted equation is “Final = 12.1 + 0.751 Quiz Average”.
 A value of R-square = 37.0%, which is the coefficient of determination (more on that later). Taking the square root of 0.37 gives 0.608, which (up to rounding) is the correlation value we found previously for this data set.
 The values under “T” and “P”, as well as the data under Analysis of Variance will be discussed in a future lesson.
 For the values under RESI1 and FITS1: the FITS are calculated by substituting the corresponding x-value in each row into the regression equation to obtain the fitted y-value.
 What does the slope of 0.751 tell us? The slope tells us how y changes as x changes. That is, for this example, as x, Quiz Average, increases by one percentage point we would expect, on average, that the Final percentage would increase by 0.751 percentage points, or by approximately three-quarters of a percentage point.
NOTE: Remember that a positive number has both a positive and a negative square root (e.g. both 2 and −2 are square roots of 4). Thus the sign of the correlation matches the sign of the slope.
For example, if we substitute the first Quiz Average of 84.44 into the regression equation we get Final = 12.1 + 0.751*(84.44) ≈ 75.51; Minitab's stored value, 75.5598, is the first value in the FITS column and differs slightly because it is computed from the unrounded coefficients. Using this value, we can compute the first residual under RESI by taking the difference between the observed y and this fitted value ŷ: 90 − 75.5598 = 14.4402. Similar calculations produce the remaining fitted values and residuals.
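The same arithmetic takes only a few lines of Python. Because the printed coefficients 12.1 and 0.751 are rounded, the hand-calculated fit differs slightly from Minitab's stored 75.5598:

```python
b0, b1 = 12.1, 0.751             # rounded coefficients from the output
quiz, final = 84.44, 90          # first student's quiz average and final score

fit = b0 + b1 * quiz             # fitted y-hat; about 75.51 with rounded b's
residual = final - fit           # observed y minus fitted y-hat

print(round(fit, 4), round(residual, 4))
```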
Coefficient of Determination, R^{2}
In regression problems we try to predict the value of the response y from the explanatory variable x, yet the values of the response variable vary (think of how not all people of the same height have the same weight). The amount of variation in the response variable that can be explained (i.e. accounted for) by the explanatory variable is denoted by R^{2}. In our Exam Data example this value is 37%, meaning that 37% of the variation in the Final averages can be explained by the Quiz Averages (now you know why this is also referred to as an explanatory variable). Since this value appears in the output and is related to the correlation, we mention R^{2} now; we will take a further look at this statistic in a future lesson.
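The connection between r and R^{2} is just arithmetic, as this one-liner check shows:

```python
r = 0.608                    # correlation between Quiz Average and Final
r_squared = r ** 2           # coefficient of determination

print(round(r_squared, 2))   # 0.37: quiz averages explain ~37% of the variation
```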
Residuals or Prediction Error
As with most predictions, you expect there to be some error; that is, you expect the prediction not to be exactly correct (e.g. when predicting the final voting percentage you would expect the prediction to be accurate, but not necessarily the exact final voting percentage). Also, in regression, not every x-value has the same y-value, as we mentioned earlier: not every person with the same height (x-variable) has the same weight (y-variable). These errors in regression predictions are called prediction errors or residuals. The residuals are calculated by taking the observed y-value minus its corresponding predicted y-value, e_{i} = y_{i} − ŷ_{i}. Therefore we have as many residuals as we do y observations. The goal in least squares regression is to select the line that minimizes these residuals: in essence, we create a best-fit line that has the least amount of error.
Cautions about Correlation and Regression
Influential Outliers
In most practical circumstances an outlier decreases the value of a correlation coefficient and weakens the regression relationship, but in some circumstances an outlier may increase a correlation value and improve the regression. Figure 1 below provides an example of such an influential outlier: a point in a data set that strongly influences the regression equation and the correlation. Figure 1 represents data gathered on a person's Age and Blood Pressure, with Age as the explanatory variable. [Note: the regression plots were obtained in Minitab by Stat > Regression > Fitted Line Plot.] The top graph in Figure 1 represents the complete set of 10 data points. You can see that one point stands out in the upper right corner, the point (75, 220). The bottom graph is the regression with this point removed. The correlation between the original 10 data points is 0.694, found by taking the square root of 0.481 (the R-sq of 48.1%). But when this outlier is removed, the correlation drops to 0.032, from the square root of 0.1%. Also, notice how the regression equation originally has a slope greater than 0, but with the outlier removed the slope is practically 0, i.e. nearly a horizontal line. This example is somewhat exaggerated, but it illustrates the effect an outlier can have on the correlation and regression equation. Typically these influential points are far removed from the remaining data points in at least the horizontal direction. As seen here, the age of 75 and the blood pressure of 220 are both beyond the scope of the remaining data.
Correlation and Causation
If we conduct a study and establish a strong correlation, does this mean we also have causation? That is, if two variables are related, does that imply that one variable causes the other to occur? Consider smoking cigarettes and lung cancer: does smoking cause lung cancer? Initially this was answered yes, but that answer was based only on the strong correlation between smoking and lung cancer. Causation was not established until scientific research verified that smoking can lead to lung cancer. If you were to review the history of cigarette warning labels, the first mandated label only mentioned that smoking was hazardous to your health. Not until 1981 did the label mention that smoking causes lung cancer. (See warning labels). To establish causation one must rule out the possibility of lurking variable(s). The best method to accomplish this is through a solid design of your experiment, preferably one that uses a control group.
Summary
In this lesson we learned the following:
 the relationship between the slope of the regression line and correlation,
 the meaning of the Coefficient of Determination, R^{2},
 how to determine which variable is the response and which is the explanatory variable in a regression equation,
 that correlation measures the strength of a linear relationship between two variables,
 how outliers can influence a regression equation, and
 how to determine if variables are categorical or quantitative.
Next, let's take a look at the homework problems for this lesson. This will give you a chance to put what you have learned to use...