Learning objectives for this lesson
Upon completion of this lesson, you should be able to:
 Know what types of situations call for a chi-square analysis
 Begin to understand and apply the concept of "statistical significance"
 Calculate relative risk and odds ratio from a two-by-two table
 Explain the difference between relative risk and odds ratio
Determining Whether Two Categorical Variables are Related
The starting point for analyzing the relationship is to create a two-way table of counts. The rows are the categories of one variable and the columns are the categories of the second variable. We count how many observations are in each combination of row and column categories. When one variable is obviously the explanatory variable in the relationship, the convention is to use the explanatory variable to define the rows and the response variable to define the columns. This is not a hard and fast rule though.
Example 1: Students from a Stat 200 course were asked how important religion is in their lives (very important, fairly important, not important). A two-way table of counts for the relationship between religious importance and gender (female, male) is shown below.
          Fairly important   Not important   Very important   All
Female                  56              32               39   127
Male                    43              31               25    99
All                     99              63               64   226
As an example of reading the table, 32 females said religion is "not important" in their own lives compared to 31 males. Fairly even!
Example 2: Participants in the 2002 General Social Survey, a major national survey done every other year, were asked if they own a gun and whether they favor or oppose a law requiring all guns to be registered with local authorities. A two-way table of counts for these two variables is shown below. Rows indicate whether the person owns a gun or not.
Owns Gun   Favors Gun Law   Opposes Gun Law   All
No                    527                72   599
Yes                   206               102   308
All                   733               174   907
Percents for two-way tables
Percents are more useful than counts for describing how two categorical variables are related. Counts are difficult to interpret, especially with unequal numbers of observations in the rows (and columns).
Row Percents
Example 1 Continued : The term "row percents" describes conditional percents that give the percents out of each row total that fall in the various column categories.
Here are the row percents for gender and feelings about religious importance:
          Fairly important   Not important   Very important      All
Female               44.09           25.20            30.71   100.00
Male                 43.43           31.31            25.25   100.00
All                  43.81           27.88            28.32   100.00
Notice that row percents add to 100% across each row. In this example, 25.20% (32 / 127) of the females said religion was not important in their own lives and 31.31% (31 / 99) of the males felt similarly. Notice that although the count was about even for both genders (32 to 31), the percentage is somewhat higher for males because there are fewer males in the sample.
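The row-percent arithmetic can be checked with a few lines of Python. This is only a sketch: the counts come from the Example 1 table, while the names `counts` and `row_percents` are made up for illustration.

```python
# Counts from the Example 1 table: rows are gender, columns are the
# answers to the religious-importance question.
columns = ["Fairly important", "Not important", "Very important"]
counts = {
    "Female": [56, 32, 39],
    "Male": [43, 31, 25],
}


def row_percents(row):
    """Return each count in a row as a percent of that row's total."""
    total = sum(row)
    return [round(100 * count / total, 2) for count in row]


female_pct = row_percents(counts["Female"])
male_pct = row_percents(counts["Male"])

for gender, pcts in (("Female", female_pct), ("Male", male_pct)):
    print(gender, dict(zip(columns, pcts)))
```

Each row is divided by its own total, which is why two nearly equal counts (32 and 31) turn into different percents.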
Column Percents
The term "column percents" describes conditional percents that give the percents out of each column total that fall in the various row categories.
Example 2 Continued : Here are the column percents for gun ownership and feelings about stronger gun permit laws.
Owns Gun   Favors Gun Law   Opposes Gun Law
No                  71.90             41.38
Yes                 28.10             58.62
All                100.00            100.00
The column percents add to 100% down each column. Here, 28.10% of those who favor stronger permit laws own a gun, compared to 58.62% owning guns among those opposed to stronger permit laws.
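Column percents follow the same idea, except that each count is divided by its column total. A minimal sketch using the Example 2 counts is given here; the dictionary layout and the name `column_percents` are illustrative assumptions, not part of the lesson.

```python
# Counts from the Example 2 table: rows are gun ownership,
# columns are opinion about the gun law.
counts = {
    "No": {"Favors": 527, "Opposes": 72},
    "Yes": {"Favors": 206, "Opposes": 102},
}


def column_percents(table, column):
    """Return each row's share (in percent) of one column's total."""
    column_total = sum(row[column] for row in table.values())
    return {
        owner: round(100 * row[column] / column_total, 2)
        for owner, row in table.items()
    }


favors_pct = column_percents(counts, "Favors")
opposes_pct = column_percents(counts, "Opposes")
print("Favors: ", favors_pct)
print("Opposes:", opposes_pct)
```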
Conditional Percents as evidence of a relationship
Definition : Two categorical variables are related in the sample if at least two rows noticeably differ in the pattern of row percents.
Equivalent Definition : Two categorical variables are related in the sample if at least two columns noticeably differ in the pattern of column percents.
 In Example 1 Continued, the row percents for females and males have similar patterns. Feelings about religious importance and gender are NOT related.
 In Example 2 Continued, the two columns have clearly different sets of column percents. Gun ownership and opinion about stronger gun permit laws are related.
Statistical Significance of Observed Relationship / Chi-Square Test
The chi-square test for two-way tables is used as a guideline for declaring that the evidence in the sample is strong enough to allow us to generalize that the relationship holds for a larger population as well.
 Definition: A statistically significant relationship is a relationship observed in a sample that would have been unlikely to occur if there really were no relationship in the larger population.
 Concept : A chi-square statistic for two-way tables is sensitive to the strength of the observed relationship. The stronger the relationship, the larger the value of the chi-square statistic.
 Definition : A p-value for a chi-square statistic is the probability that the chi-square value would be as large as it is (or larger) if there really were no relationship in the population.
 IMPORTANT decision rule : An observed relationship will be called statistically significant when the p-value for a chi-square test is less than 0.05. In this case, we generalize that the relationship holds in the larger population.
Using Minitab
We'll use Minitab to carry out the chi-square procedure so you will not have to know how to calculate the chi-square value or find the p-value by hand.
If you want to try this on your own in Minitab, just open the Class Survey data (Class_Survey.MTW) and select [Note: file will not open unless the computer has Minitab]:
 Stat > Tables > Cross Tabulation and Chi-Square.
 Enter Gender for the rows and Religious Importance for Columns.
 Be sure the box is checked for both Counts and Percents.
 Click the Chi-Square button and be sure that the boxes for Chi-Square analysis, Expected Counts, and Each Cell's Contribution are checked.
 Then click OK and OK.
This will produce the output results given above for Example 1 and the following Chi-Square analysis.
Chi-square results for Example 1 (gender and feeling about religion):
Minitab version 14 gives results for two slightly different versions of the chi-square procedure. For gender and feelings about religious importance, results reported by Minitab are
All we need to do is find the p-value for the Pearson Chi-Square and interpret it. (The "Likelihood Ratio Chi-Square" statistic is another statistic calculated using a formula that differs from Pearson's. In this class we will always refer to the Pearson Chi-Square.) The p-value, 0.512, is above 0.05 so we declare the result to not be statistically significant. This means that we generalize that feelings about religious importance and gender are not related in a larger population such as PSU undergraduate students. This assumes that we consider our class survey to be a representative sample of all PSU undergraduate students. This means, for instance, that the students who participated in our survey are similar in makeup to the PSU undergrad population with regard to gender, race, GPA, and College.
Chi-square results for Example 2 (gun ownership and feeling about stronger permit laws):
For Example 2, Minitab results are
The p-value, 0.000, is below 0.05 so we declare the result to be statistically significant. This means that we can generalize that gun ownership and opinion about permit laws are related in a larger population.
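For readers without Minitab, the two p-values quoted above can be approximated with a short Python sketch. It is not Minitab's implementation: it computes the Pearson chi-square statistic directly from the counts in the Example 1 and Example 2 tables, and it uses closed-form tail probabilities that only cover 1 or 2 degrees of freedom (enough for these two tables). The function names are illustrative assumptions.

```python
import math


def chi_square_p_value(statistic, df):
    """Upper-tail chi-square probability, for df = 1 or df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(statistic / 2))
    if df == 2:
        return math.exp(-statistic / 2)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")


def chi_square_test(table):
    """Pearson chi-square test of independence for a two-way table of counts.

    Returns (statistic, degrees_of_freedom, p_value).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, chi_square_p_value(statistic, df)


religion = [[56, 32, 39], [43, 31, 25]]   # Example 1: Female, Male
guns = [[527, 72], [206, 102]]            # Example 2: No gun, Owns gun

stat1, df1, p1 = chi_square_test(religion)
stat2, df2, p2 = chi_square_test(guns)
print(f"Example 1: chi-square = {stat1:.3f}, df = {df1}, p-value = {p1:.3f}")
print(f"Example 2: chi-square = {stat2:.3f}, df = {df2}, p-value = {p2:.3f}")
```

A p-value that prints as 0.000 is not exactly zero; it is simply smaller than 0.0005 and rounds down at three decimal places.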
Null and Alternative Hypotheses
The chi-square procedure is used to decide between two competing generalizations about the larger population.
Null hypothesis [written as $H_0$]: The two variables are independent.
Alternative hypothesis [written as $H_a$]: The two variables are dependent.
When the result is statistically significant (p-value less than 0.05) we pick the alternative hypothesis; otherwise, we pick the null.
Example 3: In the class survey described in Example 1, students were asked whether they smoke cigarettes. The following Minitab results help to show if and how males and females differ. Counts, row percents, and chi-square results are given.
The percent that smoke is roughly doubled for males compared to females (10.10% versus 5.51%). The observed relationship is not statistically significant because the p-value, 0.194, is greater than 0.05. We decide in favor of the null hypothesis: smoking and gender are not related in the population of PSU undergraduate students.
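Calculating the Chi-Square Test Statistic
The chi-square test statistic for a test of independence of two categorical variables is found by:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where the sum is taken over all cells of the table, O represents the observed frequency, and E is the expected frequency under the null hypothesis, computed by:
$$E = \frac{\text{row total} \times \text{column total}}{\text{sample size}}$$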
From the previous output, you can see that the Expected Count for Females who said No was 117.45, which is found by taking the row total for Females (127) times the column total for No (209), then dividing by the sample size (226). This procedure is carried out for each cell.
The chi-square contributions given in each cell add up to the chi-square test statistic of 1.684. Looking at the chi-square contribution of 0.05550 for Females who said No, this value is found by taking the squared difference between the Observed Count (120) and the Expected Count (117.45), then dividing by the Expected Count. The general concept is: if the expected and observed counts are not too different, then we would conclude that the two variables are not related (i.e. independent). However, if the observed counts were much different than what would be expected if independence existed, we would conclude that there is an association (i.e. dependence) between the two variables.
NOTE: The * in the output are not relevant. They appear in the ALL areas because we do not calculate expected counts or chi-square values for the ALL categories.
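The per-cell calculation can be traced in a short Python sketch. The Minitab output is not reproduced in this text, so the four cell counts used here are an assumption: they are reconstructed from the figures quoted above (row totals of 127 and 99, 120 females answering No, and smoking rates of 5.51% and 10.10%). The variable names are illustrative.

```python
# Reconstructed Example 3 counts (assumed, see lead-in): gender by "Do you smoke?"
observed = {
    ("Female", "No"): 120, ("Female", "Yes"): 7,
    ("Male", "No"): 89, ("Male", "Yes"): 10,
}

# Row totals, column totals, and overall sample size.
row_totals = {}
col_totals = {}
for (row, col), count in observed.items():
    row_totals[row] = row_totals.get(row, 0) + count
    col_totals[col] = col_totals.get(col, 0) + count
n = sum(observed.values())

expected = {}
contribution = {}
for (row, col), count in observed.items():
    # Expected count = row total * column total / sample size.
    e = row_totals[row] * col_totals[col] / n
    expected[(row, col)] = e
    # Cell contribution = (observed - expected)^2 / expected.
    contribution[(row, col)] = (count - e) ** 2 / e

chi_square = sum(contribution.values())

for cell in observed:
    print(cell, f"expected = {expected[cell]:.2f}",
          f"contribution = {contribution[cell]:.5f}")
print(f"chi-square = {chi_square:.3f}")
```

With these counts the sketch agrees with the quoted expected count (117.45), cell contribution (0.05550), and test statistic (1.684), which suggests the reconstruction is consistent with the original output.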
Comparing Risks
We'll look at four different "risk" statistics.
 Risk
 Relative Risk
 Percent Increase in Risk
 Odds Ratio
Risk
The risk of a bad outcome can be expressed either as the fraction or the percent of a group that experiences the outcome.
Example : If the risk of asthma for teens is 0.06, or 6%, it means that 6% of all teens experience asthma.
Caution : When reading a risk statistic, be sure you understand exactly the group, and the time frame for which the risk is being defined. For instance, the statistic that the risk of heart disease for men is 0.40 (40%) obviously doesn't apply to college age men at this moment. It must have to do with a lifetime risk.
Relative Risk
Relative risk compares the risk of a particular outcome in two different groups. The comparison is calculated as Relative Risk = $\frac{\text{risk in group 1}}{\text{risk in group 2}}$. Thus relative risk gives the risk for group 1 as a multiple of the risk for group 2.
Example: Suppose 7% of teenage girls have asthma compared to 5% of teenage boys. Using the females as group 1, the relative risk for girls compared to boys = $\frac{7\%}{5\%}$ = 1.4. The risk of asthma is 1.4 times as great for teenage females as it is for teenage males.
Caution : Watch out for relative risk statistics where no baseline information is given about the actual risk. For instance, it doesn't mean much to say that beer drinkers have twice the risk of stomach cancer as nondrinkers unless we know the actual risks. The risk of stomach cancer might actually be very low, even for beer drinkers. For example, 2 in 1,000,000 is twice the size of 1 in 1,000,000, but it would still be a very low risk.
Percent Increased (or decreased) Risk
Percent increased risk is calculated like the percent increase in a price. Two equivalent formulas can be used:
Percent increased risk = $\frac{\text{risk in group 1} - \text{risk in group 2}}{\text{risk in group 2}} \times 100\%$
OR
Percent increased risk = (relative risk − 1) × 100%
Example: Suppose 7% of teenage girls have asthma compared to 5% of teenage boys. The percent increased risk for girls is $\frac{7\% - 5\%}{5\%} \times 100\% = \frac{2\%}{5\%} \times 100\% = 40\%$.
The risk of asthma for girls (7%) is a 40% increase from the risk for boys (5%).
Alternatively, we found that the relative risk is 1.4 for these values. The percent increased risk could also have been computed as (relative risk − 1) × 100% = (1.4 − 1) × 100% = 40%.
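The two formulas can be confirmed to agree with a small Python sketch using the asthma figures from the example. The function names are illustrative assumptions.

```python
def relative_risk(risk_group1, risk_group2):
    """Risk in group 1 expressed as a multiple of the risk in group 2."""
    return risk_group1 / risk_group2


def percent_increased_risk(risk_group1, risk_group2):
    """Percent by which the group 1 risk exceeds the group 2 risk."""
    return (risk_group1 - risk_group2) / risk_group2 * 100


girls, boys = 0.07, 0.05  # asthma risks from the example

rr = relative_risk(girls, boys)
pct_direct = percent_increased_risk(girls, boys)
pct_from_rr = (rr - 1) * 100  # the second, equivalent formula

print(f"relative risk = {rr:.2f}")
print(f"percent increased risk = {pct_direct:.1f}% (direct), "
      f"{pct_from_rr:.1f}% (from relative risk)")
```

A negative result from either formula would indicate a percent decrease in risk for group 1.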
Odds and Odds Ratios
The odds of an event = $\frac{\text{probability the event happens}}{\text{probability the event does not happen}}$. In a sense, odds express risk by comparing the likelihood of a risky event happening to the likelihood it does not happen.
The odds ratio for comparing two groups = $\frac{\text{odds for group 1}}{\text{odds for group 2}}$.
The value expresses the odds in group 1 as a multiple of the odds for group 2.
Example: Suppose 7% of teenage girls have asthma compared to 5% of teenage boys. Odds ratio = $\frac{0.07 / 0.93}{0.05 / 0.95} \approx \frac{0.0753}{0.0526} \approx 1.43$. The odds of asthma for girls are about 1.43 times the odds for boys.
Caution : Relative risk and odds ratio are often confused. Look carefully at comparisons of risks to determine if we're comparing odds or comparing risks.
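To see why the two measures should not be confused, the sketch below computes both for the asthma example and for a second pair of risks. The second pair (70% versus 50%) is hypothetical, chosen only because it has the same relative risk; the function names are also illustrative.

```python
def odds(risk):
    """Convert a risk (a probability) into odds: P(event) / P(no event)."""
    return risk / (1 - risk)


def odds_ratio(risk_group1, risk_group2):
    """Odds in group 1 expressed as a multiple of the odds in group 2."""
    return odds(risk_group1) / odds(risk_group2)


def relative_risk(risk_group1, risk_group2):
    return risk_group1 / risk_group2


# Asthma example: a rare outcome, so the two measures are close.
rr_rare = relative_risk(0.07, 0.05)
or_rare = odds_ratio(0.07, 0.05)

# Hypothetical common outcome with the same relative risk.
rr_common = relative_risk(0.70, 0.50)
or_common = odds_ratio(0.70, 0.50)

print(f"rare outcome:   relative risk = {rr_rare:.2f}, odds ratio = {or_rare:.2f}")
print(f"common outcome: relative risk = {rr_common:.2f}, odds ratio = {or_common:.2f}")
```

When the outcome is rare the odds ratio stays close to the relative risk, but for common outcomes the two can differ substantially even though the relative risk is unchanged.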
Contingency Table Simulation
The following two-by-two table (called two-by-two because the table has two rows and two columns) will provide some insight into "seeing" how the distribution of cell counts affects the p-value, and thus whether the result is significant, plus give you some practice in calculating odds ratios and relative risks.
Start by entering for Males the value 20 for Yes and then 40 for No. Enter these same values for Women and then click Compute. Since the distribution is identical, the odds ratio and relative risk are both one, and there is no statistically significant relationship between gender and sleep apnea (p-value = 1 > 0.05).
Now start changing the values for Women by adding to Yes and decreasing No by the same amount (say 10), and then repeat this step again. Note how the odds ratio and relative risk continue to decrease, while the chi-square statistic and resulting p-value trend toward stronger statistical evidence (i.e. larger chi-square and smaller p-value).
Continue to adjust and substitute numbers on your own while keeping track of how the distribution changes affect the results.
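If the interactive table is not available, the same experiment can be approximated in Python. This sketch makes several assumptions that the lesson does not confirm: men are treated as group 1 (the numerator), which is consistent with the ratios decreasing as the Yes count for women grows; the Pearson chi-square is computed without a continuity correction; and every cell count is positive. The function name is illustrative.

```python
import math


def summarize_two_by_two(group1, group2):
    """Summarize a 2x2 table given (yes, no) counts for two groups.

    Assumes every count is positive; a zero cell would divide by zero.
    """
    yes1, no1 = group1
    yes2, no2 = group2
    n1, n2 = yes1 + no1, yes2 + no2
    n = n1 + n2

    # Risk and odds comparisons, with group 1 in the numerator.
    relative_risk = (yes1 / n1) / (yes2 / n2)
    odds_ratio = (yes1 / no1) / (yes2 / no2)

    # Pearson chi-square: sum of (observed - expected)^2 / expected.
    observed = [[yes1, no1], [yes2, no2]]
    row_totals = [n1, n2]
    col_totals = [yes1 + yes2, no1 + no2]
    chi_square = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom the upper-tail probability has a closed form.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return {
        "relative_risk": relative_risk,
        "odds_ratio": odds_ratio,
        "chi_square": chi_square,
        "p_value": p_value,
    }


men = (20, 40)  # (Yes, No)
steps = [summarize_two_by_two(men, women)
         for women in [(20, 40), (30, 30), (40, 20)]]

for result in steps:
    print({name: round(value, 4) for name, value in result.items()})
```

Each step moves 10 women from No to Yes, mirroring the instructions above, so the printed rows can be compared with what the interactive table reports.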