Association Between Categorical Variables

Learning objectives for this lesson

Upon completion of this lesson, you should be able to:


Determining Whether Two Categorical Variables are Related

The starting point for analyzing the relationship is to create a two-way table of counts. The rows are the categories of one variable and the columns are the categories of the second variable. We count how many observations are in each combination of row and column categories. When one variable is obviously the explanatory variable in the relationship, the convention is to use the explanatory variable to define the rows and the response variable to define the columns. This is not a hard and fast rule though.

Example 1: Students from a Stat 200 course were asked how important religion is in their lives (very important, fairly important, not important). A two-way table of counts for the relationship between religious importance and gender (female, male) is shown below.

 
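|        | Fairly important | Not important | Very important | All |
|--------|------------------|---------------|----------------|-----|
| Female | 56               | 32            | 39             | 127 |
| Male   | 43               | 31            | 25             | 99  |
| All    | 99               | 63            | 64             | 226 |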

As an example of reading the table, 32 females said religion is "not important" in their own lives compared to 31 males. Fairly even!
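As a minimal sketch of how the "All" margins of such a table are obtained, the Python snippet below stores the Example 1 cell counts and sums them by row and by column. The variable names are illustrative and not part of the course materials.

```python
# Cell counts from Example 1: rows are gender, columns are religious importance.
counts = {
    "Female": {"Fairly important": 56, "Not important": 32, "Very important": 39},
    "Male":   {"Fairly important": 43, "Not important": 31, "Very important": 25},
}

# Row totals: add up the cells across each row.
row_totals = {row: sum(cells.values()) for row, cells in counts.items()}

# Column totals: add up each column down the rows.
columns = list(next(iter(counts.values())))
col_totals = {col: sum(counts[row][col] for row in counts) for col in columns}

# Grand total: the overall number of observations.
grand_total = sum(row_totals.values())

print("Row totals:   ", row_totals)
print("Column totals:", col_totals)
print("Grand total:  ", grand_total)
```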

Example 2: Participants in the 2002 General Social Survey, a major national survey done every other year, were asked if they own a gun and whether they favor or oppose a law requiring all guns to be registered with local authorities. A two-way table of counts for these two variables is shown below. Rows indicate whether the person owns a gun or not.

| Owns Gun | Favors Gun Law | Opposes Gun Law | All |
|----------|----------------|-----------------|-----|
| No       | 527            | 72              | 599 |
| Yes      | 206            | 102             | 308 |
| All      | 733            | 174             | 907 |

Percents for two-way tables

Percents are more useful than counts for describing how two categorical variables are related. Counts are difficult to interpret, especially with unequal numbers of observations in the rows (and columns).

Row Percents

Example 1 Continued : The term "row percents" describes conditional percents that give the percents out of each row total that fall in the various column categories.

Here are the row percents for gender and feelings about religious importance:

 
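|        | Fairly important | Not important | Very important | All    |
|--------|------------------|---------------|----------------|--------|
| Female | 44.09            | 25.20         | 30.71          | 100.00 |
| Male   | 43.43            | 31.31         | 25.25          | 100.00 |
| All    | 43.81            | 27.88         | 28.32          | 100.00 |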

Notice that the row percents add to 100% across each row. In this example, 25.20% (32 / 127) of the females said religion was not important in their own lives, and 31.31% (31 / 99) of the males felt similarly. Although the counts were about even for the two genders (32 and 31), the percentage is higher for males because there are fewer males in the sample.
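The row-percent calculation can be sketched in a few lines of Python. This is only an illustration using the Example 1 counts; the function name `row_percents` is made up for this sketch.

```python
# Cell counts from Example 1: rows are gender, columns are religious importance.
counts = {
    "Female": {"Fairly important": 56, "Not important": 32, "Very important": 39},
    "Male":   {"Fairly important": 43, "Not important": 31, "Very important": 25},
}

def row_percents(table):
    """Express each cell as a percent of its own row total (2 decimals)."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: round(100 * n / total, 2) for col, n in cells.items()}
    return result

percents = row_percents(counts)
for row, cells in percents.items():
    print(row, cells)
```

Each row of the result sums to 100 (up to rounding), which is a quick check that row percents, not column percents, were computed.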

Column Percents

The term "column percents" describes conditional percents that give the percents out of each column total that fall in the various row categories.

Example 2 Continued : Here are the column percents for gun ownership and feelings about stronger gun permit laws.

| Owns Gun | Favors Gun Law | Opposes Gun Law |
|----------|----------------|-----------------|
| No       | 71.90          | 41.38           |
| Yes      | 28.10          | 58.62           |
| All      | 100.00         | 100.00          |

The column percents add to 100% down each column. Here, 28.10% of those who favor stronger permit laws own a gun, compared to 58.62% owning guns among those opposed to stronger permit laws.
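Column percents follow the same idea as row percents but divide by column totals. A minimal sketch in Python using the Example 2 counts (the function name `column_percents` and the shortened column labels are illustrative):

```python
# Cell counts from Example 2: rows are gun ownership, columns are opinion on the law.
counts = {
    "No":  {"Favors": 527, "Opposes": 72},
    "Yes": {"Favors": 206, "Opposes": 102},
}

def column_percents(table):
    """Express each cell as a percent of its own column total (2 decimals)."""
    columns = list(next(iter(table.values())))
    col_totals = {c: sum(table[r][c] for r in table) for c in columns}
    return {
        r: {c: round(100 * table[r][c] / col_totals[c], 2) for c in columns}
        for r in table
    }

percents = column_percents(counts)
for row, cells in percents.items():
    print(row, cells)
```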

Conditional Percents as evidence of a relationship

Definition : Two categorical variables are related in the sample if at least two rows noticeably differ in the pattern of row percents.

Equivalent Definition : Two categorical variables are related in the sample if at least two columns noticeably differ in the pattern of column percents.


Statistical Significance of Observed Relationship / Chi-Square Test

The chi-square test for two-way tables is used as a guideline for declaring that the evidence in the sample is strong enough to allow us to generalize that the relationship holds for a larger population as well.


Using Minitab

We'll use Minitab to carry out the chi-square procedure so you will not have to know how to calculate the chi-square value or find the p-value by hand.

If you want to try this on your own in Minitab, just open the Class Survey data (Class_Survey.MTW) and select [Note: the file will not open unless the computer has Minitab]:

This will produce the output results given above for Example 1 and the following Chi-Square analysis.

Chi-square results for Example 1 (gender and feeling about religion):

Minitab version 14 gives results for two slightly different versions of the chi-square procedure. For gender and feelings about religious importance, the results reported by Minitab are:

[Minitab output: chi-square results for gender and feelings about religious importance]

All we need to do is find the p-value for the Pearson Chi-Square and interpret it. (The "Likelihood Ratio Chi-Square" statistic is another statistic calculated using a formula that differs from Pearson's. In this class we will always refer to the Pearson Chi-Square.) The p-value, 0.512, is above 0.05, so we declare the result not statistically significant. This means that we generalize that feelings about religious importance and gender are not related in a larger population such as PSU undergraduate students. This generalization assumes that our class survey is a representative sample of all PSU undergraduate students; that is, the students who participated in our survey are similar in make-up to the PSU undergraduate population with regard to gender, race, GPA, and college.
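Although the course relies on Minitab, the Pearson chi-square p-value of 0.512 can be checked with a short Python sketch. It assumes the Example 1 counts given earlier and uses the fact that, for a table with 2 degrees of freedom (2 rows by 3 columns), the upper-tail area of the chi-square distribution is exp(-x / 2). That shortcut holds only for 2 degrees of freedom; the function name is illustrative.

```python
import math

# Observed counts from Example 1: rows are Female, Male;
# columns are Fairly important, Not important, Very important.
observed = [[56, 32, 39],
            [43, 31, 25]]

def pearson_chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

chi_sq = pearson_chi_square(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)  # (rows - 1) * (columns - 1)

# Closed-form upper-tail area, valid only because df == 2 here.
p_value = math.exp(-chi_sq / 2)

print(f"chi-square = {chi_sq:.3f}, df = {df}, p-value = {p_value:.3f}")
```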

Chi-square results for Example 2 (gun ownership and feeling about stronger permit laws):

For Example 2, the Minitab results are:

[Minitab output: chi-square results for gun ownership and opinion about the gun law]

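The p-value, 0.000, is below 0.05, so we declare the result to be statistically significant. (A reported value of 0.000 means the p-value rounds to zero at three decimal places, not that it is exactly zero.) This means that we can generalize that gun ownership and opinion about permit laws are related in a larger population.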
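To see why such a p-value displays as 0.000, the sketch below recomputes the Pearson chi-square for the Example 2 counts in Python. It assumes no continuity correction is applied and uses the identity that, for 1 degree of freedom, the upper-tail chi-square area equals erfc(sqrt(x / 2)).

```python
import math

# Observed counts from Example 2: rows are gun ownership (No, Yes),
# columns are opinion on the gun law (Favors, Opposes).
observed = [[527, 72],
            [206, 102]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_sq = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi_sq += (observed[i][j] - expected) ** 2 / expected

# Upper-tail area of a chi-square distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(f"chi-square = {chi_sq:.2f}")
print(f"p-value to three decimals = {p_value:.3f}")
print(f"p-value in scientific notation = {p_value:.1e}")
```

The p-value is tiny but positive; it only appears as zero once rounded to three decimals.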

Null and Alternative Hypotheses

The chi-square procedure is used to decide between two competing generalizations about the larger population.

Null hypothesis [written as Ho]: The two variables are independent.
Alternative hypothesis [written as Ha]: The two variables are dependent.

When the result is statistically significant (p-value less than 0.05) we pick the alternative hypothesis; otherwise, we pick the null.

Example 3: In the class survey described in Example 1, students were asked whether they smoke cigarettes. The following Minitab results help to show if and how males and females differ. Counts, row percents, and chi-square results are given.

[Minitab output: counts, row percents, expected counts, and chi-square results for gender and smoking]

The percent that smoke is roughly twice as high for males as for females (10.10% versus 5.51%). The observed relationship is not statistically significant because the p-value, 0.194, is greater than 0.05. We decide in favor of the null hypothesis: smoking and gender are not related in the population of PSU undergraduate students.


Calculating the Chi-Square Test Statistic

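The chi-square test statistic for a test of independence of two categorical variables is found by:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where the sum is taken over all cells of the table, O represents the observed frequency, and E is the expected frequency under the null hypothesis, computed by:

$$E = \frac{\text{row total} \times \text{column total}}{\text{sample size}}$$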

From the previous output, you can see that the Expected Count for Females who said No was 117.45, which is found by multiplying the row total for Females (127) by the column total for No (209) and then dividing by the sample size (226). This procedure is carried out for each cell.

The chi-square contributions given in each cell add up to the chi-square test statistic of 1.684. The chi-square contribution of 0.05550 for Females who said No is found by taking the squared difference between the Observed Count (120) and the Expected Count (117.45) and then dividing by the Expected Count. The general concept is: if the expected and observed counts are not too different, then we would conclude that the two variables are not related (i.e. independent). However, if the observed counts are much different from what would be expected under independence, we would conclude that there is an association (i.e. dependence) between the two variables.

NOTE: The * in the output are not relevant. They appear in the ALL areas because we do not calculate expected counts or chi-square values for the ALL categories.
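The cell-by-cell arithmetic for Example 3 can be reproduced with the Python sketch below. Only the Females/No count (120) and the totals (127, 209, 226) are stated in the text; the other three cell counts are an assumption, derived here from those totals and from the 99 males in the class survey.

```python
import math

# Observed counts for Example 3: rows are Female, Male; columns are No, Yes (smoke).
# 120 is stated in the text; 7, 89 and 10 are derived from the totals (assumption).
observed = [[120, 7],
            [89, 10]]

row_totals = [sum(row) for row in observed]        # 127 females, 99 males
col_totals = [sum(col) for col in zip(*observed)]  # 209 No, 17 Yes
n = sum(row_totals)                                # 226 students

# Expected count = row total * column total / sample size, for every cell.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Each cell's contribution = (observed - expected)^2 / expected.
contributions = [
    [(observed[i][j] - expected[i][j]) ** 2 / expected[i][j] for j in range(2)]
    for i in range(2)
]

chi_sq = sum(sum(row) for row in contributions)

# Upper-tail area for 1 degree of freedom (valid for a 2x2 table only).
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(f"Expected count, Females/No: {expected[0][0]:.2f}")
print(f"Contribution, Females/No:   {contributions[0][0]:.5f}")
print(f"Chi-square statistic:       {chi_sq:.3f}")
print(f"p-value:                    {p_value:.3f}")
```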


Comparing Risks

We'll look at four different "risk" statistics.

Risk

The risk of a bad outcome can be expressed either as the fraction or the percent of a group that experiences the outcome.

Example : If the risk of asthma for teens is 0.06, or 6%, it means that 6% of all teens experience asthma.

Caution : When reading a risk statistic, be sure you understand exactly the group, and the time frame for which the risk is being defined. For instance, the statistic that the risk of heart disease for men is 0.40 (40%) obviously doesn't apply to college age men at this moment. It must have to do with a lifetime risk.

Relative Risk

Relative risk compares the risk of a particular outcome in two different groups. The comparison is calculated as Relative Risk = $\frac{\text{risk in group 1}}{\text{risk in group 2}}$. Thus relative risk gives the risk for group 1 as a multiple of the risk for group 2.

Example: Suppose 7% of teenage girls have asthma compared to 5% of teenage boys. Using the girls as group 1, the relative risk for girls compared to boys = $\frac{7\%}{5\%}$ = 1.4. The risk of asthma is 1.4 times as great for teenage girls as it is for teenage boys.
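As a small illustration, the relative risk for the asthma example can be computed in Python. The function name `relative_risk` is chosen for this sketch only.

```python
def relative_risk(risk_group1, risk_group2):
    """Risk in group 1 expressed as a multiple of the risk in group 2."""
    return risk_group1 / risk_group2

girls_risk = 0.07  # 7% of teenage girls have asthma
boys_risk = 0.05   # 5% of teenage boys have asthma

rr = relative_risk(girls_risk, boys_risk)
print(f"Relative risk (girls vs. boys): {rr:.2f}")
```

Swapping the arguments gives the relative risk for boys compared to girls, which is the reciprocal, so it matters which group is called group 1.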

Caution : Watch out for relative risk statistics where no baseline information is given about the actual risk. For instance, it doesn't mean much to say that beer drinkers have twice the risk of stomach cancer as non-drinkers unless we know the actual risks. The risk of stomach cancer might actually be very low, even for beer drinkers. For example, 2 in 1,000,000 is twice the size of 1 in 1,000,000, but it would still be a very low risk.

Percent Increased (or decreased) Risk

Percent increased risk is calculated like the percent increase in a price. Two equivalent formulas can be used:

Percent increased risk = $\frac{\text{risk in group 1} - \text{risk in group 2}}{\text{risk in group 2}}$ × 100%

OR

Percent increased risk = (relative risk - 1) ×100%

Example: Suppose 7% of teenage girls have asthma compared to 5% of teenage boys. The percent increased risk for girls is $\frac{7\% - 5\%}{5\%}$ × 100% = $\frac{2}{5}$ × 100% = 40%.

The risk of asthma for girls (7%) is a 40% increase over the risk for boys (5%).

Alternatively, we found that the relative risk is 1.4 for these values. The percent increased risk could also have been computed as (relative risk - 1) ×100% = (1.4 - 1) ×100% = 40%.
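A brief Python sketch confirms that the two ways of computing percent increased risk agree for the asthma example. The function names are illustrative.

```python
import math

def percent_increased_risk(risk_group1, risk_group2):
    """Difference in risks as a percent of the group 2 (baseline) risk."""
    return (risk_group1 - risk_group2) / risk_group2 * 100

def percent_increased_risk_from_rr(relative_risk):
    """Same quantity computed from the relative risk."""
    return (relative_risk - 1) * 100

girls_risk = 0.07
boys_risk = 0.05

direct = percent_increased_risk(girls_risk, boys_risk)
via_rr = percent_increased_risk_from_rr(girls_risk / boys_risk)

print(f"Direct formula:     {direct:.1f}%")
print(f"From relative risk: {via_rr:.1f}%")
print("Formulas agree:", math.isclose(direct, via_rr))
```

A negative result would indicate a percent decreased risk, which happens when group 1 has the lower risk.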

Odds and Odds Ratios

The odds of an event = $\frac{\text{probability the event happens}}{\text{probability the event does not happen}}$. In a sense, odds express risk by comparing the likelihood of a risky event happening to the likelihood that it does not happen.

The odds ratio for comparing two groups = $\frac{\text{odds for group 1}}{\text{odds for group 2}}$.

The value expresses the odds in group 1 as a multiple of the odds for group 2.

Example: Suppose 7% of teenage girls have asthma compared to 5% of teenage boys. Odds ratio = $\frac{7/93}{5/95}$ ≈ 1.43. The odds of asthma for girls are about 1.43 times the odds for boys.

Caution : Relative risk and odds ratio are often confused. Look carefully at comparisons of risks to determine whether we're comparing odds or comparing risks.
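To make the distinction concrete, the following Python sketch computes both the odds ratio and the relative risk for the asthma example, so the two statistics can be compared side by side. The function names are illustrative.

```python
def odds(risk):
    """Odds of an event: chance it happens divided by chance it does not."""
    return risk / (1 - risk)

def odds_ratio(risk_group1, risk_group2):
    """Odds in group 1 expressed as a multiple of the odds in group 2."""
    return odds(risk_group1) / odds(risk_group2)

girls_risk = 0.07
boys_risk = 0.05

girls_odds = odds(girls_risk)  # 7 to 93
boys_odds = odds(boys_risk)    # 5 to 95
asthma_or = odds_ratio(girls_risk, boys_risk)
asthma_rr = girls_risk / boys_risk

print(f"Odds for girls: {girls_odds:.4f}")
print(f"Odds for boys:  {boys_odds:.4f}")
print(f"Odds ratio:     {asthma_or:.2f}")
print(f"Relative risk:  {asthma_rr:.2f}")
```

When the risks are small, as here, the odds ratio and relative risk are close but not identical; the gap widens as the risks grow.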


Contingency Table Simulation

The following two-by-two table (called two-by-two because it has two rows and two columns) shows how the distribution of cell counts affects the p-value, and therefore whether the result is significant. It also gives you some practice in calculating odds ratios and relative risks.

Start by entering, for Males, the value 20 for Yes and 40 for No. Enter the same values for Women and then click Compute. Since the two distributions are identical, the odds ratio and relative risk both equal one, and there is no statistically significant relationship between gender and sleep apnea (p-value = 1 > 0.05).

Now change the values for Women by increasing Yes and decreasing No by the same amount (say 10), and then repeat this step. Note how the odds ratio and relative risk continue to decrease, while the chi-square statistic and p-value show stronger statistical evidence (i.e. a larger chi-square and a smaller p-value).
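The same experiment can be mimicked offline with a short Python sketch. It is not the simulation tool itself: it assumes males are treated as group 1, uses the Pearson chi-square without a continuity correction, and computes the p-value with the 1-degree-of-freedom identity erfc(sqrt(x / 2)). The function name is illustrative.

```python
import math

def two_by_two_summary(group1, group2):
    """Relative risk, odds ratio, chi-square and p-value for (yes, no) counts."""
    yes1, no1 = group1
    yes2, no2 = group2

    relative_risk = (yes1 / (yes1 + no1)) / (yes2 / (yes2 + no2))
    odds_ratio = (yes1 / no1) / (yes2 / no2)

    table = [[yes1, no1], [yes2, no2]]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi_sq = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi_sq += (table[i][j] - expected) ** 2 / expected

    # Upper-tail chi-square area with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_sq / 2))
    return relative_risk, odds_ratio, chi_sq, p_value

males = (20, 40)  # (Yes, No), held fixed
women_steps = [(20, 40), (30, 30), (40, 20)]  # move 10 from No to Yes each step

results = [two_by_two_summary(males, women) for women in women_steps]
for women, (rr, oratio, chi_sq, p) in zip(women_steps, results):
    print(f"Women {women}: RR = {rr:.3f}, OR = {oratio:.3f}, "
          f"chi-square = {chi_sq:.3f}, p-value = {p:.4f}")
```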

Continue to adjust and substitute numbers on your own while keeping track of how the distribution changes affect the results.

© 2007 The Pennsylvania State University. All rights reserved.