Comparing Two Categorical Variables
Some categorical variables exist naturally (e.g. a person’s race, political party affiliation, or class standing), while others are created by grouping a quantitative variable (e.g. taking height and creating the groups Short, Medium, and Tall). We analyze categorical data by recording the counts or percents of cases occurring in each category. Although you can compare several categorical variables at once, we are only going to consider the relationship between two such variables.
The Class Survey data set (CLASS_SURVEY.MTW or CLASS_SURVEY.XLS) consists of student responses to a survey given last semester in a Stat200 course. We can construct a two-way table showing the relationship between Smoke Cigarettes (row variable) and Gender (column variable) using either Minitab or SPSS.
The marginal distribution along the bottom (the bottom row All) gives the distribution by Gender only (disregarding Smoke Cigarettes). The marginal distribution on the right (the values under the column All) is for Smoke Cigarettes only (disregarding Gender). Since more females (127) than males (99) participated in the survey, we should report percentages instead of counts in order to compare the cigarette smoking behavior of females and males. These percentages give the conditional distribution of Smoke Cigarettes given Gender, which means we are treating Gender as an explanatory variable (i.e. a variable that we use to explain what is happening with another variable). Each conditional percentage is calculated by taking the number of observations at one level of Smoke Cigarettes (No, Yes) within one level of Gender (Female, Male) and dividing by the total number of observations at that level of Gender. For example, the conditional percentage of No given Female is 120/127 = 94.5%.
We can calculate these conditional probabilities using either Minitab or SPSS:
Although you do not need the counts, having them visible helps in understanding how the conditional probabilities of smoking behavior within each gender are calculated. We can see from this display that the 94.49% conditional probability of No given that Gender is Female is found by dividing the number of students who are both No and Female (count of 120) by the number of Females (count of 127). The information under Cell Contents tells you what is displayed in each cell: the top value is Count and the bottom value is Percent of Column. Alternatively, we could compute the conditional probabilities of Gender given Smoke Cigarettes by calculating the Row Percents; for example, 120 divided by 209 gives 57.42%. This would be interpreted as follows: of those who say they do not smoke, 57.42% are Female, meaning that 42.58% of those who do not smoke are Male (found by 100% - 57.42%).
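The column and row percents described above can be checked with a short Python sketch. The text quotes only some of the cell counts (120 Female/No, 127 Females, 99 Males, 209 No); the remaining cells below are derived from those totals by subtraction, and the variable names are illustrative rather than anything produced by Minitab or SPSS.

```python
# Two-way table of Smoke Cigarettes (rows) by Gender (columns).
# 120, 127, 99 and 209 are quoted in the text; the other cells are
# derived by subtraction (89 = 209 - 120, 7 = 127 - 120, 10 = 99 - 89).
counts = {
    "No":  {"Female": 120, "Male": 89},
    "Yes": {"Female": 7,   "Male": 10},
}
genders = ["Female", "Male"]

# Marginal totals: the "All" row (by gender) and the "All" column (by smoking).
col_totals = {g: sum(row[g] for row in counts.values()) for g in genders}
row_totals = {s: sum(row.values()) for s, row in counts.items()}


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Column percents: distribution of Smoke Cigarettes given Gender.
col_pct = {s: {g: pct(counts[s][g], col_totals[g]) for g in genders}
           for s in counts}

# Row percents: distribution of Gender given Smoke Cigarettes.
row_pct = {s: {g: pct(counts[s][g], row_totals[s]) for g in genders}
           for s in counts}

print(col_pct["No"]["Female"])  # 94.49 (120 / 127)
print(row_pct["No"]["Female"])  # 57.42 (120 / 209)
print(row_pct["No"]["Male"])    # 42.58 (89 / 209)
```

Dividing by a column total answers "given this gender, what share smoke?", while dividing by a row total answers "given this smoking status, what share are female or male?". The same cell count of 120 therefore yields two different percentages.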
Suppose, hypothetically, that observational studies of sugar and hyperactivity have been conducted, first separately for boys and girls and then with the data combined. The following tables list these hypothetical results:
Notice that the hyperactivity rates for Boys (67%) and for Girls (25%) are the same regardless of sugar intake. These percentages are exactly what we would expect if no relationship existed between sugar intake and activity level. However, when the two groups are combined, the hyperactivity rates do differ: 43% for Low Sugar and 59% for High Sugar. This difference appears large enough to suggest that a relationship does exist between sugar intake and activity level. This phenomenon is known as Simpson’s Paradox, which describes the apparent change in a relationship in a two-way table when groups are combined. In this hypothetical example, boys tended to consume more sugar than girls and also tended to be more hyperactive than girls, which produces the apparent relationship in the combined table. The confounding variable, gender, should be controlled for by studying boys and girls separately rather than being ignored by combining the groups.
By definition, a confounding variable is a variable that, when combined with another variable, produces mixed effects compared to when each is analyzed separately. By contrast, a lurking variable is a variable that is not included in the study but has the potential to confound. For instance, suppose only the combined data in the previous example had been analyzed, and only afterward did a researcher think to consider a variable such as gender. At that point gender would be a lurking variable, because it had not been measured or analyzed.
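The sugar and hyperactivity example can be reproduced numerically with a small Python sketch. The tables are not shown here, so the counts below are assumptions: illustrative numbers chosen only so that they give the percentages quoted in the text (67%, 25%, 43%, 59%). They are not the original hypothetical data.

```python
# Assumed (hyperactive, total) counts for each gender and sugar level.
# These are illustrative values picked to match the quoted percentages.
data = {
    "Boys":  {"Low Sugar": (20, 30), "High Sugar": (60, 90)},
    "Girls": {"Low Sugar": (10, 40), "High Sugar": (5, 20)},
}
levels = ["Low Sugar", "High Sugar"]


def rate(hyper, total):
    """Hyperactivity rate as a whole-number percentage."""
    return round(100 * hyper / total)


# Rates within each gender: identical across sugar levels.
within = {g: {lv: rate(*data[g][lv]) for lv in levels} for g in data}

# Rates after combining boys and girls: they now differ by sugar level.
combined = {}
for lv in levels:
    hyper = sum(data[g][lv][0] for g in data)
    total = sum(data[g][lv][1] for g in data)
    combined[lv] = rate(hyper, total)

print(within)    # Boys 67 at both levels, Girls 25 at both levels
print(combined)  # Low Sugar 43, High Sugar 59
```

In these assumed counts, 90 of the 120 boys are in the High Sugar group but only 20 of the 60 girls are, so the High Sugar group is dominated by the gender with the higher hyperactivity rate. That imbalance alone creates the difference in the combined rates.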