Categorical variables are variables that represent groupings of some kind, such as gender, race, education level, or eye color. They are sometimes recorded as numbers, but the numbers represent categories rather than actual amounts of things. For example, if we assign 1 to male and 2 to female, the number 1 does not mean that male is less than female in any way.

There are three types of categorical variables: binary, nominal, and ordinal variables⁴. Binary variables are categorical variables that can take on exactly two values, such as yes/no, true/false, or heads/tails. Nominal variables are categorical variables that can take on more than two values, but the values have no inherent order or ranking, such as colors, names, or countries. Ordinal variables are categorical variables that can take on more than two values, and the values have a meaningful order or ranking, such as grades, ratings, or stages.

When we want to summarize or analyze two categorical variables together, we can use a type of frequency table called a **contingency table**¹². A contingency table shows the frequency of each category in one variable, contingent upon the specific level of the other variable. That is, each combination of levels from the two categorical variables is presented in its own cell of the table. For example, if we want to summarize the relationship between gender and eye color, we can use a contingency table like this:

| Gender \ Eye Color | Blue | Brown | Green | Total |
|--------------------|------|-------|-------|-------|
| Male               | 10   | 15    | 5     | 30    |
| Female             | 12   | 18    | 6     | 36    |
| Total              | 22   | 33    | 11    | 66    |

This table shows us how many males and females have each eye color in our sample of 66 people. The counts in the cells where the categories cross are called observed frequencies. For example, cell (1,2), meaning row 1 and column 2, shows us that there are 15 males with brown eyes in our sample. The cells at the margins of the table show us the total frequencies for each variable. For example, cell (2,4) shows us that there are 36 females in our sample.
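The marginal totals can be recomputed from the observed frequencies alone. The following is a minimal sketch in plain Python; the nested-dictionary layout and the variable names (`observed`, `row_totals`, `col_totals`, `n`) are illustrative choices, not part of the original text.

```python
# Observed frequencies from the table: outer keys are gender, inner keys are eye color.
observed = {
    "Male":   {"Blue": 10, "Brown": 15, "Green": 5},
    "Female": {"Blue": 12, "Brown": 18, "Green": 6},
}
colors = ["Blue", "Brown", "Green"]

# Row margins: one total per gender.
row_totals = {gender: sum(counts.values()) for gender, counts in observed.items()}

# Column margins: one total per eye color.
col_totals = {color: sum(observed[gender][color] for gender in observed) for color in colors}

# Grand total: the sample size.
n = sum(row_totals.values())

print(row_totals)  # {'Male': 30, 'Female': 36}
print(col_totals)  # {'Blue': 22, 'Brown': 33, 'Green': 11}
print(n)           # 66
```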

A contingency table is useful for summarizing two potentially related categorical variables because it allows us to see how the frequencies of one variable vary across the levels of another variable. For example, from the table above, we can see that more females than males have blue eyes in our sample (12 versus 10). Because the sample also contains more females overall, raw counts can mislead, so it is more informative to compare the relative frequencies or proportions of each category within each level of the other variable. For example, among males, brown eyes are the most common (15/30 = 0.5), and among females, brown eyes are also the most common (18/36 = 0.5). Likewise, blue eyes make up the same share of both groups (10/30 = 12/36 ≈ 0.333).
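The within-group proportions can be computed by dividing each observed frequency by its row total. This is a small sketch in plain Python; the name `row_props` is an illustrative assumption. For this particular sample, the proportions come out identical for males and females.

```python
# Observed frequencies from the table: outer keys are gender, inner keys are eye color.
observed = {
    "Male":   {"Blue": 10, "Brown": 15, "Green": 5},
    "Female": {"Blue": 12, "Brown": 18, "Green": 6},
}

# Divide each count by its row total to get the proportion within each gender.
row_props = {
    gender: {color: count / sum(counts.values()) for color, count in counts.items()}
    for gender, counts in observed.items()
}

for gender, props in row_props.items():
    print(gender, {color: round(p, 3) for color, p in props.items()})
# Male {'Blue': 0.333, 'Brown': 0.5, 'Green': 0.167}
# Female {'Blue': 0.333, 'Brown': 0.5, 'Green': 0.167}
```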

However, a contingency table on its own does not tell us whether the two categorical variables are statistically independent or dependent. That is, it does not tell us whether any association we see is larger than chance alone would produce. To test the null hypothesis that the two variables are independent, we use a statistical test called the **chi-square test for independence**. This test compares the observed frequencies in each cell of the contingency table with the expected frequencies under the assumption of independence. The expected frequency of a cell is calculated from the marginal totals and the sample size: (row total × column total) / sample size. For example, if gender and eye color are independent, then we would expect 30/66 ≈ 0.455 of all people with blue eyes to be male. Therefore, in cell (1,1), we would expect 30 × 22 / 66 = 10 males with blue eyes, which is exactly the observed count in this sample.
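The expected frequencies for every cell can be sketched as follows, using the rule (row total × column total) / sample size. The list-of-lists layout and variable names are illustrative assumptions. For this sample, every expected frequency equals the corresponding observed frequency.

```python
# Observed frequencies; columns are Blue, Brown, Green.
observed = [
    [10, 15, 5],   # Male
    [12, 18, 6],   # Female
]

row_totals = [sum(row) for row in observed]        # [30, 36]
col_totals = [sum(col) for col in zip(*observed)]  # [22, 33, 11]
n = sum(row_totals)                                # 66

# Expected count under independence: row total * column total / sample size.
expected = [[r * c / n for c in col_totals] for r in row_totals]

print(expected)  # [[10.0, 15.0, 5.0], [12.0, 18.0, 6.0]]
```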

The chi-square test for independence calculates a test statistic called chi-square (χ²) that measures how much the observed frequencies deviate from the expected frequencies. It is the sum, over all cells, of (observed − expected)² / expected. The larger the chi-square value, the more evidence there is against independence. The chi-square value is then compared with a critical value from a probability distribution called the chi-square distribution with a certain number of degrees of freedom (df). The degrees of freedom depend on the number of rows and columns in the contingency table, not counting the total row and column: df = (rows − 1) × (columns − 1). For example, if we have a 2 × 3 contingency table (two rows and three columns), then df = (2 − 1) × (3 − 1) = 2.
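As a rough sketch, the statistic and its degrees of freedom can be computed in plain Python by summing (observed − expected)² / expected over all cells. The function name `chi_square_statistic` is hypothetical and introduced only for illustration. For the gender and eye color table, the observed and expected frequencies coincide in every cell, so the statistic is 0.

```python
def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts.

    `observed` is a list of rows, without the marginal totals.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence for cell (i, j).
            exp = row_totals[i] * col_totals[j] / n
            chi_square += (obs - exp) ** 2 / exp

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi_square, df


# Gender (rows) by eye color (columns: Blue, Brown, Green).
table = [
    [10, 15, 5],
    [12, 18, 6],
]
chi_square, df = chi_square_statistic(table)
print(chi_square, df)  # 0.0 2
```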

If the chi-square value is greater than or equal to the critical value at a given significance level (usually α = 0.05), then we reject the null hypothesis of independence and conclude that there is a significant association between the two categorical variables. If the chi-square value is less than the critical value, then we fail to reject the null hypothesis of independence: the data do not provide sufficient evidence of an association between the two categorical variables. This is not the same as showing that the variables are independent.
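The decision rule can be sketched for the 2 × 3 example, where df = 2. This sketch relies on a shortcut that holds only for df = 2: the upper-tail probability of the chi-square distribution is exp(−x / 2), so the critical value is −2 · ln(α). For other degrees of freedom, a statistics library would be needed instead (for instance `scipy.stats.chi2`, which is not used here). The function name `decide_df2` is hypothetical.

```python
import math


def decide_df2(chi_square, alpha=0.05):
    """Apply the chi-square decision rule. Valid ONLY for df = 2."""
    # For df = 2, P(X >= x) = exp(-x / 2), so the critical value is -2 * ln(alpha).
    critical_value = -2 * math.log(alpha)
    p_value = math.exp(-chi_square / 2)
    reject = chi_square >= critical_value
    return critical_value, p_value, reject


# For the gender and eye color table, observed equals expected in every cell,
# so the chi-square statistic is 0.
critical_value, p_value, reject = decide_df2(0.0)

print(round(critical_value, 3))  # 5.991
print(p_value)                   # 1.0
print(reject)                    # False
```

With a statistic of 0, the test fails to reject the null hypothesis of independence at α = 0.05.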

In summary, a contingency table is used to summarize two potentially related categorical variables, while a chi-square test for independence is used to test whether the two categorical variables are statistically independent or dependent. These are useful tools for exploring and analyzing categorical data in research and statistics.