Chi-square Test of Independence
Test of Independence for Categorical Variables
In this section, we discuss how to assess whether there is an association between two categorical variables.
Contingency Table
Suppose we have two categorical variables, $X$ and $Y$, where variable $X$ has $I$ categories and variable $Y$ has $J$ categories.
Given a random sample, we can form a two-way frequency table, also called a contingency table, with $I$ rows and $J$ columns.
In cell $(i,j)$ of the table we record the frequency $O_{ij}$ of subjects in the sample who fall into category $i$ of variable $X$ and category $j$ of variable $Y$.
We also find the marginal totals for both the individual rows and the individual columns of the table. Let $R_i$ denote the total for row $i$, for $i = 1, \ldots, I$, and let $C_j$ denote the total for column $j$, for $j = 1, \ldots, J$.
Finally, let $n$ denote the grand total, which is the sample size.
The contingency table below summarizes the sample data of individuals whose blood was tested in order to determine their blood group and rhesus type:
A | B | AB | O | Total | |
Rhesus | |||||
Rhesus | |||||
Total |
The row and column totals of a contingency table are located in the margins (edges) of the table and are therefore referred to as marginal totals.
The total number of observations used to construct a contingency table is called the grand total and is found in the bottom-right corner of the table.
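As a minimal sketch of these definitions in R, the block below builds a contingency table and computes its marginal totals and grand total. The counts are the ones from the major/diet example later in this section; the object names (`counts`, `row_totals`, and so on) are illustrative choices, not part of the original text.

```r
# Counts from the major/diet example later in this section (entered by column).
counts <- matrix(c(17, 14, 13, 7, 9, 5), nrow = 3,
                 dimnames = list(Major = c("Major 1", "Major 2", "Major 3"),
                                 Diet  = c("Meat", "Veggie")))

row_totals  <- rowSums(counts)  # marginal totals for the rows
col_totals  <- colSums(counts)  # marginal totals for the columns
grand_total <- sum(counts)      # grand total = sample size

# addmargins() appends a "Sum" row and a "Sum" column to the table.
with_margins <- addmargins(as.table(counts))
print(with_margins)
```

The bottom-right entry of `with_margins` is the grand total.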
A researcher might want to ask the following question about the population distribution for the two variables:
Are the variables $X$ and $Y$ independent of each other within the population (i.e., for any member of the population, does knowing the category for one of the variables provide any clue about the category of the other variable)?
This is equivalent to asking: are the population proportions for each category of variable $X$ the same across all categories of variable $Y$ (or vice versa)? This is called homogeneity.
For example, suppose we have two categorical variables, Major (science, social science, humanities) and Blood Type (O, A, B, AB).
Our intuition is that these two variables should not be associated. So the probability that a randomly-selected member of the population has blood type AB does not change if we know that the person is a science major.
Likewise, the probability that a randomly-selected member of the population is a science major does not change if we know that the person has blood type AB.
In this example, homogeneity would mean that whatever the distribution of blood types (O, A, B, AB) is among science majors, this exact same distribution of blood types should also hold for social science majors and for humanities majors.
Likewise, whatever the distribution of majors is among students with blood type B, this exact same distribution of majors should also hold for each of the other three blood types.
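One informal way to look at homogeneity in a sample is to compare the conditional distributions row by row. The sketch below does this with `prop.table`, reusing the major/diet counts from the worked example later in this section; the object names are illustrative.

```r
# Counts from the major/diet example later in this section.
counts <- matrix(c(17, 14, 13, 7, 9, 5), nrow = 3,
                 dimnames = list(Major = c("Major 1", "Major 2", "Major 3"),
                                 Diet  = c("Meat", "Veggie")))

# Distribution of diet within each major: every row sums to 1.
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))

# Distribution of major within each diet: every column sums to 1.
col_props <- prop.table(counts, margin = 2)
print(round(col_props, 3))
```

If the variables were independent in the population, the rows of `row_props` would differ only by sampling variation. The chi-square test formalizes how much variation is too much.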
Research Question and Hypotheses
The research question of a chi-square test for independence is whether or not two categorical variables are associated.
For this research question, we test
$H_0$: variables $X$ and $Y$ are independent (not associated)
against
$H_a$: variables $X$ and $Y$ are dependent (associated)
When $H_0$ is true, the two categorical variables are independent, which means that the joint probability that a randomly-selected case will be in category $i$ of factor $X$ and category $j$ of factor $Y$ is the product of the marginal probabilities, that is:
$$P(X = i \text{ and } Y = j) = P(X = i) \cdot P(Y = j)$$
Or, more succinctly:
$$p_{ij} = p_{i\cdot} \, p_{\cdot j}$$
Hence, in a random sample of $n$ cases, the expected frequency in cell $(i,j)$ of the contingency table is
$$n \, p_{i\cdot} \, p_{\cdot j}$$
if the two factors are independent.
Since in practice we do not know $p_{i\cdot}$ or $p_{\cdot j}$, we estimate them from the data: for $p_{i\cdot}$ we use $R_i/n$, where $R_i$ is the total for row $i$ of the table, and for $p_{\cdot j}$ we use $C_j/n$, where $C_j$ is the total for column $j$ of the table.
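Expected Frequencies, Test Statistic, and Null Distribution
The expected frequency in cell $(i,j)$ is calculated as follows:
$$E_{ij} = n \cdot \frac{R_i}{n} \cdot \frac{C_j}{n} = \frac{R_i \, C_j}{n}$$
The test statistic is:
$$X^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where $O_{ij}$ is the observed frequency in cell $(i,j)$.
This means that $(O_{ij} - E_{ij})^2 / E_{ij}$ is computed for each of the $I \cdot J$ cells of the frequency table, then added together.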
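The calculation can be sketched by hand in R: expected frequencies are row total times column total divided by the grand total, and the statistic sums the scaled squared differences over all cells. The matrix `O` holds the counts from the worked example later in this section; the names `O`, `E`, `X2` and `df` are illustrative.

```r
# Observed counts (from the worked example later in this section).
O <- matrix(c(17, 14, 13, 7, 9, 5), nrow = 3)

n <- sum(O)                              # grand total
E <- outer(rowSums(O), colSums(O)) / n   # E[i, j] = (row total i) * (column total j) / n

# Sum of (observed - expected)^2 / expected over all cells.
X2 <- sum((O - E)^2 / E)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(O) - 1) * (ncol(O) - 1)

print(E)
print(X2)
print(df)
```

`outer` forms every product of a row total with a column total in one step, so no explicit loop over cells is needed.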
If $E_{ij} \geq 5$ for all $i = 1, \ldots, I$ and $j = 1, \ldots, J$, then it has been proven that $X^2$ has an approximate $\chi^2_{(I-1)(J-1)}$ distribution, i.e., the chi-square distribution with $(I-1)(J-1)$ degrees of freedom.
If $E_{ij} < 5$ for any cell, then categories can be combined until this condition is met for every cell.
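As an illustration of combining categories, the sketch below uses a made-up 4-by-2 table (the counts and the labels A to D, Yes/No are hypothetical, not from the text) in which two sparse rows are merged so that every expected frequency reaches 5. The helper `expected_freq` is also an illustrative name.

```r
# Hypothetical counts: rows C and D are sparse.
sparse <- matrix(c(20, 15, 3, 2, 18, 17, 4, 1), nrow = 4,
                 dimnames = list(c("A", "B", "C", "D"), c("Yes", "No")))

# Expected frequencies under independence.
expected_freq <- function(tab) outer(rowSums(tab), colSums(tab)) / sum(tab)

print(expected_freq(sparse))   # rows C and D fall below 5

# Merge rows C and D into a single category by adding their counts.
merged <- rbind(sparse[c("A", "B"), ],
                "C or D" = colSums(sparse[c("C", "D"), ]))

print(expected_freq(merged))   # all cells are now at least 5
```

Merging only makes sense when the combined category is still meaningful for the research question, and it lowers the degrees of freedom because the table has fewer rows.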
Note that the test of $H_0: p_1 = p_2$ against $H_a: p_1 \neq p_2$ is a simplified version of this test, with $I = 2$ and $J = 2$.
Calculating the P-value
Given an observed value $x^2$ for the test statistic, the P-value is $P(X^2 \geq x^2)$, which is computed in R using:
> pchisq(x^2, (I-1)*(J-1), lower.tail=FALSE)
If a significance level $\alpha$ has been chosen, then $H_0$ is rejected if the computed P-value is at most $\alpha$, and is otherwise not rejected.
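A minimal sketch of the decision rule, assuming a significance level of 0.05 (an illustrative choice, not one fixed by the text) and the statistic and table dimensions from the worked example that follows. The critical-value comparison is an equivalent way of reaching the same decision.

```r
x2    <- 0.7667   # observed test statistic (from the worked example below)
I     <- 3        # number of rows
J     <- 2        # number of columns
alpha <- 0.05     # assumed significance level

df      <- (I - 1) * (J - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)   # upper-tail probability

# Decision via the P-value.
reject_by_p <- p_value <= alpha

# Equivalent decision via the critical value of the chi-square distribution.
critical       <- qchisq(1 - alpha, df = df)
reject_by_crit <- x2 >= critical

print(p_value)
print(reject_by_p)
```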
For example, suppose we wish to know if the distribution of majors (among three possible majors) is the same across meat-eaters and vegetarians/vegans within the population of students.
We wish to test:
$H_0$: major and dietary preference are independent
against
$H_a$: there is an association between major and dietary preference
We collect a random sample of size $n = 65$, and obtain the following contingency table:
 | Meat | Veggie | Total
Major 1 | 17 | 7 | 24
Major 2 | 14 | 9 | 23
Major 3 | 13 | 5 | 18
Total | 44 | 21 | 65
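The expected frequencies under $H_0$ are found using:
$$E_{ij} = \frac{R_i \, C_j}{n}$$
For example, in cell $(1,1)$:
$$E_{11} = \frac{24 \cdot 44}{65} \approx 16.25$$
We get the following expected frequencies (rounded to two decimal places):
 | Meat | Veggie
Major 1 | 16.25 | 7.75
Major 2 | 15.57 | 7.43
Major 3 | 12.18 | 5.82
Note that $E_{ij} \geq 5$ for all cells.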
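The test statistic is calculated as follows (using the unrounded expected frequencies):
$$X^2 = \frac{(17 - 16.25)^2}{16.25} + \frac{(7 - 7.75)^2}{7.75} + \frac{(14 - 15.57)^2}{15.57} + \frac{(9 - 7.43)^2}{7.43} + \frac{(13 - 12.18)^2}{12.18} + \frac{(5 - 5.82)^2}{5.82} \approx 0.7667$$
which has $(3-1)(2-1) = 2$ degrees of freedom.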
The P-value is computed in R using
> pchisq(0.7667, 2, lower.tail=FALSE)
to be approximately $0.6816$, which is quite large.
So we do not reject $H_0$: the sample does not provide sufficient evidence of an association between major and dietary preference.
Using R
To implement the chi-square test for independence, the contingency table (without the marginal frequencies) must be created in R as a matrix.
For example, in the previous example the contingency table was:
 | Meat | Veggie
Major 1 | 17 | 7
Major 2 | 14 | 9
Major 3 | 13 | 5
In R, we would create a matrix M containing these data:
> M = matrix(c(17,14,13,7,9,5),nrow=3)
Note that the matrix data are entered into a vector by column, i.e., first the data from column 1, then the data from column 2. You must also inform R of either the number of rows (nrow) or the number of columns (ncol).
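If it is more natural to type the table row by row, `matrix` accepts `byrow = TRUE`, and `dimnames` can label the rows and columns. The sketch below shows that both ways of entering the data give the same matrix; the names `M_col`, `M_row` and the labels are illustrative.

```r
# Column-wise entry, as in the text: column 1 first, then column 2.
M_col <- matrix(c(17, 14, 13, 7, 9, 5), nrow = 3)

# Row-wise entry of the same table, with optional row and column labels.
M_row <- matrix(c(17, 7,
                  14, 9,
                  13, 5),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("Major 1", "Major 2", "Major 3"),
                                c("Meat", "Veggie")))

# Both matrices hold the same counts in the same cells.
same_counts <- all(M_row == M_col)
print(M_row)
print(same_counts)
```

Labels do not affect the test itself, but they make the printed expected frequencies easier to read.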
This matrix is then the input to the chisq.test function:
> chisq.test(M)
which gives the chi-square test statistic, the degrees of freedom, and the P-value in its output. It will also give a warning if any of the expected frequencies are less than 5.
To see the expected frequencies, use:
> chisq.test(M)$expected
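The object returned by `chisq.test` stores its results as named components, so individual values can be extracted rather than read off the printed output. A short sketch, where `res` is an illustrative name:

```r
M   <- matrix(c(17, 14, 13, 7, 9, 5), nrow = 3)
res <- chisq.test(M)

res$statistic   # chi-square test statistic
res$parameter   # degrees of freedom
res$p.value     # P-value
res$observed    # observed counts
res$expected    # expected frequencies under independence
res$residuals   # Pearson residuals: (observed - expected) / sqrt(expected)

# The squared Pearson residuals add up to the test statistic.
sum(res$residuals^2)
```

For a 2-by-2 table, `chisq.test` applies a continuity correction by default (`correct = TRUE`), so its statistic will differ from the formula above unless `correct = FALSE` is passed. For larger tables such as this one, no correction is applied.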