Chi-square Test of Independence

Test of Independence for Categorical Variables

In this section, we will discuss how we can assess whether there is an association between two categorical variables.

Contingency Table

Suppose we have two categorical variables, #A# and #B#, and variable #A# has #I# categories while variable #B# has #J# categories.

Given a random sample, we can form a two-way frequency table, also called a contingency table, with #I# rows and #J# columns.

In cell #(i,j)# of the table we record the frequency #X_{ij}# of subjects in the sample who fall into the category #i# of variable #A# and category #j# of variable #B#.

We also find the marginal totals for both the individual rows and the individual columns of the table. Let #X_i# denote the total for row #i#, for #i = 1,...,I#, and let #X_j# denote the total for column #j#, for #j = 1,...,J#.

Finally, let #n# denote the grand total, which is the sample size.

The contingency table below summarizes the sample data of #200# individuals whose blood was tested in order to determine their blood group and rhesus type:

	A	B	AB	O	Total
Rhesus #+#	#68#	#18#	#6#	#76#	#\blue{168}#
Rhesus #-#	#12#	#4#	#2#	#14#	#\blue{32}#
Total	#\blue{80}#	#\blue{22}#	#\blue{8}#	#\blue{90}#	#\orange{200}#

The row and column totals of a contingency table are located in the margins (edges) of the table and are therefore referred to as #\blue{\textbf{marginal totals}}#.

The total number of observations used to construct a contingency table is called the #\orange{\textbf{grand total}}# and is found in the bottom-right corner of the table.
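As a minimal sketch, the blood group table above can be entered in #\mathrm{R}# and its marginal totals and grand total recomputed. The object names used here are illustrative choices, not part of the original text.

```r
# Observed counts from the blood group table above (margins excluded).
blood <- matrix(c(68, 18, 6, 76,
                  12,  4, 2, 14),
                nrow = 2, byrow = TRUE,
                dimnames = list(Rhesus = c("+", "-"),
                                Group  = c("A", "B", "AB", "O")))

row_totals  <- rowSums(blood)   # marginal totals for the rows
col_totals  <- colSums(blood)   # marginal totals for the columns
grand_total <- sum(blood)       # grand total n (the sample size)

# addmargins() appends the marginal totals and the grand total to the table.
addmargins(as.table(blood))
```

The row totals, column totals, and grand total computed this way should match the blue and orange entries in the table.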

A researcher might want to ask the following question about the population distribution for the two variables:

Are the variables #A# and #B# independent of each other within the population (i.e., for any member of the population, does knowing the category for one of the variables provide any clue about the category of the other variable)?

This is equivalent to asking: are the proportions of the population the same for each category of variable #A# across all categories of variable #B# (or vice versa)? This is called homogeneity.

For example, suppose we have two categorical variables, Major (science, social science, humanities) and Blood Type (O, A, B, AB).

Our intuition is that these two variables should not be associated. So the probability that a randomly-selected member of the population has blood type AB does not change if we know that the person is a science major.

Likewise, the probability that a randomly-selected member of the population is a science major does not change if we know that the person has blood type AB.

In this example, homogeneity would mean that, if, among science majors, #40\%# are type O, #25\%# are type A, #20\%# are type B, and #15\%# are type AB, then this exact same distribution of blood types should also be true for social science majors and for humanities majors.

Likewise, if, among students with blood type B, #35\%# are science majors, #40\%# are social science majors, and #25\%# are humanities majors, then this exact same distribution of majors should also be true for each of the other three blood types.
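To make the idea of homogeneity concrete, the sketch below computes the conditional distributions for the blood group table given earlier in this section (not for the hypothetical percentages above). Under homogeneity, the rows of the first table would be roughly alike, up to sampling variation, and likewise the columns of the second. The object names are illustrative.

```r
# Observed counts from the blood group table (margins excluded).
blood <- matrix(c(68, 18, 6, 76,
                  12,  4, 2, 14),
                nrow = 2, byrow = TRUE,
                dimnames = list(Rhesus = c("+", "-"),
                                Group  = c("A", "B", "AB", "O")))

# Distribution of blood groups within each rhesus type (each row sums to 1).
row_props <- prop.table(blood, margin = 1)
round(row_props, 3)

# Distribution of rhesus types within each blood group (each column sums to 1).
col_props <- prop.table(blood, margin = 2)
round(col_props, 3)
```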

Research Question and Hypotheses

The research question of a chi-square test for independence is whether or not two categorical variables are associated.

For this research question, we test

#H_0 :# variables #A# and #B# are independent (not associated)

against

#H_1 :# variables #A# and #B# are dependent (associated)

When #H_0# is true, the two categorical variables are independent, which means that the joint probability that a randomly-selected case will be in category #i# of variable #A# and category #j# of variable #B# is the product of the marginal probabilities, that is:
\[P(A=i \cap B=j) = P(A=i)P(B=j)\]

Or, more succinctly:
\[p_{ij}=p_ip_j\]

Hence, if the two variables are independent, the expected frequency #E_{ij}# in cell #(i,j)# of the contingency table for a random sample of #n# cases is \[E_{ij} = np_{ij} = np_ip_j\]

Since, in practice, we do not know #p_i# or #p_j#, we estimate them from the data: for #p_i# we use #\cfrac{X_i}{n}#, where #X_i# is the total for row #i# of the table, and for #p_j# we use #\cfrac{X_j}{n}#, where #X_j# is the total for column #j# of the table. Substituting these estimates into #np_ip_j# gives #n \cdot \cfrac{X_i}{n} \cdot \cfrac{X_j}{n} = \cfrac{X_iX_j}{n}#.

Expected Frequencies, Test Statistic, and Null Distribution

The expected frequency in cell #(i,j)# is calculated as follows:\[E_{ij}=\cfrac{X_iX_j}{n}\]

The test statistic is: \[X^2 = \sum^I_{i=1}\sum^J_{j=1}\cfrac{(X_{ij} - E_{ij})^2}{E_{ij}}\]

This means that #\cfrac{(X_{ij} - E_{ij})^2}{E_{ij}}# is computed for each of the #IJ# cells of the frequency table, then added together.
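The two formulas above can be sketched in #\mathrm{R}# as follows. The helper functions defined here are hypothetical names introduced for illustration, and the blood group table from earlier in this section is used only to show the arithmetic.

```r
# Expected frequencies under independence: E_ij = (row total i) * (column total j) / n.
expected_freq <- function(obs) {
  outer(rowSums(obs), colSums(obs)) / sum(obs)
}

# Chi-square statistic: sum over all cells of (X_ij - E_ij)^2 / E_ij.
chisq_stat <- function(obs) {
  E <- expected_freq(obs)
  sum((obs - E)^2 / E)
}

# Blood group counts (rows: Rhesus +, Rhesus -; columns: A, B, AB, O).
blood <- matrix(c(68, 18, 6, 76,
                  12,  4, 2, 14),
                nrow = 2, byrow = TRUE)

E  <- expected_freq(blood)   # matrix of expected frequencies
x2 <- chisq_stat(blood)      # observed value of the test statistic
```

Here #\mathtt{outer()}# forms every product of a row total with a column total in one step, so no explicit loop over the #IJ# cells is needed.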

If #E_{ij} \geq 5# for #i = 1,...,I# and #j = 1,...,J#, then, when #H_0# is true, #X^2# has an approximate #\chi^2_{(I-1)(J-1)}# distribution, i.e., the chi-square distribution with #(I-1)(J-1)# degrees of freedom.

If #E_{ij} < 5# for any cell, then categories can be combined until this condition is met for every cell.
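As an illustration of this check, the sketch below applies it to the blood group table from earlier in this section. The choice to merge the B and AB columns is an assumption made purely for illustration; in practice, which categories to combine depends on what is meaningful for the study.

```r
blood <- matrix(c(68, 18, 6, 76,
                  12,  4, 2, 14),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Rh+", "Rh-"), c("A", "B", "AB", "O")))

# Expected frequencies under independence.
expected <- outer(rowSums(blood), colSums(blood)) / sum(blood)

small <- expected < 5   # TRUE in every cell where the condition fails
any(small)              # TRUE: the Rh- cells for B and AB are below 5

# Merge the B and AB columns into one category (an illustrative choice).
merged <- cbind(blood[, c("A", "O")], "B/AB" = blood[, "B"] + blood[, "AB"])
expected_merged <- outer(rowSums(merged), colSums(merged)) / sum(merged)
min(expected_merged)    # 4.8, still just below 5
```

In this case one merge is not quite enough, which shows why the condition has to be re-checked after each combination.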

Note that the test of #H_0 : p_1 - p_2 = 0# against #H_1 : p_1 - p_2 \neq 0# is a simplified version of this test, with #I = 2# and #J = 2#.
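This connection can be checked numerically. The sketch below uses hypothetical counts (not taken from the text) and switches off the continuity correction that both #\mathtt{prop.test()}# and #\mathtt{chisq.test()}# apply by default to a #2 \times 2# table, so that both compute the statistic defined above.

```r
# Hypothetical data: number of successes out of 100 trials in each of two groups.
successes <- c(30, 45)
trials    <- c(100, 100)

# Two-sample test of H0: p1 - p2 = 0, without continuity correction.
two_prop <- prop.test(successes, trials, correct = FALSE)

# The same data as a 2 x 2 contingency table (rows: groups; columns: success, failure).
tab   <- cbind(success = successes, failure = trials - successes)
indep <- chisq.test(tab, correct = FALSE)

# Both calls report the same chi-square statistic and the same P-value.
c(prop.test = unname(two_prop$statistic), chisq.test = unname(indep$statistic))
c(prop.test = two_prop$p.value, chisq.test = indep$p.value)
```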

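Calculating the P-value

Given an observed value #x^2# for the test statistic, the P-value is #P(X^2 \geq x^2)#, which is computed in #\mathrm{R}# using:

> pchisq(x2, (I-1)*(J-1), lower.tail=FALSE)

where #\mathtt{x2}# holds the observed value of the test statistic, and #\mathtt{I}# and #\mathtt{J}# hold the numbers of rows and columns of the table.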

If a significance level #\alpha# has been chosen, then #H_0# is rejected if the computed P-value is #\leq \alpha#, and #H_0# is otherwise not rejected.

For example, suppose we wish to know if the distribution of majors (among three possible majors) is the same across meat-eaters and vegetarians/vegans within the population of students.

We wish to test:

#H_0 :# major and dietary preferences are independent

against

#H_1 :# there is an association between major and dietary preference

We collect a random sample of size #65#, and obtain the following contingency table:

	Meat	Veggie	Total
Major 1	#17#	#7#	#24#
Major 2	#14#	#9#	#23#
Major 3	#13#	#5#	#18#
Total	#44#	#21#	#65#

The expected frequencies #E_{ij}# under #H_0# are found using: \[E_{ij} = \cfrac{X_iX_j}{n}\]

For example, in cell #(1,1)#: \[E_{1,1} = \cfrac{(24)(44)}{65} \approx 16.246\]

We get the following expected frequencies (rounded to #3# decimal places):

	Meat	Veggie
Major 1	#16.246#	#7.754#
Major 2	#15.569#	#7.431#
Major 3	#12.185#	#5.815#

Note that #E_{ij} > 5# for all cells.

The test statistic is calculated as follows: \[\begin{array}{rcl}
x^2&=&\cfrac{(17 - 16.246)^2}{16.246}+\cfrac{(7 - 7.754)^2}{7.754}+\cfrac{(14 - 15.569)^2}{15.569}+\cfrac{(9 - 7.431)^2}{7.431}+ \cfrac{(13 - 12.185)^2}{12.185}\\
&&+\,\cfrac{(5 - 5.815)^2}{5.815}\\
&\approx& 0.7667\end{array}\]

which has #(3 - 1)(2 - 1) = 2# degrees of freedom.
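The hand calculation above can be reproduced in #\mathrm{R}#. This is a minimal sketch using the counts from the contingency table of this example; the object names are illustrative.

```r
# Observed counts, entered column by column (Meat first, then Veggie).
M <- matrix(c(17, 14, 13, 7, 9, 5), nrow = 3,
            dimnames = list(c("Major 1", "Major 2", "Major 3"),
                            c("Meat", "Veggie")))

# Expected frequencies under H_0: (row total) * (column total) / n.
E <- outer(rowSums(M), colSums(M)) / sum(M)
round(E, 3)

# Contribution of each cell to the test statistic, then their sum.
contributions <- (M - E)^2 / E
x2 <- sum(contributions)
round(x2, 4)                             # 0.7667

# Degrees of freedom: (I - 1) * (J - 1).
dof <- (nrow(M) - 1) * (ncol(M) - 1)     # 2
```

Inspecting the per-cell contributions shows which cells depart most from their expected frequencies.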

The P-value is computed in #\mathrm{R}# using

> pchisq(0.7667, 2, lower.tail=FALSE)

to be #0.6816#, which is quite large.

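So we do not reject #H_0#: the data do not provide sufficient evidence of an association between major and dietary preference. (Failing to reject #H_0# does not prove that the two variables are independent.)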
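The decision rule for this example can be sketched as follows. The significance level #\alpha = 0.05# is an assumption made for illustration, since the example does not fix one; the comparison with a critical value is an equivalent way of stating the same rule.

```r
x2    <- 0.7667   # observed test statistic from the example
dof   <- 2        # (3 - 1) * (2 - 1) degrees of freedom
alpha <- 0.05     # assumed significance level (not specified in the text)

# P-value: upper-tail probability of the chi-square distribution.
p_value <- pchisq(x2, dof, lower.tail = FALSE)
round(p_value, 4)               # 0.6816

# Reject H_0 only if the P-value is at most alpha.
reject_h0 <- p_value <= alpha   # FALSE

# Equivalent form: reject only if x2 is at least the upper-alpha quantile.
critical <- qchisq(1 - alpha, dof)
x2 >= critical                  # FALSE
```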

Using R

To implement the chi-square test for independence, the contingency table (without the marginal frequencies) must be created in #\mathrm{R}# as a matrix.

For example, in the previous example the contingency table was:

	Meat	Veggie
Major 1	#17#	#7#
Major 2	#14#	#9#
Major 3	#13#	#5#

In #\mathrm{R}#, we would create a matrix M containing these data:

> M = matrix(c(17,14,13,7,9,5),nrow=3)

Note that the matrix data are entered into a vector by column, i.e., first the data from column #1#, then the data from column #2#. You must also inform #\mathrm{R}# of either the number of rows #\mathtt{(nrow=3)}# or the number of columns #\mathtt{(ncol=2)}#.

This matrix is then the input to the #\mathtt{chisq.test()}# function:

> chisq.test(M)

which gives the chi-square test statistic, the degrees of freedom, and the P-value in its output. It will also give a warning if any of the expected frequencies are less than #5#.

To see the expected frequencies, use:

> chisq.test(M)$expected
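The other parts of the output can be extracted in the same way. The sketch below stores the result of #\mathtt{chisq.test()}# in an object and reads off its components; the row and column labels are optional and are added here only to make the printed tables easier to read.

```r
# Observed counts, entered column by column, with optional labels.
M <- matrix(c(17, 14, 13, 7, 9, 5), nrow = 3,
            dimnames = list(Major = c("Major 1", "Major 2", "Major 3"),
                            Diet  = c("Meat", "Veggie")))

result <- chisq.test(M)

result$statistic   # chi-square test statistic
result$parameter   # degrees of freedom
result$p.value     # P-value
result$observed    # the observed counts
result$expected    # expected frequencies under H_0
```

For this table the values agree with the hand calculation in the example above.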