### [A, SfS] Chapter 7: Chi-square tests and ANOVA: 7.4: Multiple Pairwise Comparisons

### Multiple Pairwise Comparisons

In this section, we will discuss how, given a continuous variable measured on #3# or more groups, we can determine whether the population means of this variable differ between pairs of groups, while controlling the probability of one or more Type I errors among the comparisons.

If we reject #H_0: \mu_1 = \cdots = \mu_I#, we then may want to know which pairs of population mean responses #\mu_i# and #\mu_j# are different. We can focus on one specific pair of means, or examine all the pairs at the same time. In the latter case, we use a *multiple comparisons* approach.

**Fisher's Least Significant Difference Method**

The following procedure for conducting inference on a single pairwise difference is called **Fisher's Least Significant Difference** (LSD) method.

Given two population mean responses #\mu_i# and #\mu_j# for two different treatment groups #i# and #j#, the pairwise difference #\mu_i - \mu_j# is estimated by #\bar{Y}_i - \bar{Y}_j#.

The estimated variance of this estimate is \[MSE \bigg(\cfrac{1}{n_i} + \cfrac{1}{n_j}\bigg)\] where #n_i# and #n_j# are the numbers of replicates in experimental treatment groups #i# and #j#, respectively, and #MSE# is the mean-square error from the ANOVA table.

The ratio \[\cfrac{(\bar{Y}_i - \bar{Y}_j) - (\mu_i - \mu_j)}{\sqrt{MSE(1/n_i + 1/n_j)}}\] has the #t_{N-I}# distribution if the ANOVA assumptions are valid.

Thus for any #\alpha# in #(0,1)#, a #(1 - \alpha)100\%# CI for #\mu_i - \mu_j# is: \[\Big(\bar{Y}_i - \bar{Y}_j - t_{N-I,\alpha/2}\sqrt{MSE(1/n_i + 1/n_j)},\,\,\,\bar{Y}_i - \bar{Y}_j + t_{N-I,\alpha/2}\sqrt{MSE(1/n_i + 1/n_j)}\Big)\]

If this CI does not contain #0#, then we have #(1 - \alpha)100\%# confidence that there is a difference between the two population means for the two groups.

We can test #H_0 : \mu_i - \mu_j = 0# against #H_1 : \mu_i - \mu_j \neq 0#, or either of the one-sided versions of this test, using the test statistic \[t = \cfrac{\bar{Y}_i - \bar{Y}_j}{\sqrt{MSE(1/n_i + 1/n_j)}}\] which has the #t_{N-I}# distribution under #H_0#.

The P-value is computed as #2P(T \geq |t|)# (or the appropriate one-sided version for a one-sided hypothesis), where #t# is the computed value of #T#.

For a given significance level #\alpha#, we would reject #H_0# and conclude that there is a difference between #\mu_i# and #\mu_j# if the P-value is no larger than #\alpha#. This is equivalent to the #(1 - \alpha)100\%# CI for #\mu_i - \mu_j# not containing zero.
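As an illustration, the LSD test and CI can be computed directly in #\mathrm{R}# from the ANOVA summary quantities. The values of #MSE#, the group means, and the group sizes below are made-up numbers for illustration only, not taken from any dataset in this section:

```r
# Fisher's LSD for a single pairwise comparison.
# All summary statistics here are hypothetical illustration values.
MSE <- 9.6                        # mean-square error from the ANOVA table
N <- 15; I <- 3                   # total sample size and number of groups
ybar_i <- 24.0; ybar_j <- 30.3    # sample means of groups i and j
n_i <- 5; n_j <- 5                # replicates per group
alpha <- 0.05

se <- sqrt(MSE * (1/n_i + 1/n_j))            # estimated standard error
t_stat <- (ybar_i - ybar_j) / se             # LSD test statistic
p_value <- 2 * pt(abs(t_stat), df = N - I, lower.tail = FALSE)
ci <- (ybar_i - ybar_j) +
  c(-1, 1) * qt(alpha/2, df = N - I, lower.tail = FALSE) * se
```

The CI excludes zero exactly when the two-sided P-value falls below #\alpha#.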

However, if we want to investigate all pairwise differences #\mu_i - \mu_j# among the #I# treatment group population means simultaneously, we must confront the problem of *multiple testing*.

We want to maintain an *overall confidence level* of #1 - \alpha# among the #C = \binom{I}{2} = \cfrac{I(I-1)}{2}# CIs, so that we can say that we have #(1 - \alpha)100\%# confidence that all #C# of the CIs contain the true values of the pairwise differences they are estimating.

Or we might want to preserve an overall probability of a Type I error among the #C# hypothesis tests at significance level #\alpha#, so that we can say that the probability that we have made one or more Type I errors among the #C# tests is no more than #\alpha#.

But if each of the #C# simultaneous CIs has a confidence level of #1 - \alpha#, then the overall confidence level among the #C# comparisons can be as low as #1 - C\alpha#. And if each of the #C# simultaneous hypothesis tests has a probability of a Type I error of #\alpha#, then the overall probability of at least one Type I error among the #C# tests can be as high as #C\alpha# (or #1#, whichever is smaller). There are many ways of adjusting Fisher's LSD method to address this problem.
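For intuition, if the #C# tests happened to be independent (an assumption made purely for this illustration), the overall Type I error probability with #\alpha = 0.05# and #C = 6# could be computed directly:

```r
# Overall probability of at least one Type I error among C tests,
# assuming (purely for illustration) that the tests are independent.
alpha <- 0.05
C <- 6
overall <- 1 - (1 - alpha)^C   # about 0.265 under independence
bound <- min(C * alpha, 1)     # the worst-case bound: 0.30
```

Even under independence the overall error rate is more than five times the per-test rate, which is why an adjustment is needed.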

Because of its simplicity, one popular approach is the *Bonferroni method* for multiple comparisons.

**Bonferroni Method for Multiple Comparisons**

The **Bonferroni method** uses Fisher's LSD formula to compute a CI for each of the #C# pairwise differences at overall confidence level #1 - \alpha#, but replaces #t_{N-I,\alpha/2}# with #t_{N-I,\alpha/(2C)}#, i.e., this method computes a #(1 - \alpha/C)100\%# CI for each pairwise difference.

Likewise, the Bonferroni method uses Fisher's LSD method to conduct a hypothesis test for each of the #C# pairwise differences at overall significance level #\alpha#, but either each of the #C# P-values is multiplied by #C# (and capped at #1#), or a significance level of #\cfrac{\alpha}{C}# is used for each of the #C# tests.

This method is very conservative, however, so it is not advisable when #I# is larger than #5#: with that many comparisons, you lose too much power.

For example, if we want an overall confidence level of #94\%# (i.e., #\alpha = 0.06#) and we have #4# groups, then we have #C = \binom{4}{2} = 6# pairwise comparisons, so we construct #6# CIs for the #6# pairwise differences, each at a confidence level of #\Big(1 - \cfrac{0.06}{6}\Big)100\% = 99\%#.

Or, if we want an overall Type I error probability of #\alpha = 0.06# among #6# tests for #6# pairwise differences, we would use a significance level of #\cfrac{0.06}{6} = 0.01# for each test, or, equivalently, we would keep a significance level at #0.06# for each test but multiply each P-value by #6#.
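The arithmetic of this example is easy to reproduce in #\mathrm{R}#:

```r
# Bonferroni adjustment for I = 4 groups at overall alpha = 0.06
I <- 4
alpha <- 0.06
C <- choose(I, 2)             # number of pairwise comparisons: 6
conf_each <- 1 - alpha / C    # confidence level per interval: 0.99
alpha_each <- alpha / C       # significance level per test: 0.01
```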

The *Tukey-Kramer* method for *Honestly Significant Differences* is a better approach for multiple comparisons.

**Tukey-Kramer Method for Honestly Significant Differences**

The **Tukey-Kramer method for Honestly Significant Differences** (HSD) is based on the *Studentized range distribution*, which has two values for its degrees of freedom, #I# and #N-I#.

Suppose #Q# has the Studentized range distribution with degrees of freedom #I# and #N-I#, and let #q_{I,N-I,\alpha}# denote the #1 - \alpha# quantile of the Studentized range distribution, i.e., the number #q_{I,N-I,\alpha}# satisfies \[P(Q \geq q_{I,N-I,\alpha}) = \alpha\]

Then for any #\alpha# in #(0,1)#, the CI for each pairwise difference #\mu_i - \mu_j# among #I# treatment groups at overall confidence level #1 - \alpha# is: \[\bigg(\bar{Y}_i - \bar{Y}_j - q_{I,N-I,\alpha}\sqrt{\cfrac{MSE}{2}\Big(\cfrac{1}{n_i} + \cfrac{1}{n_j}\Big)},\,\,\, \bar{Y}_i - \bar{Y}_j + q_{I,N-I,\alpha}\sqrt{\cfrac{MSE}{2}\Big(\cfrac{1}{n_i} + \cfrac{1}{n_j}\Big)}\bigg)\]

We can test #H_0 : \mu_i - \mu_j = 0# against #H_1 : \mu_i - \mu_j \neq 0# simultaneously for each pairwise difference among #I# treatment groups using the test statistic \[Q = \cfrac{\bar{Y}_i - \bar{Y}_j}{\sqrt{\cfrac{MSE}{2}\Big(\cfrac{1}{n_i} + \cfrac{1}{n_j}\Big)}}\] which has the Studentized range distribution with degrees of freedom #I# and #N-I# under #H_0#.

The P-value for each test is computed as #P(Q \geq |q|)#, where #q# is the computed value of #Q# for that test. Given some significance level #\alpha#, for each simultaneous hypothesis test we would reject #H_0# and conclude that there is a significant difference between #\mu_i# and #\mu_j# if the P-value is no larger than #\alpha#.

When the design is balanced, with #J# replicates per treatment group, we replace #\cfrac{MSE}{2} \left(\cfrac{1}{n_i} + \cfrac{1}{n_j}\right)# with #\cfrac{MSE}{J}# in the above formulas, and call it **Tukey’s method** instead.
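In #\mathrm{R}#, the Studentized range quantile and tail probability are available as #\mathtt{qtukey()}# and #\mathtt{ptukey()}#, so a single Tukey-Kramer interval and test can be sketched as follows, again with hypothetical summary statistics rather than data from this section:

```r
# Tukey-Kramer CI and P-value for one pairwise difference.
# All summary statistics here are hypothetical illustration values.
MSE <- 9.6; N <- 15; I <- 3; alpha <- 0.05
ybar_i <- 24.0; ybar_j <- 30.3
n_i <- 5; n_j <- 5

se <- sqrt((MSE/2) * (1/n_i + 1/n_j))
q_crit <- qtukey(alpha, nmeans = I, df = N - I, lower.tail = FALSE)
ci <- (ybar_i - ybar_j) + c(-1, 1) * q_crit * se
q_stat <- (ybar_i - ybar_j) / se
p_value <- ptukey(abs(q_stat), nmeans = I, df = N - I, lower.tail = FALSE)
```

Note that the same #q_{I,N-I,\alpha}# quantile is used for every pairwise interval, which is what gives the intervals their simultaneous coverage.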


**Using R**

To conduct **Bonferroni** pairwise comparisons in #\mathrm{R}#, we can use the #\mathtt{pairwise.t.test()}# function. We demonstrate this using the same data we used in the previous chapter, which you can copy and paste into #\mathrm{R}#:

> Y = c(24.6,22.8,27.5,21.3,24.0,33.4,25.2,19.6,29.4,20.8,32.7,23.3,18.2,28.7,24.0)

> Group = c(2,1,3,1,2,3,2,1,3,1,3,2,1,3,2)

> Data = data.frame(Y, Group)

To conduct the simultaneous pairwise comparisons, you must first specify the dependent variable, then the factor:

`> pairwise.t.test(Data$Y, Data$Group, p.adj="bonf")`

The output is the lower triangle of a matrix showing the adjusted P-value for each of the 3 pairwise comparisons:

|   | 1 | 2 |
|---|---|---|
| 2 | 0.02474 | - |
| 3 | 6.7e-06 | 0.00061 |

Since each adjusted P-value is smaller than #0.025#, we would conclude #\mu_1 \neq \mu_2#, #\mu_1 \neq \mu_3#, and #\mu_2 \neq \mu_3# at overall significance level #0.025#. That is, we accept a #2.5\%# probability that at least one of these three conclusions is a Type I error.

Note that this function does not produce simultaneous CIs, so you would have to create these yourself using the formula provided.
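As an example, a simultaneous Bonferroni CI can be assembled by hand from the #\mathtt{aov()}# fit. Here is a sketch for the difference #\mu_3 - \mu_1# in this dataset, at overall confidence level #95\%#:

```r
# Bonferroni CI for mu_3 - mu_1, built by hand from the ANOVA fit
Y <- c(24.6,22.8,27.5,21.3,24.0,33.4,25.2,19.6,29.4,20.8,32.7,23.3,18.2,28.7,24.0)
Group <- c(2,1,3,1,2,3,2,1,3,1,3,2,1,3,2)
Model <- aov(Y ~ factor(Group))

MSE <- deviance(Model) / df.residual(Model)   # mean-square error
m <- tapply(Y, Group, mean)                   # group sample means
n <- tapply(Y, Group, length)                 # group sizes
alpha <- 0.05
C <- choose(3, 2)                             # 3 pairwise comparisons

se <- sqrt(MSE * (1/n[["3"]] + 1/n[["1"]]))
tcrit <- qt(alpha/(2*C), df = df.residual(Model), lower.tail = FALSE)
ci <- (m[["3"]] - m[["1"]]) + c(-1, 1) * tcrit * se
```

The interval excludes zero, in agreement with the tiny adjusted P-value for the 3 vs. 1 comparison above.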

For **Tukey-Kramer HSD** pairwise comparisons use:

`> TukeyHSD(Model)`

where #\mathtt{Model}# is the output from the #\mathtt{aov()}# function.

The output from #\mathtt{TukeyHSD()}# includes a #95\%# confidence interval (change the setting of #\mathtt{conf.level}# for a different confidence level) for the difference between each pair of population means, along with the P-value for each test:

> Model = aov(Y ~ factor(Group), data=Data)

> TukeyHSD(Model, conf.level=0.99)

This gives the output:

|     | diff | lwr | upr | p adj |
|-----|------|-----|-----|-------|
| 2-1 | 3.68 | -0.4775151 | 7.837515 | 0.0208759 |
| 3-1 | 9.80 | 5.6424849 | 13.957515 | 0.0000062 |
| 3-2 | 6.12 | 1.9624849 | 10.277515 | 0.0005516 |

These adjusted P-values are slightly different from those obtained using the Bonferroni method. If you wanted to create the Tukey-Kramer CIs yourself, you can compute #q_{I,N-I,\alpha}# in #\mathrm{R}# using:

`> qtukey(alpha, I, N-I, lower.tail=FALSE)`

where #\mathtt{alpha}#, #\mathtt{I}#, and #\mathtt{N-I}# are replaced by their numerical values.

For **Fisher's LSD** pairwise comparisons in #\mathrm{R}# you can also use the #\mathtt{pairwise.t.test()}# function, but with the setting `p.adj="none"`.

One thing to be aware of: You could find a statistically significant difference between the means of, for example, groups 1 and 3, but no statistically significant difference between the means of groups 1 and 2 and no statistically significant difference between the means of groups 2 and 3.

This seems to be contradictory: if group 2 has the same mean as both group 1 and group 3, then all 3 groups should have the same mean. But remember we are evaluating the strength of evidence against three different null hypotheses. It can be that the evidence is strong enough against one of the null hypotheses, but not against the other two, even if all three means are different.

**Example:** An article from 2005 presents pH measurements of soil specimens taken from three different types of soil. The data can be pasted into #\mathrm{R}# using the code below:

> Data = data.frame(pH=c(6.53,6.03,6.75,6.82,6.24,6.07,6.07,5.36,5.57,5.48,

5.27,5.80,5.03,6.65,6.03,6.16,6.63,6.13,6.05,5.68,6.25,5.43,6.46,6.91,5.75,

6.53), Type=c(rep("A",5),rep("G",9),rep("R",12)))

Here soil type A is Alluvium, soil type G is Glacial Till, and soil type R is Residuum.

In #\mathrm{R}# we first test for a difference in population mean pH among the #3# soil types:

> Model = aov(pH ~ factor(Type), data=Data)

> summary(Model)

The resulting P-value of #0.0104# indicates that the means are not all the same. So we would ordinarily follow this conclusion with post-hoc tests for pairwise differences among the #3# population means.

Let us use #\mathrm{R}# to conduct simultaneous tests for all pairwise differences in population means among the #3# soil types, at overall significance level #0.05#, using both the Bonferroni and the Tukey-Kramer methods.

**Solution:**

In #\mathrm{R}# we obtain the results of the Bonferroni method as follows:

`> pairwise.t.test(Data$pH, Data$Type, p.adj="bonf")`

This gives us the output:

|   | A | G |
|---|---|---|
| G | 0.013 | - |
| R | 0.608 | 0.072 |

The P-value for the difference in the population mean pH between Alluvium and Glacial Till falls below #0.05#, so we conclude that these two means are different #(p=0.013)#.

But the P-values for the other two comparisons are larger than #0.05#, so we cannot conclude that the mean pH differs between Glacial Till and Residuum #(p=0.072)# or between Alluvium and Residuum #(p=0.608)#.

Next we obtain the results of the Tukey-Kramer method as follows:

`> TukeyHSD(Model)`

This gives the output:

|     | diff | lwr | upr | p adj |
|-----|------|-----|-----|-------|
| G-A | -0.7740 | -1.38710472 | -0.1608953 | 0.0116802 |
| R-A | -0.3065 | -0.89159412 | 0.2785941 | 0.4030792 |
| R-G | 0.4675 | -0.01720184 | 0.9522018 | 0.0601034 |

Since the Tukey-Kramer method is less conservative, and thus more powerful, the three P-values are smaller, but not to the extent that any of our conclusions would change. Only Alluvium and Glacial Till have significantly different mean soil pH.
