Chapter 10: Analysis of Variance: One-way Analysis of Variance
One-way ANOVA: Model and Assumptions
One-way ANOVA Model
The statistical model that serves as the basis for the one-way ANOVA procedure is called the one-way ANOVA model and is given by:
\[Y_i = \mu_i + \varepsilon\]
Here:
- #Y_i# is the value of the outcome variable #Y# for a randomly-selected observation from population #i#.
- #\mu_i# is the population mean for the outcome variable #Y# for level #i#.
- #\varepsilon# is called the error, by which we mean the amount by which the value of #Y# deviates from #\mu_i# for a randomly-selected observation from population #i#.
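To make the roles of #\mu_i# and #\varepsilon# concrete, the model can be simulated. The following is a minimal sketch only: the population means, the error standard deviation, and the helper name `draw_observation` are hypothetical, and the error is assumed to be normally distributed with mean #0#.

```python
import random

# Hypothetical population means, one per level of the factor (illustrative values).
population_means = {"A": 10.0, "B": 12.0, "C": 15.0}
ERROR_SD = 2.0  # assumed common standard deviation of the error term


def draw_observation(level, rng):
    """Draw one value Y_i = mu_i + epsilon for the given level."""
    epsilon = rng.gauss(0.0, ERROR_SD)  # assumption: normal error with mean 0
    return population_means[level] + epsilon


rng = random.Random(42)  # fixed seed so the sketch is reproducible
sample_a = [draw_observation("A", rng) for _ in range(1000)]

# The average of many draws should lie close to the population mean of level A.
print(sum(sample_a) / len(sample_a))
```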
Assumptions of a one-way ANOVA
The following assumptions must be satisfied in order for a one-way ANOVA model to produce valid results:
- The dependent variable is normally distributed in each of the populations being compared.
- Homogeneity of variances (homoscedasticity), meaning the population variance is the same across all populations.
- The observations must be independent, meaning:
- Random sampling is used to draw samples from the populations.
- No observation can be part of more than one sample.
- There is no relationship between the observations within each sample or between the samples.
Both the assumption of normality and the homogeneity assumption can be checked by analyzing the residuals of the model.
For the one-way ANOVA model, a residual #e_{ij}# is defined as the difference between an observed value #Y_{ij}# (the #j#th observation in sample #i#) and the mean #\bar{Y}_i# of its sample:
\[e_{ij} = Y_{ij} - \bar{Y}_i\]
If the residuals are normally distributed and the variance of the residuals is the same across all samples, then we can be confident that the normality and homogeneity assumptions hold for the underlying populations as well.
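The residuals can be computed by subtracting each sample's mean from its own observations. The following is a minimal sketch using only the Python standard library; the data and the helper name `residuals_by_group` are hypothetical.

```python
from statistics import mean

# Hypothetical data: three samples, one per level of the factor.
samples = {
    "A": [4.0, 5.0, 6.0],
    "B": [7.0, 9.0, 11.0],
    "C": [1.0, 2.0, 3.0],
}


def residuals_by_group(groups):
    """Return e_ij = Y_ij - mean(Y_i) for every observation, keyed by group."""
    out = {}
    for name, values in groups.items():
        group_mean = mean(values)
        out[name] = [y - group_mean for y in values]
    return out


residuals = residuals_by_group(samples)
print(residuals["B"])  # residuals of sample B around its own mean
```

By construction, the residuals within each sample sum to zero.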
The assumption of independent observations relies on whether the data collection was done properly, and must thus be satisfied through proper research designs.
To informally check whether the assumption of normality is satisfied, we can construct a normal quantile-quantile (Q-Q) plot for the residuals.
Normal Quantile-Quantile Plot
The normal quantile-quantile (Q-Q) plot for the residuals is a scatterplot created by plotting the quantiles of the residuals against the theoretical quantiles of the standard normal distribution.
If the assumption of normality holds, we should see the points form a roughly straight line.
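The coordinates of such a plot can be sketched as follows. This is an illustration only: the residual values and the helper name `qq_points` are hypothetical, and the plotting positions #(k - 0.5)/n# are an assumed convention (statistical software may use a slightly different one).

```python
from statistics import NormalDist


def qq_points(residuals):
    """Pair each sorted residual with a theoretical standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions (k - 0.5) / n; other conventions exist.
    theoretical = [std_normal.inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]
    return list(zip(theoretical, ordered))


example_residuals = [1.2, -0.4, 0.3, -1.5, 0.4]  # hypothetical values
points = qq_points(example_residuals)
for theoretical_q, residual_q in points:
    print(round(theoretical_q, 3), residual_q)
```

Plotting the first coordinate on the horizontal axis and the second on the vertical axis gives the Q-Q plot.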
The following Q-Q plot of the residuals shows a good match between the quantiles of the residuals and the theoretical quantiles of the standard normal distribution:
In contrast, the following two Q-Q plots show a clear departure from normality:
To formally check whether the assumption of normality is satisfied, we can conduct the Shapiro-Wilk test for normality.
Shapiro-Wilk Test for Normality
The Shapiro-Wilk test for normality has the following set of hypotheses:
\[\begin{array}{rcl}
H_0 &:& \text{The residuals are normally distributed.}\\\\
H_a &:& \text{The residuals are not normally distributed.}
\end{array}\]
A small #p#-value thus provides evidence that the residuals are not normally distributed, which in turn suggests that the dependent variable is not normally distributed in at least one of the populations being compared.
If the assumption of normality is violated, then any inference based on that assumption will be invalid.
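As an illustration, the Shapiro-Wilk test can be run with the function `scipy.stats.shapiro`, assuming the SciPy library is available. The two sets of residuals below are hypothetical and constructed deterministically: one is bell-shaped, the other strongly right-skewed.

```python
from math import exp
from statistics import NormalDist, mean

from scipy import stats

n = 30
# Hypothetical residuals, built deterministically for illustration only.
z = [NormalDist().inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]
bell_shaped = z  # evenly spaced standard normal quantiles
skewed_raw = [exp(v) for v in z]  # strongly right-skewed values
skewed = [v - mean(skewed_raw) for v in skewed_raw]  # centered, like residuals

# shapiro returns the test statistic W and the p-value.
w_bell, p_bell = stats.shapiro(bell_shaped)
w_skew, p_skew = stats.shapiro(skewed)

print("bell-shaped:", w_bell, p_bell)
print("skewed:     ", w_skew, p_skew)
```

The skewed residuals should produce a much smaller #p#-value than the bell-shaped ones, leading to rejection of #H_0# for the skewed case only.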
To informally check whether the homogeneity of variances assumption is satisfied, we can construct a residual plot.
Residual Plot
A residual plot is a scatterplot of the residuals #e_{ij}# against the sample means #\bar{Y}_i#.
If the homogeneity of variances assumption holds, the residual plot should show approximately the same spread for each observed sample mean, and no extreme outliers.
Take a look at the following residual plot:
Here, we see that the points are scattered between #-15# and #15# across the full range of the sample means, with no pattern showing. The variance of the residuals thus does not appear to depend on the level of the independent variable, i.e., we have homoscedasticity.
In contrast, the following residual plot shows clear evidence of heteroscedasticity:
Here we see that the vertical spread of the residuals grows larger as the sample means increase, which makes it obvious that the variance is not constant, but depends on the level of the factor. This invalidates any statistical inference we might make on the basis of the ANOVA model.
In the above plot, we see that there is a relationship between the magnitude of the residuals and the sample means, i.e., the larger the sample mean, the larger the average magnitude of the residuals.
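As a numeric companion to the residual plot, the spread of the residuals can be summarized per sample. The following is a minimal sketch with hypothetical data in which the spread grows with the sample mean; the helper names `residual_plot_points` and `residual_spread` are assumptions made for illustration.

```python
from statistics import mean, stdev

# Hypothetical samples whose spread increases with the sample mean.
samples = {
    "low": [9.0, 10.0, 11.0],
    "mid": [17.0, 20.0, 23.0],
    "high": [24.0, 30.0, 36.0],
}


def residual_plot_points(groups):
    """Return (sample mean, residual) pairs, the coordinates of a residual plot."""
    points = []
    for values in groups.values():
        m = mean(values)
        points.extend((m, y - m) for y in values)
    return points


def residual_spread(groups):
    """Return the standard deviation of the residuals, keyed by sample mean."""
    spread = {}
    for values in groups.values():
        m = mean(values)
        spread[m] = stdev([y - m for y in values])
    return spread


points = residual_plot_points(samples)
spread = residual_spread(samples)
print(spread)
```

A standard deviation that increases steadily with the sample mean, as in this constructed example, is the numeric counterpart of the fan shape in a residual plot.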
To formally check whether or not the homogeneity of variances assumption is satisfied, we can conduct either Levene's test or Bartlett's test.
Levene's Test and Bartlett's Test
Both Levene's test and Bartlett's test have the following set of hypotheses:
\[\begin{array}{rcl}
H_0 &:& \text{All population variances are equal.}\\\\
H_a &:& \text{At least two population variances differ from each other.}
\end{array}\]
A small #p#-value thus implies that the homogeneity of variance assumption has been violated, and thereby invalidates any inference based on this assumption.
Of these two tests, Levene's test is less sensitive to departures from normality and is to be used whenever the assumption of normality is violated. If the assumption of normality is satisfied, however, then Bartlett's test has more statistical power.
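Both tests are available in SciPy as `scipy.stats.levene` and `scipy.stats.bartlett`, assuming that library is installed. The sketch below uses hypothetical, deterministically constructed samples, and the helper name `variance_tests` is an assumption. Note that `levene` with `center="median"` (SciPy's default) is the median-based variant of the test; `center="mean"` gives the mean-based version.

```python
from scipy import stats

# Hypothetical data: a symmetric set of deviations shifted to three locations.
base = [-4.5, -3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5, 4.5]
equal_spread = [
    [10 + b for b in base],
    [20 + b for b in base],
    [30 + b for b in base],
]
unequal_spread = [
    [10 + b for b in base],
    [20 + b for b in base],
    [30 + 6 * b for b in base],  # third sample has six times the spread
]


def variance_tests(samples):
    """Return the p-values of Levene's test and Bartlett's test."""
    levene = stats.levene(*samples, center="median")
    bartlett = stats.bartlett(*samples)
    return levene.pvalue, bartlett.pvalue


p_levene_eq, p_bartlett_eq = variance_tests(equal_spread)
p_levene_uneq, p_bartlett_uneq = variance_tests(unequal_spread)

print("equal spread:  ", p_levene_eq, p_bartlett_eq)
print("unequal spread:", p_levene_uneq, p_bartlett_uneq)
```

For the samples with equal spread neither test should reject #H_0#, whereas for the samples with unequal spread both #p#-values should be small.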