[A, SfS] Chapter 3: Probability Distributions: 3.3: Normal Quantile-Quantile Plots
Normal Quantile-Quantile Plots
In this lesson you will learn about normal quantile-quantile plots, an informal way to check whether data measured on a sample can be regarded as a sample of values from a normal distribution.
#\text{}#
Suppose you have a data sample like this:
\[37.22,\,\,\,36.84,\,\,\,36.00,\,\,\,35.89,\,\,\,34.10,\,\,\,35.26,\,\,\,36.43,\,\,\,37.52,\,\,\,35.59,\,\,\,37.39,\,\,\,34.98\]
and you want to know whether it is reasonable to assume that these data are sampled from a normal distribution. One tool for assessing this is the normal quantile-quantile plot. But what is this plot, and how is it constructed?
Normal Quantile-Quantile Plot
The idea of the normal quantile-quantile plot (Q-Q plot) is to show visually whether the quantiles of the data sample match the quantiles of the standard normal distribution (i.e., the normal distribution with mean #0# and standard deviation #1#).
We would not expect a perfect match, especially with a small sample, but if the match is very poor then we would have a strong reason to doubt that the data were sampled from a normal distribution.
#\text{}#
Quantiles
The (#\boldsymbol{k/n}#)th quantile of a data sample is any number that has #k# out of #n# sample observations less than or equal to it.
For any #p# between #0# and #1#, the #\boldsymbol{p}#th quantile of a population distribution for some variable is the number that has #p(100)\%# of the population values less than or equal to it.
For the standard normal distribution, the #0.5# quantile is zero, because #50\%# of the standardized population values are below zero.
For example, in a sample of #n = 50# subjects, the #(25/50)#th quantile, that is, the #0.5# quantile, is any number that has half the data values less than or equal to it. We call this quantile the sample median.
To find the quantiles of the above sample of size #n = 11#, we first put the values in order from smallest to largest:
\[34.10,\,\,\, 34.98,\,\,\, 35.26,\,\,\, 35.59,\,\,\, 35.89,\,\,\, 36.00,\,\,\, 36.43,\,\,\, 36.84,\,\,\, 37.22,\,\,\, 37.39,\,\,\, 37.52\]
Then #34.10# is the #1/11#th (#\approx 0.091#th) sample quantile, #34.98# is the #2/11#th (#\approx 0.182#th) sample quantile, #35.26# is the #3/11#th (#\approx 0.273#th) sample quantile, and so on. Thus #37.52# is the #11/11#th (#=1.000#th) quantile.
The important point is: between any two consecutive sample quantiles lies #1/11# (about #9.1\%#) of the data.
In general: Between any two consecutive sample quantiles for a sample of size #n# lies #1/n# of the data.
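The ordering step above can be sketched in #\mathrm{R}#. This is only an illustration: the names #\mathtt{x}#, #\mathtt{x\_sorted}#, #\mathtt{level}# and #\mathtt{quantile\_table}# are made up for this example and do not appear elsewhere in the lesson.

```r
# The sample from the text
x <- c(37.22, 36.84, 36.00, 35.89, 34.10, 35.26,
       36.43, 37.52, 35.59, 37.39, 34.98)
n <- length(x)

# Order the values; the k-th smallest value is the (k/n)th sample quantile
x_sorted <- sort(x)
level <- (1:n) / n

# One row per sample quantile: k, the level k/n, and the value
quantile_table <- data.frame(k = 1:n,
                             level = round(level, 3),
                             value = x_sorted)
print(quantile_table)
```

Each row of the table pairs a level #k/n# with the #k#th smallest observation, so consecutive rows differ by #1/11# in level.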
#\text{}#
Comparing with the Standard Normal Distribution
To compare with the standard normal distribution, there are different possible approaches. Most software packages use the following approach:
1. If #n# is larger than #10#: Choose numbers such that each pair of consecutive numbers contains between them #1/n# of the total area under the standard normal density curve.
If #n# is odd, as in our example, the two middle equal-area strips meet at #0#, so that #0# is the middle quantile.
If #n# is even, the middle equal-area strip is centered around #0#.
In either case, this leaves half of #1/n# of the area (that is, #1/(2n)# of the area) in each of the two tails. In general, given that #Z# has the standard normal distribution, the quantiles #q_1,...,q_n# are chosen so that
\[P(Z \leq q_k) = \cfrac{1 + 2(k - 1)}{2n}\]
for #k = 1,2,...,n#.
For our example above, these numbers are:
\[ -1.69,\,\,\,- 1.10,\,\,\,- 0.75,\,\,\,-0.47,\,\,\,-0.23,\,\,\,0.00,\,\,\,0.23,\,\,\,0.47,\,\,\,0.75,\,\,\,1.10,\,\,\,1.69\]
As you can see, these numbers are distributed symmetrically around #0#, and #0# is the middle number. Looking at the figure below, you can see how these numbers split up the area under the standard normal density curve into strips of equal area (each strip has area #1/11 \approx 0.091#, but the left-over area in each tail is half of #1/11#, i.e. #1/22 \approx 0.045#).
Thus, given that #Z \sim N(0,1)#:
\[\begin{array}{rcl}
P(Z \leq -1.69) &=& \cfrac{1}{22} \approx 0.045\\\\
P(Z \leq -1.10) &=& \cfrac{1}{22} + \cfrac{1}{11} = \cfrac{3}{22} \approx 0.136\\\\
P(Z \leq -0.75) &=& \cfrac{1}{22} + \cfrac{1}{11} + \cfrac{1}{11} = \cfrac{5}{22} \approx 0.227
\end{array}\]
and so on.
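As a check on the numbers above, the following sketch computes the cumulative probabilities from the formula and converts them to standard normal quantiles with #\mathtt{qnorm()}#. The variable names are chosen for this illustration only; #\mathtt{ppoints()}# is a base #\mathrm{R}# function that generates the same probabilities when #n > 10#.

```r
n <- 11
k <- 1:n

# Cumulative probabilities (1 + 2(k - 1)) / (2n) from the formula above
p <- (1 + 2 * (k - 1)) / (2 * n)

# Standard normal quantiles q_k satisfying P(Z <= q_k) = p_k
q <- qnorm(p)
print(round(q, 2))

# For n > 10, ppoints(n) returns the same probabilities
print(all.equal(p, ppoints(n)))

# The area between consecutive quantiles is 1/n
print(all.equal(diff(pnorm(q)), rep(1 / n, n - 1)))
```

Rounded to two decimals, the values in #\mathtt{q}# should agree with the list of eleven numbers given in the text.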
For comparison, the original ordered data (with the scale adjusted) are displayed on an axis below the axis of the standard normal density curve. If the match were perfect, the tick-marks on both axes would be perfectly aligned.
#\text{}#
In the normal Q-Q plot, the locations of these tick-marks are plotted against each other in a scatterplot, with a diagonal line added to aid with our visualization of the linearity of the match.
As can be seen in the Q-Q plot below, the match is linear enough that it is reasonable to assume the data are sampled from a normal distribution. Note that the point on the far right of the Q-Q plot is far from the line, which corresponds to the large mismatch of the last pair of tick-marks on the two axes of the previous figure.
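A minimal sketch of how such a scatterplot could be built by hand in #\mathrm{R}# for the example sample is shown here. The variable names are illustrative. The reference line is drawn with #\mathtt{qqline()}#, which by default passes through the first and third quartile pairs; that may differ from the line used in the figure.

```r
# The sample from the text
x <- c(37.22, 36.84, 36.00, 35.89, 34.10, 35.26,
       36.43, 37.52, 35.59, 37.39, 34.98)
n <- length(x)

# Standard normal quantiles for n > 10: P(Z <= q_k) = (2k - 1) / (2n)
theoretical <- qnorm((2 * (1:n) - 1) / (2 * n))

# Ordered data values play the role of the sample quantiles
sample_q <- sort(x)

# Scatterplot of sample quantiles against standard normal quantiles
plot(theoretical, sample_q, pch = 20,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal Q-Q Plot (by hand)")

# Reference line through the first and third quartile pairs
qqline(x)

# The built-in function uses the same horizontal positions
built_in <- qqnorm(x, plot.it = FALSE)
print(all.equal(sort(built_in$x), theoretical))
```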
#\text{}#
2. If #n# is #10# or smaller: An adjustment is made so that the area of #\cfrac{4}{4n + 1}# lies between consecutive quantiles (slightly less than #1/n#). This pulls the quantiles a little bit towards the center.
In general, for #n \leq 10#, the quantiles #q_1,q_2,...,q_n# are chosen so that:
\[P(Z \leq q_k) = \cfrac{5 + 8(k - 1)}{8n + 2}\]
for #k = 1,2,...,n#.
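The adjusted formula can be checked numerically. The sketch below uses a hypothetical sample size of #n = 8# (not taken from the text) and compares the result with #\mathtt{ppoints()}#, the base #\mathrm{R}# function that produces these probabilities for #n \leq 10#.

```r
n <- 8          # hypothetical small sample size, chosen for illustration
k <- 1:n

# Adjusted cumulative probabilities (5 + 8(k - 1)) / (8n + 2)
p_small <- (5 + 8 * (k - 1)) / (8 * n + 2)

# For n <= 10, ppoints(n) returns the same probabilities
print(all.equal(p_small, ppoints(n)))

# The area between consecutive quantiles is 4 / (4n + 1)
print(all.equal(diff(p_small), rep(4 / (4 * n + 1), n - 1)))

# Adjusted and unadjusted quantiles side by side
q_small <- qnorm(p_small)
q_unadjusted <- qnorm((2 * k - 1) / (2 * n))
print(round(cbind(adjusted = q_small, unadjusted = q_unadjusted), 2))
```

In the printed comparison, the adjusted quantiles are smaller in absolute value than the unadjusted ones, which is the pull towards the center described above.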
For further understanding, check out https://www.youtube.com/watch?v=X9_ISJ0YpGw
#\text{}#
Using R
Normal Q-Q Plot
Given a vector of data in #\mathrm{R}# that represents measurements on a sample, you can make a normal quantile-quantile plot in #\mathrm{R}# very easily using the #\mathtt{qqnorm()}# command.
Suppose the name of the vector is #\mathtt{Data}#. Then to make the normal Q-Q plot, use:
> qqnorm(Data)
Add Diagonal
To add a diagonal line that represents where the points should lie if there is a perfect match between the sample quantiles and the quantiles of the standard normal distribution, follow the above command with:
> qqline(Data)
The following plot was created by first creating a vector named #\mathtt{Data}# by sampling #26# values from a #N(100,10^2)# distribution, then making a Q-Q plot of these data:
> Data = rnorm(26,100,10)
> qqnorm(Data,pch=20)
> qqline(Data)
As you can see, this is much easier than using the formulas shown in the previous slides. Now we can see that the points aren't exactly on the line, but they are close enough to the line that it is reasonable to assume that the data were sampled from a normal distribution.
In contrast, suppose we take a sample of the same size from a chi-square distribution with #8# degrees of freedom, then use a normal Q-Q plot to assess whether the data could have been sampled from a normal distribution:
> Data2 = rchisq(26,8)
> qqnorm(Data2,pch=20)
> qqline(Data2)
Now we see, especially in the lower-left corner, that the first few points deviate quite a lot from the line, so we would suspect (correctly) that the data do not represent a sample from a normal distribution.
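Because #\mathtt{rnorm()}# and #\mathtt{rchisq()}# draw random values, every run gives a different plot. The sketch below fixes the random seed so that the comparison can be repeated and places the two plots side by side. The seed value and the names #\mathtt{normal\_sample}# and #\mathtt{skewed\_sample}# are arbitrary choices for this illustration.

```r
set.seed(1)  # arbitrary seed, only to make the example repeatable

# Same sample size and distributions as in the text
normal_sample <- rnorm(26, 100, 10)
skewed_sample <- rchisq(26, 8)

# Two plotting panels next to each other; remember the old settings
old_par <- par(mfrow = c(1, 2))

qqnorm(normal_sample, pch = 20, main = "N(100, 10^2) sample")
qqline(normal_sample)

qqnorm(skewed_sample, pch = 20, main = "Chi-square(8) sample")
qqline(skewed_sample)

# Restore the previous plotting settings
par(old_par)
```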