Quartiles
In this lesson you will learn:
- What quartiles are and how to calculate them.
- What the inter-quartile range is and how to calculate it.
- What the five-number summary is.
- What a boxplot is and how to construct it.
Another common method to get a better understanding of how a data set is distributed is to divide the data into a number of equal-sized parts. If the data set is divided up into four parts, the resulting cut-off points are called quartiles.
Quartiles
Quartiles divide a sorted data set into four parts, such that each part contains $25\%$ of all the elements in the data set.
Quartile Calculation
The calculation of quartiles starts by ordering the scores in the distribution from smallest to largest. Next, to find the index $i$ of the $x$th quartile $Q_x$, use the following formula:
where $N$ is the total number of scores in the data set and $x$ is a value between 1 and 3.
It is important to note that the formula above is used to determine the location of the quartile and not the value associated with it.
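As an illustration only, one widely used convention for this index is shown below. It is an assumption here: several conventions exist, and R's default `quantile()` uses a different one, so results can differ slightly between methods. The symbols $i$ (index), $x$ (quartile number) and $N$ (number of scores) are labels chosen for this sketch.

$$ i = \frac{x}{4} \cdot (N + 1) $$

For example, with $N = 7$ scores the first quartile ($x = 1$) would sit at index $i = \frac{1}{4} \cdot 8 = 2$, that is, at the second score of the ordered data.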
If the index $i$ is an integer, then the quartile is the score located at the $i$th position of the ordered data.
Whenever $i$ is not an integer, linear interpolation is used to calculate the quartile:
- Find the two integers closest to $i$ by rounding $i$ up and down. These indices are denoted by $i_{+}$ and $i_{-}$, respectively.
- Determine the values located at these positions. These values are denoted by $X_{i_{+}}$ and $X_{i_{-}}$, respectively.
- Calculate the quartile with the following formula:
$$Q_x = X_{i_{-}} + (i - i_{-}) \cdot (X_{i_{+}} - X_{i_{-}})$$
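The procedure above can be sketched in R as follows. This is a minimal sketch, not a definitive implementation: the function name `quartile_sketch`, the example data, and in particular the index formula `x / 4 * (N + 1)` are assumptions made for illustration. Other index conventions exist; R's own `quantile()` uses a different one by default, and the convention assumed here should correspond to `quantile(..., type = 6)`.

```r
# Hypothetical helper: the x-th quartile (x = 1, 2 or 3) of a numeric vector.
# ASSUMPTION: the index is i = x / 4 * (N + 1); other conventions exist.
quartile_sketch <- function(scores, x) {
  sorted <- sort(scores)              # order the scores from smallest to largest
  N <- length(sorted)
  i <- x / 4 * (N + 1)                # location of the quartile, not its value

  lower <- max(floor(i), 1)           # index rounded down (kept within 1..N)
  upper <- min(ceiling(i), N)         # index rounded up (kept within 1..N)

  # Linear interpolation between the two neighbouring scores.
  # When i is an integer, lower == upper and the score itself is returned.
  sorted[lower] + (i - floor(i)) * (sorted[upper] - sorted[lower])
}

# Made-up data with N = 7, so the index of the first quartile is the integer 2.
scores <- c(7, 1, 5, 3, 9, 11, 13)
quartile_sketch(scores, 1)   # second score of the sorted data: 3

# Made-up data with N = 10, so the index 2.75 is not an integer
# and the result is interpolated between the 2nd and 3rd scores.
quartile_sketch(1:10, 1)
```

Keeping the index calculation on its own line makes it easy to swap in a different convention if a course or software package prescribes one.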
To calculate the quartile, first sort the values in ascending order:
Next, to find the index of the quartile ($i$), use the following formula:
Since $i$ is an integer, the quartile is the score located at the $i$th position of the ordered data:
One particularly useful measure that can be derived from the quartiles of a distribution is the inter-quartile range.
Inter-Quartile Range
The inter-quartile range ($IQR$) is the difference between the third quartile and the first quartile of a distribution: $IQR = Q_3 - Q_1$.
The inter-quartile range thus measures how spread out the middle $50\%$ of the data are.
Classifying Outliers
The $IQR$ can be used as a way to classify measurements as outliers. Under this convention, we say that a measurement is an outlier if:
- It is less than $Q_1 - 1.5 \cdot IQR$, or
- It is more than $Q_3 + 1.5 \cdot IQR$.
Some statisticians distinguish moderate outliers, identified using the criteria above, from extreme outliers by replacing the $1.5$ with $3$.
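The outlier convention can be sketched in R as below. This is an illustrative sketch under stated assumptions: the helper name `flag_outliers`, its multiplier argument `k` (taken to be 1.5 for moderate and 3 for extreme outliers) and the data are all hypothetical. The quartiles come from R's default `quantile()` method, which may differ slightly from a hand calculation.

```r
# Hypothetical helper: return the values lying outside the fences
# Q1 - k * IQR and Q3 + k * IQR.
# ASSUMPTION: k = 1.5 for moderate outliers, k = 3 for extreme outliers.
flag_outliers <- function(values, k = 1.5) {
  q1  <- quantile(values, 0.25, names = FALSE)  # first quartile
  q3  <- quantile(values, 0.75, names = FALSE)  # third quartile
  iqr <- q3 - q1                                # inter-quartile range
  values[values < q1 - k * iqr | values > q3 + k * iqr]
}

# Made-up salaries (in thousands), for illustration only.
Salary <- c(31, 35, 38, 40, 42, 45, 47, 50, 70, 120)

moderate_or_extreme <- flag_outliers(Salary)         # outside the 1.5 * IQR fences
extreme             <- flag_outliers(Salary, k = 3)  # outside the 3 * IQR fences

print(moderate_or_extreme)
print(extreme)
```

With this made-up data, 70 lies outside the narrower fences only, while 120 lies outside both.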
Quartiles can also be used to construct a so-called five-number summary.
Five-Number Summary
The minimum value of a data set is thought of as the null quartile, denoted $Q_0$, and the maximum value is thought of as the fourth quartile, denoted $Q_4$.
The quartiles $Q_0$, $Q_1$, $Q_2$, $Q_3$, and $Q_4$ together are called the five-number summary for measurements on a quantitative variable.
Note that all quartiles are special cases of quantiles.
Quantiles
For quantitative data, the $q$th quantile (with $q$ between 0 and 1) is any number such that $100q$ percent of the data fall at or below that number.
The first quartile is the $0.25$ quantile, because $25\%$ of the data fall at or below the first quartile. The median is the $0.50$ quantile, and the third quartile is the $0.75$ quantile. The minimum and maximum can be thought of as the $0$ quantile and $1$ quantile, respectively.
Meanwhile, the $0.10$ quantile is a number such that $10\%$ of the data fall at or below it, and the $0.90$ quantile is a number such that $90\%$ of the data fall at or below it.
If $p = 100q$, then we can refer to the $q$th quantile instead as the $p$th percentile.
Percentiles
Percentiles divide a sorted data set into one hundred parts, such that each part contains $1\%$ of all the elements in the data set.
The $0.10$ quantile is the $10$th percentile, and the $0.90$ quantile is the $90$th percentile.
If in a national exam your score is at the $95$th percentile, it means $95\%$ of the people who took the exam scored at or below your score (i.e., you did very well!).
We discussed previously that a histogram is a useful way to visualize the distribution of a quantitative variable. Using the quartiles, there is an alternative to the histogram which is also commonly used: the boxplot.
Boxplot
A boxplot uses just one axis (usually vertical) to represent the numerical scale of the variable. The boxplot consists of a rectangle of arbitrary width with one side at $Q_1$ and the other side at $Q_3$, and a thick line inside the box at $Q_2$ (the median).
Then from the bottom of the box, we extend a line to the minimum value of our data set, excluding any outliers, while from the top of the box we extend a line to the maximum value of our data set, excluding any outliers. These lines are often called whiskers. Any outliers are represented by separate points.
This boxplot shows that the measurements have a symmetric distribution with no outliers.
This boxplot shows that the data are skewed to the right (because the top whisker is longer than the bottom whisker, and the median is closer to $Q_1$ than to $Q_3$), with outliers at the high end.
This boxplot shows that the data are skewed to the left with many outliers.
These are side-by-side boxplots. We can compare the same variable for two different groups ( and ).
We can see that the median of the variable in group is greater than the median of that variable in group , and we can also see that the spread of the measurements in group is much smaller than the spread in group . But both distributions appear symmetric, with one outlier in group .
Using R
Five-Number Summary
Suppose you have measurements on a quantitative variable stored in a numeric vector named Salary in your workspace.
To find the minimum, first quartile, third quartile and maximum, respectively, the functions are:
> min(Salary)
> quantile(Salary,0.25)
> quantile(Salary,0.75)
> max(Salary)
However, you can obtain all of these quantities, and the mean and median, in one go, using the summary function:
> summary(Salary)
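As a quick way to try these functions, the sketch below uses a small made-up Salary vector (the values are hypothetical and only for illustration). It also shows that calling `quantile()` without a second argument returns the five-number summary directly, because its default probabilities are 0, 0.25, 0.5, 0.75 and 1.

```r
# Made-up salaries (in thousands), for illustration only.
Salary <- c(31, 35, 38, 40, 42, 45, 47, 50, 120)

# With no probabilities given, quantile() returns Q0 (minimum), Q1,
# Q2 (median), Q3 and Q4 (maximum), using R's default quantile method.
five_numbers <- quantile(Salary)
print(five_numbers)

# summary() reports the same five values and adds the mean.
print(summary(Salary))
```

Note that the quartiles reported by R follow its default quantile method, so they can differ slightly from values computed by hand with another convention.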
Inter-Quartile Range
You can find the inter-quartile range either by subtracting the first quartile from the third quartile, or more efficiently by using the IQR function:
> IQR(Salary)
Quantiles
If you want the $q$ quantile of the data, i.e., the $100q$th percentile, use the quantile function:
> quantile(Salary,q)
Percentiles
If you want the percentiles from the $10$th to the $90$th incremented by tens, you could use:
> quantile(Salary,seq(0.1,0.9,by=0.1))
Boxplot
If you want to make a boxplot of this variable, use the boxplot function:
> boxplot(Salary)
Side-by-Side Boxplots
If you want side-by-side boxplots, with a separate boxplot for each of two or more groups situated next to each other, you need a second vector which indicates to which group each measurement belongs. This vector can be a character vector or an integer vector. Suppose the vector is named Group. Then use the boxplot function in this manner:
> boxplot(Salary ~ Group)
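A self-contained sketch of the side-by-side call is given below. The salaries and the group labels "A" and "B" are made up for illustration. Besides drawing the plot, `boxplot()` invisibly returns the statistics it plotted, which can be handy for checking what the picture shows.

```r
# Made-up salaries (in thousands) for two hypothetical groups of nine people.
Salary <- c(31, 35, 38, 40, 42, 45, 47, 50, 120,
            52, 55, 57, 60, 61, 63, 66, 70, 72)
Group  <- rep(c("A", "B"), each = 9)

# One box per group; the returned list holds the plotted statistics.
bp <- boxplot(Salary ~ Group, ylab = "Salary")

bp$names   # group labels, in plotting order
bp$stats   # one column per group: lower whisker, box bottom, median, box top, upper whisker
bp$out     # values drawn as separate outlier points
```

R computes the box edges for a boxplot with a slightly different rule than `quantile()`, so the box may not match `quantile(Salary, 0.25)` and `quantile(Salary, 0.75)` exactly for every data set.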