Theory of Hypothesis Testing

In this section, we will examine the purpose of hypothesis testing and how to use it correctly.

In science, we are constantly building on previously-established knowledge using the scientific method. The scientific method proceeds very carefully in order to avoid making false claims. Based on established theory and observations of the environment, and based on a specific set of assumptions that define the situation being studied, a scientist makes a research hypothesis.

Research Hypothesis

A research hypothesis is a clear, succinct, and testable statement about a population.

The opposite of the research hypothesis is the null hypothesis.

Null Hypothesis

The null hypothesis simply declares that the research hypothesis is not true.

The scientific method requires a scientist to assume at the beginning of research that the null hypothesis is true, and thus that the research hypothesis is false. Then the research team acquires one or more representative samples of the populations about which the hypotheses are concerned, makes careful measurements on those samples, and analyzes the results using statistical tools.

If the data collected from the sample(s) are collectively consistent with the null hypothesis, then the scientist has failed to find sufficient evidence against the null hypothesis (even if the research hypothesis is in fact true). However, if the data are collectively quite inconsistent with the null hypothesis, and are therefore more consistent with the research hypothesis, then the evidence refutes the null hypothesis. Therefore, the scientist can decide to reject the null hypothesis in favor of the research hypothesis. It is always possible in such a case that the null hypothesis is in fact true but the data arose from an unusually unrepresentative sample (or from sloppy methodology!), so a scientist should never claim to have proved the research hypothesis. Proof requires successful replication of the result in multiple future studies by other scientists in the worldwide scientific community.

Notation

In statistics, the null hypothesis is represented by the symbol #H_0#, and the research hypothesis (often called the alternative hypothesis) is represented by #H_1# or #H_a#.

Some examples of hypotheses we can investigate:

The mean yield #\mu# of rice grown in a certain region is greater than #50# kg/m#^2#.
- #H_0:\mu \leq 50#
- #H_1:\mu>50#.
The proportion #p_2# of patients who survive after receiving a new treatment for breast cancer is larger than the proportion #p_1# who survive after receiving an established treatment.
- #H_0: p_2 - p_1 \leq 0#
- #H_1: p_2 - p_1 > 0#.
There is an association between the month of the year and the frequency of bicycle accidents.
- #H_0:# there is no association.
- #H_1:# there is an association.
Cholesterol levels among Dutch men over age #50# are not normally distributed.
- #H_0:# the distribution is normal.
- #H_1:# the distribution is not normal.

A hypothesis test typically involves the computation of some relevant test statistic #T# which summarizes the essential information extracted from the entirety of the data collected on a random sample into a single number. To proceed further one must know (or be able to approximate) the sampling distribution of #T# when #H_0# is assumed true, i.e., the null distribution of #T#.

Null Distribution

The null distribution indicates which range of values of the test statistic is consistent with #H_0#, and which range of values is extreme (relative to the center of the null distribution) and therefore inconsistent with #H_0#.

Using the null distribution, we are able to calculate the P-value of the test.

P-value

The P-value of a hypothesis test is the probability (if #H_0# is indeed true) of observing values of #T# as extreme as or more extreme than the value actually observed.

Interpreting P-values

If the P-value is not small, then the value of #T# observed is not unusual when #H_0# is true, so one would not find empirical support for rejecting #H_0#.

However, if the P-value is small (i.e., close to 0), then the value of #T# is in a range of values that would be considered extreme when #H_0# is true, which might then lead one to reject #H_0# and conclude that there is statistically significant support for #H_1#.

If the P-value is #0.43#, then when #H_0# is true there is a #0.43# probability of observing a value of #T# as extreme as or more extreme than the value of #T# that was computed from the data. This is not considered extreme, so the observed value of #T# is judged to be consistent with #H_0#, taking into account the sampling variability of #T#. The evidence against #H_0# is not statistically significant in this case.

However, if the P-value is #0.001#, then there is only a #1# in #1000# chance of observing values of #T# as extreme as or more extreme than that which was computed from the data when #H_0# is true, so the observed value of #T# is not consistent with #H_0#, and the observed value of #T# can best be explained by #H_1# being true. The evidence against #H_0# is statistically significant in this case.
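As a rough illustration of how a P-value is read off a null distribution, the sketch below assumes that the null distribution of #T# is standard normal and that large values of #T# count as extreme. Both are assumptions made for this example only, and the two observed values of #T# are hypothetical numbers chosen to give P-values close to the #0.43# and #0.001# discussed above.

```python
from statistics import NormalDist


def upper_tail_p_value(t_observed: float) -> float:
    """P(T >= t_observed) when the null distribution of T is N(0, 1) (assumed)."""
    return 1.0 - NormalDist().cdf(t_observed)


def two_sided_p_value(t_observed: float) -> float:
    """P(|T| >= |t_observed|) under the same assumed N(0, 1) null distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_observed)))


# Hypothetical observed values of the test statistic.
for t in (0.18, 3.09):
    print(f"T = {t:.2f}: upper-tail P-value = {upper_tail_p_value(t):.4f}")
```

For a different test statistic, only the null distribution (and therefore the function used to compute the tail probability) changes; the interpretation of the resulting P-value stays the same.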

Possible Outcomes of a Hypothesis Test

There are four possible outcomes of a hypothesis test:

Experimental Conclusion	Reality: #H_0# is true	Reality: #H_0# is false
Retain #H_0#	#\green{\text{True Negative}}#	#\red{\text{False Negative}}# (Type II error)
Reject #H_0#	#\red{\text{False Positive}}# (Type I error)	#\green{\text{True Positive}}#

#H_0# is true, and the researcher correctly does not reject #H_0#. This is a true negative.
#H_0# is true, but the researcher incorrectly rejects #H_0# and concludes #H_1#. This is called a Type I error, or a false positive.
#H_1# is true, but the researcher incorrectly does not reject #H_0#. This is called a Type II error, or a false negative.
#H_1# is true, and the researcher correctly rejects #H_0# and concludes #H_1#. This is a true positive.

Of course, since the researcher does not know which hypothesis is actually true, the researcher cannot know whether or not an error is made upon reaching a conclusion. So the conclusion must be made with strong consideration of the consequences of an error. For instance, if people could die as a result of an error, then extreme caution is warranted.

Usually, the danger lies in making a Type I error. So a researcher must decide what level of probability of a Type I error can be risked.

Significance Level

The threshold on the probability of a Type I error is called the significance level, and is denoted #\alpha#.

A Type I error occurs if the test statistic #T# falls into the range of values considered extreme when #H_0# is true, yet #H_0# is in fact true. Recall that the P-value is the probability, computed under the assumption that #H_0# is true, of observing a value of #T# as extreme as or more extreme than the value actually observed. A small P-value therefore signals that #T# has fallen into that extreme range.

Fixed-Level Testing

In order to make a decision regarding the null hypothesis of a test, researchers may use the following criterion:

If the P-value is #\leq \alpha#, reject #H_0# and conclude #H_1#; otherwise, do not reject #H_0#.

Using this criterion means that the probability of a Type I error is controlled: when #H_0# is true, the probability of rejecting it is at most the significance level #\alpha#, a value selected by the researcher before collecting data in consideration of the risk of such an error.

Using this criterion is called fixed-level testing.
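The claim that fixed-level testing controls the Type I error probability can be checked with a small simulation. The sketch below is only an illustration under assumptions that are not part of the text: the data are drawn from a normal distribution with known standard deviation #1#, the hypotheses are #H_0:\mu \leq 0# and #H_1:\mu > 0#, the true mean is #0# (so #H_0# is true), and the test statistic is the usual standardized sample mean.

```python
import random
from statistics import NormalDist, fmean


def reject_h0(p_value: float, alpha: float) -> bool:
    """Fixed-level decision rule: reject H0 exactly when P-value <= alpha."""
    return p_value <= alpha


def simulate_type_i_error_rate(alpha: float, n: int = 25,
                               trials: int = 20_000, seed: int = 1) -> float:
    """Fraction of simulated studies that reject H0 although H0 is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(trials):
        # True mean is 0, so H0: mu <= 0 holds in every simulated study.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = fmean(sample) * n ** 0.5          # sigma = 1 is assumed known
        p_value = 1.0 - std_normal.cdf(z)     # upper-tail P-value
        if reject_h0(p_value, alpha):
            rejections += 1
    return rejections / trials


rate = simulate_type_i_error_rate(alpha=0.05)
print(f"Observed Type I error rate: {rate:.3f}")
```

With these settings the observed rejection rate should land close to the chosen #\alpha#, up to simulation noise.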

Commonly Chosen Significance Levels

Most researchers choose #\alpha = 0.05#, and thus accept a #0.05# probability of a false positive. If the risk of a false positive is too serious, then #\alpha = 0.01# is often chosen.

A word of caution is in order, however, as too many researchers think of these values as gold standards, when in fact they are completely arbitrary. Any value can be justified, based on the seriousness of the risk of a Type I error. However, due to backlash against the over-use of #\alpha = 0.05# as if it had been decreed by some deity, many researchers now only report the P-value and allow the readers of their results to assess for themselves whether or not there is sufficient empirical evidence to support the research hypothesis.

In general, if the research hypothesis is making a stronger claim, then the threshold for statistical significance should be more difficult to reach. Thus if we would test #H_1:\mu \neq 0# at level #\alpha = 0.05#, then we should test #H_1:\mu > 0# at a smaller level, such as #\alpha = 0.025#, since claiming that the mean is positive is a stronger claim than claiming that the mean is non-zero.

The natural follow-up question is, can I also control the probability of a Type II error? This probability is represented by #\beta#, and #\beta# can be decreased by making it easier to reject #H_0#, i.e., by making the significance level #\alpha# larger. But this means increasing the probability of a Type I error! Hence, we have a dilemma.

Trade-off Between Type I and Type II Error Probabilities

With everything else held fixed, if we decrease the probability of one type of error, we increase the probability of the other type.
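The trade-off can be made concrete with a minimal sketch. It assumes a one-sided test about a normal mean with known standard deviation #\sigma#, for which #\beta = \Phi\big(z_{1-\alpha} - \delta\sqrt{n}/\sigma\big)#, where #\Phi# is the standard normal distribution function and #\delta# is the true distance of the mean from its null value. The function name and the numbers used are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def type_ii_error_probability(alpha: float, effect: float,
                              sigma: float, n: int) -> float:
    """beta for a one-sided z-test (assumed setting, sigma known)."""
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1.0 - alpha)   # rejection threshold for z
    shift = effect * n ** 0.5 / sigma              # mean of z when H1 holds
    return std_normal.cdf(critical_z - shift)


# Same sample size and effect throughout; only alpha changes.
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error_probability(alpha, effect=0.5, sigma=1.0, n=16)
    print(f"alpha = {alpha:.2f} -> beta = {beta:.3f}")
```

In this setting, each reduction of #\alpha# produces a larger #\beta#.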

Researchers choose the value of #\alpha#, and accept #\beta# as it is. But there is one other thing researchers have some control over that will cause #\beta# to decrease without affecting #\alpha#. If the variance of the test statistic #T# can be made smaller, then for any fixed choice of #\alpha# the value of #\beta# will decrease, because if the value of #T# falls far from the center of its null distribution then it is more likely due to the research hypothesis being true than due to the sampling variability of #T#.

The variance of #T# can be made smaller by increasing the sample size. The sample size is limited by constraints like available budget and time, but within those constraints a larger sample size means a lower probability of a Type II error.

The goal of a hypothesis test is to determine whether you can find convincing evidence to support your research hypothesis if that hypothesis is actually true.

Power

The probability of rejecting the null hypothesis #(H_0)# when the research hypothesis #(H_1)# is true is called the power of the test, which we represent with #\pi#.
\[\pi = P\Big(\text{reject} \ H_0 \ | \ H_1 \ \text{is true} \Big)\]
Recall that
\[\beta = P\Big(\text{do not reject} \ H_0 \ | \ H_1 \ \text{is true}\Big)\]
so
\[\pi = 1 - \beta\]

A researcher wants the power of the selected hypothesis test to be as close to #1# as possible for a selected significance level #\alpha#. As mentioned previously, increasing the sample size will decrease #\beta#, and thereby increase the power #\pi#.
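To see how the power grows with the sample size, the following sketch again assumes a one-sided test about a normal mean with known standard deviation, so that #\pi = 1 - \Phi\big(z_{1-\alpha} - \delta\sqrt{n}/\sigma\big)#. The setting, the function name, and the numbers are assumptions made for illustration; other tests have different power formulas.

```python
from statistics import NormalDist


def power_one_sided_z(alpha: float, effect: float,
                      sigma: float, n: int) -> float:
    """Power (1 - beta) of a one-sided z-test (assumed setting, sigma known)."""
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1.0 - alpha)
    return 1.0 - std_normal.cdf(critical_z - effect * n ** 0.5 / sigma)


# Fixed alpha and effect; only the sample size changes.
for n in (10, 25, 50, 100):
    power = power_one_sided_z(alpha=0.05, effect=0.3, sigma=1.0, n=n)
    print(f"n = {n:3d} -> power = {power:.3f}")
```

Note that when the effect is zero the formula returns #\alpha#, which is the probability of rejecting #H_0# at the boundary of the null hypothesis.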

Another factor that influences the power is the minimum effect size that the researcher hopes to find.

Effect Size

The effect size is the actual strength of the effect predicted by the research hypothesis.

If the effect size is larger, the power of the test will be higher.

A researcher usually has some idea what effect size is scientifically meaningful. Finding convincing evidence for a small effect size that is not meaningful in the scientific community, even if it confirms the research hypothesis, is generally not satisfactory.

For example, if a marathon runner announces that they have improved their time to run a marathon, it initially sounds like an achievement worth celebrating. But if the runner then clarifies that the improvement is #0.23# seconds, suddenly the achievement is not that impressive. But an improvement of #5# minutes would be.

Hence we want high power to detect a meaningful effect size, but lower power to detect trivial effect sizes.

A typical question a researcher will ask is:

“What is the minimum sample size I will need to achieve a power of #80\%# to detect an effect size of #2# if my significance level is #0.05#?”

Depending on the type of hypothesis test, the researcher can count on a good statistician to give a correct answer to the researcher’s question.
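For one simple case, such a question can be answered with a short calculation. The sketch below assumes a one-sided test about a normal mean with known standard deviation #\sigma#, for which the smallest adequate sample size is the first integer #n# with #n \geq \big((z_{1-\alpha} + z_{\pi})\,\sigma/\delta\big)^2#. The question above does not state a standard deviation, so the value #\sigma = 5# is purely hypothetical, and the effect size of #2# is interpreted here as a difference in means.

```python
from math import ceil
from statistics import NormalDist


def minimum_sample_size(effect: float, sigma: float,
                        alpha: float, power: float) -> int:
    """Smallest n reaching the target power in a one-sided z-test (assumed)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)
    z_power = std_normal.inv_cdf(power)
    return ceil(((z_alpha + z_power) * sigma / effect) ** 2)


def achieved_power(effect: float, sigma: float, alpha: float, n: int) -> float:
    """Power actually obtained with sample size n in the same setting."""
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1.0 - alpha)
    return 1.0 - std_normal.cdf(critical_z - effect * n ** 0.5 / sigma)


# sigma = 5 is a hypothetical value; it is not given in the text.
n_required = minimum_sample_size(effect=2.0, sigma=5.0, alpha=0.05, power=0.80)
print(f"Required sample size: {n_required}")
print(f"Power at that size: {achieved_power(2.0, 5.0, 0.05, n_required):.3f}")
```

A larger #\sigma#, a smaller effect size, a smaller #\alpha#, or a higher target power would each push the required sample size up.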

There are many different kinds of research hypotheses, some of which concern the true value of a population parameter associated with some variable. A test of such a hypothesis is called a parametric hypothesis test.

Hypotheses of a Parametric Test

A parametric hypothesis test usually has the following form for the null and research hypotheses about a population parameter #\theta#:

#H_0: \theta \in \ominus_0# (the parameter #\theta# is in the set #\ominus_0#)

#H_1: \theta \notin \ominus_0# (the parameter #\theta# is not in the set #\ominus_0#)

The set #\ominus_0# could consist of only one real number, or it could be an interval of real numbers.

If this hypothesis test is conducted at significance level #\alpha#, then it has a direct connection to a #(1 - \alpha)100\%# one-sided or two-sided confidence interval for #\theta#.

Connection Between Hypothesis Testing and Confidence Intervals

If the confidence interval lies entirely outside of the set #\ominus_0#, then the hypothesis test will conclude in favor of #H_1# at significance level #\alpha#, and vice versa.

However, if the confidence interval includes any part of #\ominus_0# then the hypothesis test will conclude in favor of #H_0# at significance level #\alpha#, and vice versa.

Note that we have to use a one-sided confidence interval if the alternative hypothesis involves the symbols #># or #<#. Only if the alternative hypothesis involves #\neq# do we use a two-sided confidence interval.

For example, suppose we are interested in the population proportion #p# for some characteristic, and we have the following hypotheses:
\[H_0: p = 0.6 \\
H_1: p \neq 0.6\]

which we plan to test at significance level #\alpha = 0.01#. In this example #\ominus_0 = \{0.6\}#.

If, after collecting the data, we compute a two-sided #99\%# CI for #p# such as #(0.47,0.56)#, then we would conclude in favor of #H_1# in the hypothesis test, since this #99\%# CI only includes values that are smaller than #0.6#.

But if, after collecting data, we compute a #99\%# CI for #p# such as #(0.53,0.62)#, then we would conclude in favor of #H_0# in the hypothesis test, since this #99\%# CI includes #0.6#.

However, we will see in a later section that this correspondence is only exactly true in this setting if the CI is computed in a different way from the way introduced in the previous chapter.

For another example, suppose we are interested in the population mean #\mu# for some variable, and we have the following hypotheses:
\[H_0:\mu \leq 100 \\
H_1: \mu > 100\]

which we test at significance level #\alpha = 0.05#. In this example #\ominus_0 = (-\infty,100]#.

If, after collecting data, we compute a one-sided #95\%# CI for #\mu# such as #(102.3,\infty)#, then we would conclude in favor of #H_1# in the hypothesis test, since this one-sided #95\%# CI only includes values that are larger than #100#.

But if, after collecting data, we compute a one-sided #95\%# CI for #\mu# such as #(95.4,\infty)#, then we would conclude in favor of #H_0# in the hypothesis test, since this one-sided #95\%# CI includes values that are smaller than #100#.
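The agreement between the one-sided test and the one-sided confidence interval in the last example can be checked numerically. The sketch below assumes a normal population with known standard deviation, so that the lower confidence bound is #\bar{x} - z_{1-\alpha}\,\sigma/\sqrt{n}#. The values #\sigma = 10#, #n = 50#, and the two sample means are hypothetical, chosen so that the bounds come out close to the intervals quoted above.

```python
from statistics import NormalDist


def one_sided_z_summary(sample_mean: float, mu_0: float, sigma: float,
                        n: int, alpha: float) -> tuple[float, float]:
    """Return (P-value, lower confidence bound) for H0: mu <= mu_0 (assumed z-test)."""
    std_normal = NormalDist()
    standard_error = sigma / n ** 0.5
    z = (sample_mean - mu_0) / standard_error
    p_value = 1.0 - std_normal.cdf(z)
    lower_bound = sample_mean - std_normal.inv_cdf(1.0 - alpha) * standard_error
    return p_value, lower_bound


# Hypothetical sample means for H0: mu <= 100 at alpha = 0.05.
for sample_mean in (104.6, 97.7):
    p_value, lower = one_sided_z_summary(sample_mean, mu_0=100.0,
                                         sigma=10.0, n=50, alpha=0.05)
    print(f"mean = {sample_mean}: CI = ({lower:.1f}, inf), "
          f"P-value = {p_value:.4f}, "
          f"reject by test = {p_value <= 0.05}, "
          f"CI excludes null set = {lower >= 100.0}")
```

In both cases the decision based on the P-value matches the decision based on whether the interval lies entirely above #100#.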

The following sections will focus on some specific settings in which we conduct parametric hypothesis tests, each of which corresponds to the settings for confidence intervals that were covered in the previous chapter.