Statistical Hypothesis Testing

Fundamentals of Statistical Hypothesis Testing

Scientific hypotheses that are to be tested statistically are always phrased in the form of a null hypothesis (written H_o). The null hypothesis is that there is no difference between the experimental and control groups (e.g., there are no differences due to sex or age, the drug is not poisonous, the treatment does not cure cancer, etc.). For example, in calculations involving the t or F tests, typically the null hypothesis is that there is no difference between the means of the two groups compared. We write:

H_o: µ₁ = µ₂

(the null hypothesis is that the mean (µ) of group 1 equals the mean of group 2).

Statistical tests are attempts to reject the null hypothesis. That is, we estimate the probability that the observed difference between two groups could have been obtained by chance alone. If this probability is less than some predetermined value called the significance level, usually 5% (or sometimes 1%) in biological studies, we say that we reject the null hypothesis. The result is then said to be statistically significant: this is the only correct usage of the term "significant" in science. We conclude that the experimental treatment, or the group difference, is biologically meaningful, and we proceed to analyze why this is so.
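One way to make "the probability that the observed difference could have been obtained by chance alone" concrete is a permutation test, sketched below. This is an illustration only, not a method named in the text: the function name permutation_p_value, the two small data sets, and the number of shuffles are all made up for the example. The idea is that if H_o: µ₁ = µ₂ is true, the group labels are arbitrary, so we shuffle them many times and count how often a difference at least as large as the observed one turns up.

```python
import random
from statistics import mean

def permutation_p_value(group_1, group_2, n_shuffles=10_000, seed=0):
    """Estimate the chance of a mean difference at least as large as observed,
    assuming the group labels are interchangeable (the null hypothesis)."""
    rng = random.Random(seed)
    observed = abs(mean(group_1) - mean(group_2))
    pooled = list(group_1) + list(group_2)
    n_1 = len(group_1)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:n_1]) - mean(pooled[n_1:]))
        if diff >= observed:
            hits += 1
    return hits / n_shuffles

# Hypothetical measurements, invented purely for illustration.
control = [4.1, 5.0, 4.8, 5.3, 4.6, 4.9]
treated = [5.9, 6.4, 5.7, 6.8, 6.1, 6.3]

ALPHA = 0.05  # 5% significance level
p = permutation_p_value(control, treated)
print(f"estimated p = {p:.4f}")
print("reject H_o" if p < ALPHA else "do not reject H_o")
```

With these invented values the two groups do not overlap at all, so very few shuffles reproduce so large a difference and the null hypothesis is rejected at the 5% level.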

Consider a simple, non-biological example. You are tossing pennies with a Mississippi riverboat gambler, and you wonder whether the game is rigged. The null hypothesis is that the probability of heads or tails is equal on each toss (H_o: p_H = p_T). The probability of either result is 1/2: thus, the probabilities of getting 1, 2, 3 or 4 heads in a row are 1/2, 1/4, 1/8, and 1/16, respectively. The probability of the last result is 6.25%, which is greater than the 5% significance value, so a run of four heads is not quite enough evidence to cause us to be suspicious of the coin. However, if five successive tosses turn up heads (p = 1/32 = 3.125%), the probability is less than the predetermined significance level. We reject the null hypothesis at the 5% significance level, and we begin to suspect that the coin is loaded.
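The arithmetic in the coin example can be checked with a few lines of Python. This is a minimal sketch: the names prob_all_heads and shortest_significant_run are invented for the illustration, and the 5% level is taken from the text.

```python
from fractions import Fraction

ALPHA = Fraction(5, 100)  # 5% significance level, as in the text

def prob_all_heads(n_tosses):
    """Probability of n heads in a row if the coin is fair (H_o is true)."""
    return Fraction(1, 2) ** n_tosses

def shortest_significant_run(alpha):
    """Smallest run of heads whose probability under H_o falls below alpha."""
    n = 1
    while prob_all_heads(n) >= alpha:
        n += 1
    return n

for n in range(1, 6):
    p = prob_all_heads(n)
    verdict = "reject H_o" if p < ALPHA else "do not reject H_o"
    print(f"{n} heads in a row: p = {p} ({float(p):.3%}) -> {verdict}")

print("shortest significant run:", shortest_significant_run(ALPHA))
```

Exact fractions are used rather than floating-point numbers so that the comparison with the 5% threshold involves no rounding.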

Thus, we make an evidence-based decision to accept or reject the null hypothesis at some pre-determined significance level. It is important to realize that this conclusion may or may not be correct. Our acceptance or rejection of an hypothesis, and the reality of the truth or falsity of the hypothesis, together create four possibilities, shown below.

Decision / Reality (null hypothesis is really:)	True	False
Accept null hypothesis	OK	Type II error
Reject null hypothesis	Type I error	OK

The 'OK's indicate two ways of being "right." We may correctly conclude that there is no significant difference (we accept a true null hypothesis: the coin is honest, and we decide that it is), or we may correctly conclude that there is a significant difference (we reject a false null hypothesis: the coin is loaded, and we have found this out). On the other hand, there are also two ways of being "wrong." We may incorrectly conclude that there is a difference (we reject a true null: the coin is really honest, and the gambler was simply "lucky"), or we may incorrectly conclude that there is no difference (we accept a false null: for example, the coin may be loaded to give only 10% more heads, and we didn't get enough evidence in a short game to prove this).

The first kind of mistake is referred to as "Type I error", and the latter as "Type II error". In experimental biology, we are ordinarily most concerned to reduce the probability of Type I error: we do not wish to conclude that there is a difference unless we are very sure of the evidence. For example, we don't want to waste time trying to explain a phenomenon that may have occurred by chance, or waste money developing a new drug that doesn't cure cancer. In biology, the pre-determined significance level typically sets the upper limit of Type I error at 5%: thus the proportion of errors of this type does not exceed one in twenty. [We are talking here about the conscientious biologist: the quack doctor sells his cancer cure if there is the slightest evidence that it makes a difference].
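The two kinds of error can be seen in a small simulation of the riverboat game. This is a sketch under stated assumptions: the decision rule is "reject the null hypothesis only if all five tosses come up heads" (the rule from the coin example), and the loaded coin is assumed to land heads 60% of the time, which is just one reading of "10% more heads". The function names, the seed, and the number of simulated games are arbitrary.

```python
import random

def rejects_null(p_heads, rng, n_tosses=5):
    """Play one short game; reject H_o only if every toss is heads."""
    return all(rng.random() < p_heads for _ in range(n_tosses))

def rejection_rate(p_heads, n_games, rng):
    """Fraction of simulated games in which H_o is rejected."""
    return sum(rejects_null(p_heads, rng) for _ in range(n_games)) / n_games

rng = random.Random(1)   # fixed seed so the run is repeatable
N_GAMES = 100_000
P_LOADED = 0.6           # assumed bias of the loaded coin (hypothetical)

# Type I error: the coin is honest, but we reject H_o anyway.
type_1_rate = rejection_rate(0.5, N_GAMES, rng)

# Type II error: the coin is loaded, but we fail to reject H_o.
type_2_rate = 1 - rejection_rate(P_LOADED, N_GAMES, rng)

print(f"Type I rate  (honest coin): {type_1_rate:.4f}  (exact: {0.5 ** 5:.4f})")
print(f"Type II rate (loaded coin): {type_2_rate:.4f}  (exact: {1 - P_LOADED ** 5:.4f})")
```

The simulated Type I rate stays below the 5% ceiling, while the Type II rate for so short a game is very high: five tosses are rarely enough to expose a slightly loaded coin.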

On the other hand, certain types of biomedical experiments may be equally or even more concerned with Type II error. A physician does not want to prescribe a cancer drug unless she is certain about its effectiveness (she wants to minimize Type I error). At the same time, the pharmaceutical company that manufactures the drug will not market it if there is any evidence that it causes birth defects (it wants to minimize Type II error). A significance level of 1% or lower might be set in the clinical trials for effectiveness against cancer; a significance level as high as 50% might be set in the teratogenicity tests.
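The trade-off between the two error types can be illustrated with an exact binomial calculation. The sketch below returns to the coin example rather than to a clinical trial, and every number in it is an assumption chosen for illustration: a game of 100 tosses, a one-sided test for too many heads, and a loaded coin that gives 60% heads. The function names are likewise invented. For each significance level, the code finds the head count needed to reject the null hypothesis and the resulting probability of each kind of error.

```python
from math import comb

def upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_count(n, alpha):
    """Smallest head count k with P(X >= k | fair coin) <= alpha."""
    for k in range(n + 2):
        if upper_tail(n, k, 0.5) <= alpha:
            return k

def error_rates(n, alpha, p_loaded):
    """Return (Type I rate, Type II rate) for the rule 'reject if heads >= k'."""
    k = critical_count(n, alpha)
    type_1 = upper_tail(n, k, 0.5)           # reject although the coin is fair
    type_2 = 1 - upper_tail(n, k, p_loaded)  # fail to reject a loaded coin
    return type_1, type_2

N_TOSSES = 100   # hypothetical length of the game
P_LOADED = 0.6   # hypothetical bias of the loaded coin

for alpha in (0.01, 0.05, 0.50):
    t1, t2 = error_rates(N_TOSSES, alpha, P_LOADED)
    print(f"alpha = {alpha:.2f}: Type I = {t1:.4f}, Type II = {t2:.4f}")
```

As the significance level is relaxed from 1% toward 50%, the Type I rate is allowed to rise and the Type II rate falls, which is why a test meant to catch a harmful effect may reasonably use a much higher significance level.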

As noted above, a "significant" result means only that it differs from that expected by chance alone: it does not mean that the difference is large. An observed result may be (statistically) significant without being of large (biological) magnitude. The ability of a test to detect a difference of a particular magnitude is called the power of the test, and is typically dependent on the sample size. With a small sample size, only relatively large differences can be shown to be significant, whereas with extremely large samples, even very small differences can be shown to be significant. For example, comparison of 20 weasel skulls from the island of Newfoundland with 20 from the mainland is sufficient to show that island animals are on average 10% larger and that the result is significant, whereas samples of nearly a thousand inshore and offshore codfish show that a genetic difference of less than 0.2% is significant. A practical consequence of the difference is that it is possible to assign a weasel skull to the correct population of origin with considerable accuracy, whereas success in re-assignment of a codfish to its source population is little better than chance.
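The dependence of power on sample size can be sketched with a textbook calculation. The example below assumes a two-sided z-test comparing two group means with equal sample sizes and a known, common standard deviation; this is a simplification chosen for illustration, not the analysis used for the weasel or codfish data. The effect sizes (differences expressed in standard-deviation units) and sample sizes are hypothetical, as is the function name z_test_power.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample z-test with equal group sizes.

    effect_size is the true difference between the group means divided by
    the (assumed known) common standard deviation.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the alternative, the test statistic is centred on this shift.
    shift = effect_size * sqrt(n_per_group / 2)
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

for effect_size in (1.0, 0.1):            # a large and a small difference
    for n_per_group in (20, 2000):
        power = z_test_power(effect_size, n_per_group)
        print(f"difference = {effect_size} SD, n = {n_per_group:>4} per group: "
              f"power = {power:.3f}")
```

Under these assumptions a large difference is detected reliably with 20 animals per group, whereas a small difference is almost always missed at that sample size and becomes detectable only when the samples run into the thousands.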