Fundamentals of Statistical Hypothesis Testing

    Biological hypotheses that are to be tested statistically are always phrased in the form of a null hypothesis (written Ho). The null hypothesis is that there is no difference between the experimental and control groups (e.g., there are no differences due to sex or age, the drug is not poisonous, the treatment does not cure cancer, etc.). For example, in calculations involving the t or F tests, the null hypothesis is typically that there is no difference between the means of the two groups compared. We write:

Ho: µ1 = µ2

(the null hypothesis is that the mean of group 1 equals the mean of group 2).

    Statistical tests are attempts to reject the null hypothesis. That is, we estimate the probability that the observed difference between two groups could have been obtained by chance alone. If this probability is less than some predetermined value (the significance level, usually 5% or sometimes 1% in biological studies), we say that we reject the null hypothesis. The result is then said to be statistically significant: this is the only correct usage of the term "significant" in biology. We conclude that the experimental treatment, or the group difference, is biologically meaningful, and we proceed to analyze why this is so. [Note that failure to reject the null hypothesis does not mean that it is true; see below].

    Consider a simple, non-biological example. You are tossing pennies with a Mississippi riverboat gambler, and you wonder whether the game is rigged. The null hypothesis is that the probability of heads or tails is equal on each toss (Ho: pH = pT). The probability of either result is 1/2: thus, the probabilities of getting 1, 2, 3 or 4 heads in a row are 1/2, 1/4, 1/8, and 1/16, respectively. The probability of the last result is 6.25%, which is greater than the 5% significance value, so a run of four heads is not quite enough evidence to cause us to be suspicious of the coin. However, if five successive tosses turn up heads (p = 1/32 = 3.125%), the probability is less than the predetermined significance level. We reject the null hypothesis at the 5% significance level, and we begin to suspect that the coin is loaded.
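The arithmetic of the coin-toss example can be sketched in a few lines of Python (the text itself contains no code; this is just an illustration of the decision rule):

```python
# Probabilities of n heads in a row with a fair coin, as in the riverboat
# example: under the null hypothesis Ho (pH = pT = 1/2), each extra head
# halves the probability of the run occurring by chance alone.
ALPHA = 0.05  # the pre-determined 5% significance level

for n in range(1, 7):
    p = 0.5 ** n  # P(n heads in a row) under Ho
    verdict = "reject Ho" if p < ALPHA else "fail to reject Ho"
    print(f"{n} heads in a row: p = {p:.5f} ({p * 100:.3f}%) -> {verdict}")
```

Four heads (p = 6.25%) is still above the 5% level, so we fail to reject Ho; five heads (p = 3.125%) falls below it, and Ho is rejected.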

    Thus, we make a decision to accept or reject the null hypothesis at some pre-determined significance level. It is important to realize that this conclusion may or may not be correct. Our acceptance or rejection of the hypothesis, combined with the reality of its truth or falsity, creates four possibilities, shown below.
 

Decision \ Reality      Ho is true          Ho is false
Accept                  OK                  Type II error
Reject                  Type I error        OK

    The 'OK's indicate two ways of being "right." We may correctly conclude that there is no significant difference (we accept a true null hypothesis: the coin is honest, and we decide that it is), or we may correctly conclude that there is a significant difference (we reject a false null hypothesis: the coin is loaded, and we have found this out). On the other hand, there are also two ways of being "wrong." We may incorrectly conclude that there is a difference (we reject a true null: the coin is really honest, and the gambler was simply "lucky"), or we may incorrectly conclude that there is no difference (we accept a false null: for example, the coin may be loaded to give only 10% more heads, and we didn't get enough evidence in a short game to prove this).
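A quick simulation makes the asymmetry concrete (a Python sketch; the five-tosses-all-heads decision rule comes from the example above, while the seed, trial count, and the loaded-coin value pH = 0.55, i.e. 10% more heads than a fair coin, are my own assumptions):

```python
import random

random.seed(1)  # fixed seed so the simulation is reproducible

TRIALS = 100_000

def rejects(p_heads):
    """One five-toss 'game': reject Ho only if all five tosses come up heads."""
    return all(random.random() < p_heads for _ in range(5))

# Honest coin (pH = 0.5): any rejection is a Type I error.
# The long-run rate should sit near 1/32 = 3.125%.
type_i = sum(rejects(0.5) for _ in range(TRIALS)) / TRIALS

# Coin loaded to give 10% more heads (pH = 0.55, an assumed value):
# every failure to reject is a Type II error.
type_ii = 1 - sum(rejects(0.55) for _ in range(TRIALS)) / TRIALS

print(f"Type I error rate (honest coin): {type_i:.3%}")
print(f"Type II error rate (pH = 0.55):  {type_ii:.3%}")
```

The honest coin is wrongly accused only about 3% of the time, but the weakly loaded coin escapes detection in the vast majority of such short games: exactly the situation in which we accept a false null hypothesis.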

    The first kind of mistake is referred to as "Type I error", and the latter kind as "Type II error". In experimental biology, we are ordinarily most concerned with reducing the probability of Type I error: we do not wish to state that there is a difference unless we are very sure of the evidence. For example, we don't want to waste time trying to explain a phenomenon that may have occurred by chance, or waste money developing a new drug that doesn't cure cancer. [We are talking here about the conscientious biologist: the quack doctor sells his cancer cure if there is the slightest evidence that it makes a difference]. In biology, our pre-determined significance level typically sets the upper limit on the probability of Type I error: thus the probability of this type of error does not exceed 5%.

    On the other hand, certain types of biomedical experiments may be equally or even more concerned with Type II error. A physician does not want to prescribe a cancer drug unless she is certain about its effectiveness (she wants to minimize Type I error). At the same time, the pharmaceutical company that makes the drug will not market it if there is any evidence that it causes birth defects (it wants to minimize Type II error). A significance level of 1% or lower might be set in the clinical trials for effectiveness against cancer; a significance level as high as 50% might be set in the teratogenicity tests.
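The trade-off can be illustrated with exact binomial calculations (a Python sketch; the trial size of 20 and the defect rates are hypothetical numbers chosen for illustration, not taken from the text): loosening the rejection threshold raises alpha (Type I) but shrinks beta (Type II), which is why a teratogenicity screen might tolerate a far higher significance level than a clinical-effectiveness trial.

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): exact upper-tail probability."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n = 20          # hypothetical number of animals per trial, not from the text
p_null = 0.10   # hypothetical background rate of birth defects (Ho)
p_real = 0.30   # hypothetical true rate if the drug is teratogenic

# For each rejection threshold k ("declare teratogenic if >= k of 20 defects"),
# alpha is the Type I rate under Ho and beta the Type II rate under p_real:
# loosening the threshold trades a larger alpha for a smaller beta.
for k in range(3, 8):
    alpha = binom_tail(n, k, p_null)
    beta = 1 - binom_tail(n, k, p_real)
    print(f"reject if >= {k}/20 defects: alpha = {alpha:.3f}, beta = {beta:.3f}")
```

Reading down the printed table, a strict threshold keeps alpha below 5% at the cost of a large beta, while a loose threshold (a significance level of 30% or more) makes it far harder to miss a genuinely teratogenic drug.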


Text material © 2007 by Steven M. Carr