 The Chi-Square Test for Hardy-Weinberg Expectations

(1) For a locus with an A/G SNP polymorphism, given the observed genotype counts that total to 200, the observed frequency f(obs) of each genotype is (count) / (total). The count of alleles c(allele) is then two for each homozygote plus one for each heterozygote, for example c(A) = (2)(40) + 70 = 150. Noting that the number of alleles is twice the number of individuals, the frequency of alleles f(allele) is f(A) = 150 / 400 = 0.375. Then, f(G) = 250 / 400 = 0.625 or (1 - f(A)) = (1 - 0.375). Calculate the expected genotype frequencies from these observed allele frequencies: if p = f(A) and q = f(G), then p² = f(AA), 2pq = f(AG), and q² = f(GG). Expected genotype counts in each case are then (frequency) × (total observed).
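The arithmetic in (1) can be sketched in a few lines of Python. This is only an illustration, not the course workbook: the variable names are invented here, and the GG count of 90 is inferred from the stated total of 200 rather than quoted from the data.

```python
# Observed genotype counts. AA and AG are from the text;
# GG = 200 - 40 - 70 = 90 is inferred from the stated total (an assumption).
observed = {"AA": 40, "AG": 70, "GG": 90}
n = sum(observed.values())                      # 200 individuals

# Each homozygote carries two copies of an allele, each heterozygote one.
count_A = 2 * observed["AA"] + observed["AG"]   # 150
count_G = 2 * observed["GG"] + observed["AG"]   # 250

# There are twice as many alleles as individuals.
p = count_A / (2 * n)                           # f(A) = 0.375
q = count_G / (2 * n)                           # f(G) = 0.625

# Hardy-Weinberg expected genotype frequencies and counts.
expected_freq = {"AA": p * p, "AG": 2 * p * q, "GG": q * q}
expected_count = {g: f * n for g, f in expected_freq.items()}

for genotype, count in expected_count.items():
    print(genotype, round(count, 3))
```

The expected counts are left unrounded here; rounding them is a separate decision.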

The proportional deficiency of heterozygotes (F) can be calculated here from knowledge of the observed heterozygosity (0.350) and the expected heterozygosity, calculated as 2pq = (2)(0.375)(0.625) = 0.469. The deficiency is the shortfall of observed relative to expected, thus F = (0.469 - 0.350) / 0.469 = 0.253.
[A deficiency of heterozygotes is also called the Inbreeding Coefficient (F), if it is attributable to selective union of similar gametes, and (or) selective mating of similar genotypes. The math of this will be dealt with elsewhere].
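As a quick check of the heterozygote deficiency, a minimal sketch follows. The names het_obs, het_exp, and F are chosen here for illustration, and the counts are those of the two-allele example.

```python
# Two-allele example: 70 heterozygotes among 200 individuals, f(A) = 150 / 400.
n = 200
het_obs = 70 / n                 # observed heterozygosity = 0.350
p = 150 / 400                    # f(A) = 0.375
q = 1 - p                        # f(G) = 0.625
het_exp = 2 * p * q              # expected heterozygosity = 0.46875

# Proportional deficiency: shortfall of observed relative to expected.
F = (het_exp - het_obs) / het_exp
print(round(F, 3))
```

A positive F indicates fewer heterozygotes than expected; a negative value would indicate an excess.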

Calculate the Chi-Square deviation (X²) contributed by each genotype as the squared difference between the observed and expected counts, divided by the expected count. For AA, with an expected count of 28.125 (taken as 28 in the difference), (AAobs - AAexp)² / (AAexp) = (40 - 28)² / (28.125) = 5.120. The calculation is repeated for each genotype, and the Chi-Square value X² for the test is the sum over all genotypes, in this case 12.606.
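The summation can be sketched as a small function. This is an illustration under stated assumptions: the function name chi_square is invented here, the GG count of 90 is inferred from the total of 200, and the expected counts are left unrounded, so the result need not match a hand calculation with rounded expectations digit for digit.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Genotype order: AA, AG, GG. GG = 90 is inferred from the total (assumption).
observed = [40, 70, 90]
n = sum(observed)
p = (2 * observed[0] + observed[1]) / (2 * n)   # f(A)
q = 1 - p                                       # f(G)

# Unrounded Hardy-Weinberg expected counts.
expected = [p * p * n, 2 * p * q * n, q * q * n]

x2 = chi_square(observed, expected)
print(round(x2, 3))
```

Each term of the sum is one genotype's contribution, so printing the terms separately shows which class drives the deviation.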

To evaluate the statistical significance of this value, it is necessary to know the number of degrees of freedom (df) in the experimental data, which is reported and evaluated along with the result. In general, the df in any experiment is one less than the number of categories compared. The principle is that, if you know you looked at n experimental results that could have fallen into any of three categories a, b, c, the value of the first category a can be anything (up to n), and the value of the second category b can be anything up to (n - a). Having determined a and b, the third value is now pre-determined: c = (n - a - b). So, only two of the three values are free to vary.

For diploid genotypic data with two alleles A & G and three genotypes AA, AG, & GG, this might suggest df = 2. However, from first principles and on reflection, if we know q = f(G), then f(A) = (1-q), and f(AA) = (1-q)², f(AG) = (2)(1-q)(q) and f(GG) = q², so that the expected values of the three genotype categories are pre-determined by the value of either one of the allele frequencies, which pre-determines the other. One further degree of freedom is therefore lost in estimating the allele frequency from the data, and df = 1. In general, for nucleotide or other allelic data tested against Hardy-Weinberg expectations, df = (number of genotype classes) - (number of alleles); here, 3 - 2 = 1.
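For the two-allele case, where df = 1, the significance of an X² value can be obtained without a printed table. The sketch below relies on the fact that a Chi-Square variable with one degree of freedom is the square of a standard normal variable, so its upper-tail probability is erfc(sqrt(X² / 2)). The function name p_value_df1 is invented here, and the shortcut applies only to df = 1.

```python
import math

def p_value_df1(x2):
    """Upper-tail probability of a Chi-Square value with df = 1.

    Valid only for one degree of freedom: X2 is then the square of a
    standard normal deviate, so P = erfc(sqrt(X2 / 2)).
    """
    return math.erfc(math.sqrt(x2 / 2.0))

# The conventional 5% critical value for df = 1 is about 3.841.
print(p_value_df1(3.841))

# The X2 value reported for the two-allele example.
print(p_value_df1(12.606))
```

For more than one degree of freedom, a Chi-Square table or a statistics library is needed instead.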

(2) The same principles apply to calculations for a three-allele, C / A / T SNP polymorphism. Note that there are now three heterozygote classes, and that the count of each allele involves two of them (for example, C occurs in both CA and CT heterozygotes). Note also that the calculation of heterozygosity can be done directly by adding the frequencies of the (three) heterozygote classes, or by adding the frequencies of the (three) homozygote classes and subtracting the total from unity. As the number of alleles increases, it becomes computationally more efficient to use the latter calculation.
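The two routes to expected heterozygosity can be compared in a short sketch. The genotype counts below are hypothetical, invented purely for illustration, as are the variable names.

```python
from itertools import combinations

# Hypothetical genotype counts for a C / A / T locus (not real data).
genotypes = {"CC": 30, "AA": 20, "TT": 10, "CA": 50, "CT": 40, "AT": 50}
n = sum(genotypes.values())

# Count alleles: each genotype label contributes one copy per letter,
# so a homozygote adds two copies and a heterozygote one of each.
allele_counts = {"C": 0, "A": 0, "T": 0}
for genotype, count in genotypes.items():
    for allele in genotype:
        allele_counts[allele] += count

freq = {a: c / (2 * n) for a, c in allele_counts.items()}

# Route 1: add the expected frequencies of the heterozygote classes.
het_direct = sum(2 * freq[a] * freq[b] for a, b in combinations(freq, 2))

# Route 2: add the expected homozygote frequencies and subtract from unity.
het_by_subtraction = 1 - sum(f * f for f in freq.values())

print(round(het_direct, 5), round(het_by_subtraction, 5))
```

With k alleles, the first route needs k(k-1)/2 products and the second only k squares, which is why the subtraction becomes the more efficient choice as k grows.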

The Biol4250 Excel workbook includes a spreadsheet for calculation of SNP polymorphism for 2, 3, or 4 alleles.

Three further points.

(3) Calculation of expected genotype counts from observed frequency data often results in expectations of non-integral numbers. Avoid 'fractional individuals'. If we were testing for a 3:1 genotypic ratio among 17 individuals, we cannot expect to see 12.75 and 4.25, so we round to the closest integer, here 13 and 4, which still adds to 17. This is applicable to the multiple-category data above. A related problem arises when, among 18 individuals, calculated expectations are 13.5 and 4.5: if we round both down to 13 and 4, we are shy one expected and the test calculation is biased. One convention is to round one or the other expectation against the trend seen in the data, that is, toward the observed count, so that the deviation is reduced. If we have a 3:1 hypothesis and observe 15 & 3, we round the expectations to 14 & 4. If we observed 12 & 6, we round the expectations to 13 & 5. This is not 'massaging' the data: the rounding reduces the likelihood of obtaining a significant result due to a computational bias, and increases the confidence in the result.
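The rounding convention in (3) can be sketched for the two-class case. The function name round_two_class is invented here, and the sketch covers only two categories; extending it to more classes would need a further rule for which classes absorb the adjustment.

```python
import math

def round_two_class(observed, expected):
    """Round two expected counts to integers that still sum to the total.

    Ordinary expectations are rounded to the nearest integer. An exact
    .5 tie is rounded toward the observed count, which reduces the
    deviation and so works against a spuriously significant result.
    """
    total = sum(observed)
    e1 = expected[0]
    fraction = e1 - math.floor(e1)
    if abs(fraction - 0.5) < 1e-9:
        # Tie: round the first class toward its observed value.
        r1 = math.ceil(e1) if observed[0] > e1 else math.floor(e1)
    else:
        r1 = math.floor(e1 + 0.5)
    # The second class takes the remainder, so the total is preserved.
    return r1, total - r1

print(round_two_class((15, 3), (13.5, 4.5)))
print(round_two_class((12, 6), (13.5, 4.5)))
print(round_two_class((14, 3), (12.75, 4.25)))
```

The observed pair (14, 3) in the last call is a hypothetical example for the 17-individual case; only the expectations 12.75 and 4.25 come from the text.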

(4) Chi-square calculations must be performed with count data, not frequencies or percentages. Because the test squares the magnitude of the deviation, X² values are heavily influenced by the absolute magnitude of the numbers. For 0.6 & 0.4 observed versus 0.5 & 0.5 expected, X² = (0.1)² / 0.5 + (-0.1)² / 0.5 = 0.02 / 0.5 = 0.04 ns, whereas with 60 & 40 versus 50 & 50 expected, X² = (10)² / 50 + (-10)² / 50 = 200 / 50 = 4.0*, and with 600 & 400 versus 500 & 500 expected, X² = (100)² / 500 + (-100)² / 500 = 20,000 / 500 = 40.0***. The proportional deviation is the same in each case (20%), but the actual deviation is squared, and so contributes much more strongly to X² as n increases. This is also a reminder that larger sample sizes produce experiments with greater power.
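The three comparisons in (4) can be reproduced with the same formula applied at each scale. This is a minimal sketch; the function name chi_square is chosen here for illustration.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The same 60:40 versus 50:50 comparison as frequencies and as counts.
cases = [
    ([0.6, 0.4], [0.5, 0.5]),
    ([60, 40], [50, 50]),
    ([600, 400], [500, 500]),
]

results = [chi_square(obs, exp) for obs, exp in cases]
for (obs, exp), x2 in zip(cases, results):
    print(obs, exp, round(x2, 2))
```

Multiplying every count by a constant multiplies X² by the same constant, which is why the frequencies alone cannot be tested.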

(5) Pearson's Chi-square test procedure as given above is the classical method taught for genotypic comparisons of this sort. It is now appreciated that this test over-estimates the significance of any observed deviation, especially for small cell values. The G-Test can be used for the same purpose, and is included in the Biol4250 Excel workbook. Each cell value X in the two-column data matrix is transformed as X' = X ln X: the test is a Row-by-Column (RxC) Test of Independence on the transformed data. The significance of the test is obtained from the Chi-Square table. The G-test has a number of advantages.
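A sketch of an RxC G statistic built from the X ln X transformation follows. It uses the standard textbook form, G = 2 [sum of cell X ln X - sum of row-total X ln X - sum of column-total X ln X + N ln N], and it is an assumption that the workbook computes the same quantity. The function names and the 2 x 2 table are hypothetical.

```python
import math

def xlnx(x):
    """The X ln X transformation, with 0 ln 0 taken as 0."""
    return x * math.log(x) if x > 0 else 0.0

def g_statistic(table):
    """G for an RxC test of independence, from X ln X sums."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    cells = sum(xlnx(x) for row in table for x in row)
    rows = sum(xlnx(r) for r in row_totals)
    cols = sum(xlnx(c) for c in col_totals)
    return 2.0 * (cells - rows - cols + xlnx(n))

# Hypothetical 2 x 2 table of counts (not real data).
table = [[30, 10],
         [20, 40]]

g = g_statistic(table)
df = (len(table) - 1) * (len(table[0]) - 1)
print(round(g, 3), df)
```

G is then compared with the Chi-Square table at (rows - 1)(columns - 1) degrees of freedom.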

Text material © 2019 by Steven M. Carr