Cluster Analysis with unequal branch lengths


Pair-group methods such as UPGMA assume that branch lengths are equal; in the present application, this means that rates of evolutionary change are equal among lineages. When this assumption is violated, conventional cluster methods can give the wrong tree.

Consider five taxa (A, B, C, D, E) with the following distance matrix. Note that distances involving taxon C are unusually large:
 


       A     B     C     D     E
A      0     -     -     -     -
B     20     0     -     -     -
C     80    80     0     -     -
D     60    60   100     0     -
E     80    80   120    80     0

As before, A & B are closest (20 units): join them into one cluster (AB) at 20, and recalculate the average distances from (AB) to the other taxa.
 


        (AB)     C     D     E
(AB)      0      -     -     -
C        80      0     -     -
D        60    100     0     -
E        80    120    80     0

(AB) & D are now closest (60 units): join them into one cluster (ABD) at 60, and recalculate the average distances as before.
 


        (ABD)     C     E
(ABD)      0      -     -
C         90      0     -
E         80    120     0
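
The entries in this table can be reproduced by taking the simple mean of the two distances from the joined clusters, (AB) and D, to each remaining taxon. The notation d(X, Y) for the distance between clusters X and Y is introduced here only for illustration:

```latex
\begin{align*}
d\big((ABD),\,C\big) &= \tfrac{1}{2}\Big[d\big((AB),\,C\big) + d(D,\,C)\Big] = \tfrac{1}{2}(80 + 100) = 90 \\
d\big((ABD),\,E\big) &= \tfrac{1}{2}\Big[d\big((AB),\,E\big) + d(D,\,E)\Big] = \tfrac{1}{2}(80 + 80) = 80
\end{align*}
```

Weighting each cluster by the number of taxa it contains would instead give (80 + 80 + 100)/3, or about 86.7, for the distance to C, so the tabulated value of 90 appears to correspond to the simple two-cluster mean.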

E & (ABD) are now closest (80 units): join them into one cluster (ABDE) at 80, and recalculate the average distance. This gives:
 


         (ABDE)     C
(ABDE)      0       -
C         105       0

C joins the remaining taxa at 105. This completes the analysis.
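
The sequence of joins above can be retraced with a short Python sketch. It assumes that the distance from a new cluster to each remaining cluster is the simple mean of the two joined clusters' distances, since that choice reproduces the values 90 and 105 in the tables; the names `cluster`, `d`, and `joins` are illustrative and not part of the original material.

```python
from itertools import combinations

# Pairwise distances from the five-taxon matrix above.
d = {
    ("A", "B"): 20,
    ("A", "C"): 80, ("B", "C"): 80,
    ("A", "D"): 60, ("B", "D"): 60, ("C", "D"): 100,
    ("A", "E"): 80, ("B", "E"): 80, ("C", "E"): 120, ("D", "E"): 80,
}


def cluster(labels, distances):
    """Repeatedly join the closest pair; return a list of (cluster, level) joins."""
    # Key each distance by an unordered pair of cluster names.
    dist = {frozenset(pair): value for pair, value in distances.items()}
    active = list(labels)
    joins = []
    while len(active) > 1:
        # Find the closest pair among the current clusters.
        a, b = min(combinations(active, 2), key=lambda p: dist[frozenset(p)])
        level = dist[frozenset((a, b))]
        new = "".join(sorted(a + b))
        # Assumption: distance to the new cluster is the simple mean of
        # the two joined clusters' distances (not weighted by cluster size).
        for c in active:
            if c not in (a, b):
                dist[frozenset((new, c))] = (
                    dist[frozenset((a, c))] + dist[frozenset((b, c))]
                ) / 2
        active = [c for c in active if c not in (a, b)] + [new]
        joins.append((new, level))
    return joins


joins = cluster("ABCDE", d)
for name, level in joins:
    print(f"({name}) joins at {level:g}")
```

With this matrix the closest pair is unique at every step, so the order of joins does not depend on how ties are broken.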

The analysis suggests that C is the taxon least similar to the other four and, by implication, the one most distantly related to them (below, left), if the similarity among A, B, D, and E is taken to estimate their relationship to C. In fact, the evolutionary tree from which the data were derived (below, right) shows that C is most closely related to (AB), with which it shares the most recent common ancestor, but has evolved at twice the rate of the other taxa. Because the method's assumption of equal rates is violated, it gives the wrong answer here (below, left).
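
One way to see that the data violate the equal-rate assumption is to test the distance matrix directly. If all lineages evolved at the same rate, then for any three taxa the two largest of their three pairwise distances would be equal (the three-point, or ultrametric, condition). The following Python sketch, with the illustrative names `d` and `violations`, lists the triples that fail this check; it is a diagnostic added here as an assumption-laden illustration, not part of the original analysis.

```python
from itertools import combinations

# Pairwise distances from the five-taxon matrix above (keys in alphabetical order).
d = {
    ("A", "B"): 20,
    ("A", "C"): 80, ("B", "C"): 80,
    ("A", "D"): 60, ("B", "D"): 60, ("C", "D"): 100,
    ("A", "E"): 80, ("B", "E"): 80, ("C", "E"): 120, ("D", "E"): 80,
}


def violations(taxa, distances, tol=0):
    """Return the triples whose two largest pairwise distances differ by more than tol."""
    bad = []
    for x, y, z in combinations(taxa, 3):
        # Sort the three pairwise distances; only the two largest matter.
        _, middle, largest = sorted(
            [distances[(x, y)], distances[(x, z)], distances[(y, z)]]
        )
        if largest - middle > tol:
            bad.append(x + y + z)
    return bad


bad = violations("ABCDE", d)
print(bad)
```

For this matrix, every triple that fails the check includes taxon C, which is consistent with C being the lineage that evolved at a different rate.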

 

Example & text material © 2021 by Steven M. Carr