Bio4900 - Phylogenetic Analysis of DNA Data

A "Natural Classification" will accurately reflect phylogeny
Classification should be a hypothesis of evolutionary relationships

Inferring the degree of evolutionary relationship
   How can we describe the position of each 'twig' with respect to all others?
       distance: amount of change between twigs
          How similar (or different) are species?
           phenetic distance: distance measured between tips
                    (i.e., "as the crow flies" from one twig to another)
         patristic distance: distance measured along connecting branches
                    (i.e., "as the ant runs" from one twig to another)
      relationship: pattern of connection between twigs
            How closely related are species?
          cladistic relationship: pattern of branching back to most recent common ancestor (MRCA)
                    (i.e., where do twigs join lower in tree?)

Phenetic (how similar are taxa?)
versus cladistic (how closely related are taxa?) criteria

   These criteria agree, iff rates of evolution are constant
      If evolutionary rates differ, closely related organisms may appear different
       Ex.: Crocodiles are more closely related to birds, but more similar to lizards
                 Crocodiles resemble lizards more than birds
                 because birds rapidly evolved specializations for flight

Phenetic analysis
Simplest measure is sequence similarity (S): the proportion of aligned sites that are identical
p-distance (as %) = (1 - S) x 100
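
The p-distance calculation can be sketched in a few lines. This is a minimal illustration, assuming S is a proportion (0 to 1) and the two sequences are already aligned without gaps; the function name `p_distance` is hypothetical.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Percent of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count sites where the two sequences carry different nucleotides
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    # Equivalent to (1 - S) x 100, with S = identical sites / total sites
    return 100 * differences / len(seq_a)
```

For example, two 10-bp sequences that differ at 2 sites have S = 0.8 and a p-distance of 20%.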

      Patterns of similarity can be inferred from UPGMA cluster analysis
         [Unweighted Pair Group Method, Arithmetic averaging],
            a Sequential Agglomerative Hierarchical Nesting (SAHN) algorithm
            [algorithm = a set of instructions for doing a repetitive task]
         In (n) x (n) matrix, join the most similar pair :
            re-calculate (n-1) x (n-1) matrix, re-join,
             and so on, until last pair is joined
      Results are shown as a phenogram:
         a diagram of phenetic similarity
      Similarity approximates evolutionary relationship under certain assumptions
           cladogram: a diagram of evolutionary relationships (a tree)

       UPGMA method assumes that rates of evolution are equal
            so branch tips "come out even" (contemporaneous)
             i.e., DNA sequences are assumed to evolve as a molecular clock
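
The join / re-calculate cycle described above can be sketched as follows. This is an illustrative sketch, not a reference implementation: the function name `upgma`, the list-of-lists distance matrix, and the nested-tuple output are all assumptions made for the example.

```python
def upgma(names, matrix):
    """Cluster taxa by UPGMA from a symmetric distance matrix.

    Returns the phenogram as a nested tuple, plus a list of
    (joined pair, node height) in the order the joins were made.
    """
    clusters = [(name, 1) for name in names]   # (subtree, number of tips)
    d = [row[:] for row in matrix]             # work on a copy
    joins = []
    while len(clusters) > 1:
        n = len(clusters)
        # Join the most similar pair: the smallest distance in the matrix
        i, j = min(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: d[ij[0]][ij[1]])
        (tree_i, size_i), (tree_j, size_j) = clusters[i], clusters[j]
        # Equal rates assumed: the node sits halfway, so tips come out even
        joins.append(((tree_i, tree_j), d[i][j] / 2))
        keep = [k for k in range(n) if k not in (i, j)]
        # Distance from the new cluster to each remaining cluster is the
        # arithmetic average over all tips (weighted by cluster size)
        new_row = [(d[i][k] * size_i + d[j][k] * size_j) / (size_i + size_j)
                   for k in keep]
        # Re-calculate the (n-1) x (n-1) matrix
        d = [[d[r][c] for c in keep] + [new_row[idx]]
             for idx, r in enumerate(keep)]
        d.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep]
        clusters.append(((tree_i, tree_j), size_i + size_j))
    return clusters[0][0], joins
```

With distances A-B = 2, A-C = B-C = 6, and D at 10 from all others, the joins are made at heights 1, 3, and 5: A+B first, then C, then D.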

Homework: Five practice problems in UPGMA phenogram calculations

     Some alternatives:
       Neighbour-Joining (NJ) analysis does not assume rate equality
                 (with UPGMA, large evolutionary rate differences lead to incorrect trees)
             NJ allows branch lengths proportional to change: tips come out uneven
               [algorithm joins nodes, rather than tips]
            This method is more realistic, but computationally harder
                    [see www.megasoftware.net for free software]
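
The first join of an NJ analysis can be sketched as below, using the standard NJ criterion Q(i,j) = (n-2) d(i,j) - (sum of row i) - (sum of row j). The function name `nj_first_join` and the matrix layout are assumptions for illustration; a full analysis repeats this step on the reduced matrix.

```python
def nj_first_join(names, d):
    """Pick the first pair joined by Neighbour-Joining and its branch lengths."""
    n = len(names)
    totals = [sum(row) for row in d]   # summed distance of each taxon to all others

    def q(ij):
        i, j = ij
        return (n - 2) * d[i][j] - totals[i] - totals[j]

    # The pair with the smallest Q is joined, not the pair with the smallest distance
    i, j = min(((i, j) for i in range(n) for j in range(i + 1, n)), key=q)
    # Branch lengths from each taxon to the new node need not be equal
    len_i = d[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    len_j = d[i][j] - len_i
    return names[i], names[j], len_i, len_j
```

In the five-taxon example used in the test data (a-b = 5, d-e = 3, and so on), NJ joins a and b with branch lengths 2 and 3, even though d and e are the closest pair by raw distance: the tips come out uneven.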

       Differential weighting of nucleotide substitutions
            accord greater 'significance' to 'important' changes
         Ex.: Kimura 2-parameter distance (K2P) model treats Ts & Tv separately
                K transition bias = [Transitions] / [Transversions] = [Ts] / [Tv]
                        There are twice as many kinds of transversions as transitions:
                            expected K = 0.5
                But: Tv are rare for close comparisons,
                                     more common for distant relationships
                K is variable according to the evolutionary problem under consideration:
                     K > 6 for close comparisons
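
Counting Ts and Tv separately, and the resulting K2P distance, can be sketched as follows. The sketch uses the standard K2P formula d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of sites showing transitions and transversions. The function names are hypothetical, and the code assumes aligned sequences containing only A, C, G, T.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def count_ts_tv(seq_a, seq_b):
    """Count transitions (Ts) and transversions (Tv) between aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        # Ts: purine <-> purine or pyrimidine <-> pyrimidine; anything else is Tv
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    return ts, tv

def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance (substitutions per site)."""
    n = len(seq_a)
    ts, tv = count_ts_tv(seq_a, seq_b)
    p, q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

The observed transition bias K is then simply Ts / Tv from `count_ts_tv`. The K2P distance is always at least as large as the uncorrected proportion of differences, because it corrects for multiple substitutions at the same site.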

Cladistic Analysis
   Principles of homology & analogy can be applied to nucleotide changes
         We rely only on shared derived (synapomorphic) nucleotide sites,
         & avoid shared ancestral (symplesiomorphic) nucleotide sites,
                       and changes unique to single taxa (autapomorphies),
                       and convergent nucleotides between unrelated taxa (homoplasies).

     Choice of preferred hypothesis is made on the Principle of Parsimony
          In general: parsimony means that the simpler hypothesis is to be preferred
            Ex.: to explain the occurrence of a complex structure in multiple species
                            it is more parsimonious to hypothesize that it has evolved only once
                            Therefore, the species have a common evolutionary origin
     Evolutionary parsimony:
                a hypothesis that requires fewer character changes is preferred
                In molecular systematics, these changes are nucleotide substitutions [SNPs]

      The "Four-Taxon Problem" and the "Three-Taxon Statement":
         Among four taxa A, B, C, & D, there are three hypotheses of relationship:
            either A is most closely related to B, or to C, or to D
      We want to be able to evaluate hypotheses of the form:
       "X and Y are more closely related to each other than either is to Z"
            The alternative hypotheses can be shown as networks with branches and an internode

Count number of changes at informative characters favouring each hypothesis
      The hypothesis with the "highest score" requires the fewest changes
         and is therefore the 'most parsimonious' explanation.
         This is also called the 'minimum length' or 'minimum spanning' solution.
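
The counting step for a four-taxon problem can be sketched as below. A site is informative only if two taxa share one nucleotide and the other two share a different one. The function name `four_taxon_support` and the hypothesis labels are assumptions for illustration.

```python
def four_taxon_support(a, b, c, d):
    """Count informative sites favouring each of the three four-taxon hypotheses."""
    support = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for w, x, y, z in zip(a, b, c, d):
        if w == x and y == z and w != y:
            support["AB|CD"] += 1      # pattern xxyy
        elif w == y and x == z and w != x:
            support["AC|BD"] += 1      # pattern xyxy
        elif w == z and x == y and w != x:
            support["AD|BC"] += 1      # pattern xyyx
        # Invariant sites and changes unique to one taxon are uninformative
    return support
```

The hypothesis with the highest count is the most parsimonious: every informative site that supports it needs one change on that network, but two changes on either alternative.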

[   Cladistic analyses may also be weighted: objective criteria exist for DNA data
            Ex.: Count Tv:Ts as 3:1 => Tv are 3x as meaningful
              or, count Tv only (Transversion parsimony) for "deep" analyses
              or, count 1st & 2nd position substitutions >> 3rd ]

Alternative search strategies are used with large numbers of taxa
    computational effort is linear wrt # of nucleotides
                                         hyperexponential wrt # of taxa
   # networks mounts up:
             for n = 4, 5, 6, 7 taxa, # networks = 3, 15, 105, 945
   # bifurcating rooted trees for m taxa = (2m-3)! / [2^(m-2) (m-2)!]
             ex.: if m = 10, # trees = 34,459,425
                    if m = 20, # trees = 8,200,794,532,637,891,559,375 = 8.2 x 10^21
Heuristic methods seek approximate solutions for computationally difficult problems
       Parable of the Near-Sighted Mountain Climber
       Branch & Bound Search
       Branch-Swapping methods
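
The growth in the number of networks and rooted trees can be checked directly. This sketch uses the standard counts for bifurcating trees, 1 x 3 x 5 x ... x (2n-5) unrooted and 1 x 3 x 5 x ... x (2m-3) rooted; the function names are hypothetical.

```python
from math import factorial

def odd_product(k):
    """1 x 3 x 5 x ... x k for odd k (returns 1 if k < 3)."""
    result = 1
    for f in range(3, k + 1, 2):
        result *= f
    return result

def unrooted_trees(n):
    """Number of unrooted bifurcating networks for n >= 3 taxa."""
    return odd_product(2 * n - 5)

def rooted_trees(m):
    """Number of rooted bifurcating trees for m >= 2 taxa."""
    return odd_product(2 * m - 3)

def rooted_trees_by_formula(m):
    """Same count from the factorial formula (2m-3)! / [2^(m-2) (m-2)!]."""
    return factorial(2 * m - 3) // (2 ** (m - 2) * factorial(m - 2))
```

Each added taxon multiplies the count by the next odd number, which is why exhaustive search becomes impractical beyond a modest number of taxa.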

Homework: Practice four-taxon problems

Statistical tests determine confidence in branching order
          Bootstrap Analysis: a re-sampling technique
                statistical tests usually involve replication / repetition of experiment
                this is inconvenient with DNA data: $$$
            Suppose existing data set (400bp) is an accurate sample of parametric data set (complete genome)
                  re-sample existing n sites with replacement 1000 times, repeat phylogenetic analysis each time:
                        how often do same clades / clusters appear?
                    ">50% bootstrap support" indicates group occurs more frequently than all others combined
                      95% criterion is desirable, not often obtained with small data sets
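
The re-sampling step can be sketched as follows. Sites (alignment columns) are drawn with replacement until the replicate has the same length as the original, and the analysis is repeated on each replicate. The names `bootstrap_alignment`, `bootstrap_support`, and the `analyse` argument are hypothetical; `analyse` stands for any tree-building step that returns a hashable result, such as the name of the best-supported grouping.

```python
import random
from collections import Counter

def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(seqs.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same columns are taken from every taxon, so sites stay aligned
    return {name: "".join(s[i] for i in cols) for name, s in seqs.items()}

def bootstrap_support(seqs, analyse, replicates=1000, seed=0):
    """Percent of bootstrap replicates in which each result of `analyse` appears."""
    rng = random.Random(seed)   # fixed seed so the sketch is repeatable
    counts = Counter(analyse(bootstrap_alignment(seqs, rng))
                     for _ in range(replicates))
    return {result: 100 * c / replicates for result, c in counts.items()}
```

A grouping returned in more than 50% of replicates occurs more often than all alternatives combined.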

Placing the root & Inferring the direction of evolutionary change
          With four taxa, there are four branches and one internode
          The most closely related "sister taxon" may occur on any of these
               Where does this "common ancestor" fit in the tree?

An evolutionary tree is a network with a root:
      The root indicates the relationship with the common ancestor
          A 'root' can be placed on any of the branches or the internode.
          So, there are five possible rooted trees for this unrooted network.
              All are equally parsimonious:
              not all place A & B as each other's closest relatives.
              Some of these make shared characters symplesiomorphic

      (1) Outgroup rooting:
         Include a taxon that is known to be less closely related
            to any of the ingroup taxa than they are to each other.
            Such a taxon is called an outgroup or sister taxon.
             Ex.: Lynx (Feloidea) is an outgroup to the Canoidea
                      (Note that this tree is equivalent to the NJ phenogram)
       Problematic in groups where relationships are uncertain (Ex.: Wolffish (Anarhichas))

      (2) Midpoint rooting:
         Place the root halfway between the two most different taxa.
            This assumes that molecular evolution is clock-like.
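
Locating the midpoint can be sketched as below: find the two most different taxa, then place the root at half their distance along the path that connects them. The function name `midpoint_root` and the dictionary of pairwise patristic distances are assumptions for illustration; walking the tree to the branch that contains that point is not shown.

```python
def midpoint_root(dist):
    """Return the two most different taxa and the root's distance from each.

    dist: dict mapping (taxon_a, taxon_b) -> patristic distance.
    """
    # The most different pair defines the longest path through the tree
    (a, b), d = max(dist.items(), key=lambda item: item[1])
    # Clock-like evolution is assumed, so the root lies halfway along that path
    return a, b, d / 2
```

For example, if A and C are the most different pair at distance 10, the root is placed 5 units from each of them.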