Evolutionary Genetic Analysis with DNA data

Principles of Phylogenetic Systematics & Classification

"Natural Classification" accurately reflects phylogeny
          Classification is a hypothesis of evolutionary relationships

Inferring the nature of evolutionary relationship
   In a "tree," how to describe position of each 'twig tip' with respect to any / all others?
       Distance: amount of evolutionary change between twigs
          Or: How similar (close) are they?
              phenetic: distance measured between tips
                           "As the crow flies" from one twig to another
              patristic: distance measured along connecting branches
                          "As the ant runs" from one twig to another
      Relationship: pattern of connection between twigs
           How closely related are species?
               cladistic relationship: pattern of branching back to
                        Most Recent Common Ancestor
                        How do twigs join lower in tree?

   Phenetic (how similar are taxa?)
      versus cladistic (how closely related are taxa?) criteria

   Criteria agree, iff rates of evolution are constant
      If evolutionary rates differ, closely related organisms may appear dissimilar
       Ex.: Crocodiles more closely related to birds, but more similar (?) to lizards
                 Crocodilia resemble Squamata more than Aves (hence, Class Reptilia)
                 because avian ancestor(s) rapidly evolved specializations for flight

    Theoretical & technical breakthroughs late 1960s ~ 1980s:
        Theory of Phylogenetic Systematics formalized
        Molecular data (allozymes & DNA) replace morphology as primary data for phylogenetic inference
                Computational power increases
                DNA sequencing capacity increases

      ***Patterns of evolutionary relationship to be understood from molecular data;
                Patterns of organismal evolution to be understood from relationships***

Phenetic analysis with SNP DNA data
      Simplest measures:  # pairwise differences (p)
                                        % sequence similarity
                                        p-distance = (1 - S)
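
The simplest measures above can be sketched in a few lines of Python. This is a minimal illustration, assuming two pre-aligned sequences of equal length; the function name `p_distance` and the two 10-base sequences are hypothetical, not part of the course data.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two sequences differ (p = 1 - S)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count the pairwise differences, site by site
    differences = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a != b)
    return differences / len(seq_a)


# Two hypothetical sequences that differ at 2 of 10 sites
seq_1 = "ACGTACGTAC"
seq_2 = "ACGTTCGTAA"

p = p_distance(seq_1, seq_2)   # proportion of differing sites
similarity = 1 - p             # proportion of identical sites (S)
print(f"p-distance = {p:.2f}, similarity = {similarity:.0%}")
```

Applying the function to every pair of taxa fills in a distance matrix of the kind shown for the Great Apes.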

mtDNA distance matrix for Great Apes
                                         HOMEWORK: 5x5 Ape matrix

      Patterns of similarity inferred from UPGMA cluster analysis
         [Unweighted Pair Group Method, Arithmetic averaging],
            Sequential Agglomerative Hierarchical Nesting (SAHN) algorithm
           algorithm: set of instructions for repetitive task
                 In (n) x (n) matrix, join most similar pair:
                  re-calculate (n-1) x (n-1) matrix, re-join,
                     & so on, until last pair joined
      Clustering results shown as phenogram:
                   diagram of phenetic similarity
                  Similarity estimates relationships under certain assumptions

       UPGMA method assumes rates of evolution equal
            so branch tips "come out even" (contemporaneous)
            DNA sequences evolve as stochastic "Molecular Clock"
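
The join / re-calculate / re-join cycle can be sketched as follows. This is an illustrative sketch, not the routine used by any particular software package: the function name `upgma`, the dictionary-of-pairs input format, and the four-taxon distances are all assumptions made for the example.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> distance between labels a and b
    Returns a nested tuple (left, right, height); height is the depth of the
    join, i.e. half the average distance between the two joined clusters.
    """
    sizes = {name: 1 for name in names}   # active cluster -> number of tips
    d = dict(dist)                        # working copy of the matrix
    while len(sizes) > 1:
        # Join the most similar (least distant) pair of clusters
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b, d[frozenset((a, b))] / 2)
        # Re-calculate the matrix: distance from the new cluster to each
        # remaining cluster is the average over all tips in a and b
        for c in sizes:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))]
                    + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


# Hypothetical distances among four taxa
pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
         ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
matrix = {frozenset(k): v for k, v in pairs.items()}
print(upgma(["A", "B", "C", "D"], matrix))
```

Because every join is placed at half the inter-cluster distance, all tips end up equidistant from the root, which is the "come out even" assumption stated above.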

       HOMEWORK: Practice problems for UPGMA phenogram calculations

Alternative phenetic methods
       Neighbor-Joining (NJ) analysis does not assume rate equality
                 with UPGMA, large rate differences lead to incorrect trees
             NJ allows branch lengths proportional to change: tips come out uneven
                 algorithm joins nodes, rather than tips
            More realistic, computationally harder
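
The first step of Neighbor-Joining can be sketched to show how it differs from joining the closest pair. This sketch covers only the first join (a full analysis repeats the step on a reduced matrix); the function name `nj_first_join` and the five-taxon distances are hypothetical.

```python
def nj_first_join(names, matrix):
    """Return the first pair joined by Neighbor-Joining, with branch lengths.

    names:  list of taxon labels
    matrix: symmetric list-of-lists of pairwise distances (needs > 2 taxa)
    """
    n = len(names)
    totals = [sum(row) for row in matrix]   # summed distance of each taxon
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Rate-corrected criterion: raw distance adjusted by how far
            # each taxon is, on average, from everything else
            q = (n - 2) * matrix[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Branch lengths from each taxon to the new node may be unequal
    len_i = matrix[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    len_j = matrix[i][j] - len_i
    return (names[i], len_i), (names[j], len_j)


# Hypothetical distance matrix for five taxa
taxa = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_first_join(taxa, dist))
```

In this matrix the smallest raw distance is between d and e, so UPGMA would join them first. NJ instead joins a and b, and assigns them unequal branch lengths.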

       Differential weighting of nucleotide substitutions
            accord greater 'significance' to 'important' changes
         Ex.: Kimura 2-parameter (K2P) model treats Transitions (Ts) & Transversions (Tv) differently
                Transition Bias (K) = [Ts] / [Tv]
                       Twice as many kinds of Tv as Ts: expect K = 0.5 if all substitutions equally likely
                But: Tv rare for close comparisons,
                            more common for distant relationships
                K variable according to evolutionary problem under consideration:
                     K > 10 for close comparisons, K ~ 3 for moderate comparison
                    Tv-only for distant comparisons
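
Counting Ts and Tv separately, and the K2P distance that results, can be sketched as below. The distance uses the standard K2P formula d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of sites showing transitions and transversions. The function names and the two 20-base sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq_a, seq_b):
    """Return (transitions, transversions, sites compared) for two aligned sequences."""
    ts = tv = sites = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue                      # skip gaps and ambiguity codes
        sites += 1
        if a == b:
            continue
        # Transition: purine <-> purine or pyrimidine <-> pyrimidine
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    return ts, tv, sites


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned sequences."""
    ts, tv, sites = count_ts_tv(seq_a, seq_b)
    p, q = ts / sites, tv / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Hypothetical sequences with 3 transitions and 1 transversion in 20 sites
seq_1 = "AAAAACCCCCGGGGGTTTTT"
seq_2 = "GAAAATCCCCAGGGGATTTT"

ts, tv, sites = count_ts_tv(seq_1, seq_2)
print(f"Ts = {ts}, Tv = {tv}, Ts/Tv = {ts / tv:.1f}")
print(f"p-distance = {(ts + tv) / sites:.3f}, K2P = {k2p_distance(seq_1, seq_2):.3f}")
```

The K2P distance is larger than the uncorrected p-distance because the model allows for multiple substitutions at the same site.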

Cladistic Analysis with SNP data

   Principles of homology & analogy applied to nucleotide changes
         Rely only on shared derived (synapomorphic) SNPs,
                 avoid shared ancestral (symplesiomorphic) SNPs,
                          SNPs unique to single taxa (autapomorphies),
                          convergent nucleotides between unrelated taxa (homoplasies).

     Choice of preferred hypothesis made on Principle of Maximum Parsimony
            Parsimony: simpler hypothesis preferred
            Ex.: If complex trait occurs in multiple species,
                         more parsimonious to hypothesize it evolved only once
                    => Trait evolved in single common ancestor

             Ex.: Evolution of ice-breeding in Phocidae ("True" seals),
                        from ecological & molecular parsimony perspectives

           Evolutionary parsimony:
                Hypothesis that requires fewer character changes preferred
                In molecular systematics, count SNP changes  

      "Four-Taxon Problem" & "Three-Taxon Statement":
         Four taxa A, B, C, & D have three hypotheses of relationship:
            A most closely related to B, or C, or D
Three Networks
         Evaluate alternative hypotheses as:
       "X and Y are more closely related to each other than either is to Z"
         Alternative hypotheses shown as networks with branches & internode

Count changes at informative SNPs that favor each hypothesis
      Hypothesis with fewest changes is Maximum Parsimony explanation:
         AKA 'Minimum Length' or 'Minimum Spanning' solution
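
The counting step for the four-taxon problem can be sketched as follows. An informative site is one where two taxa share one base and the other two share a different base: it needs one change on the network that groups the matching pairs and two changes on either alternative. The function name and the four 8-base sequences are hypothetical.

```python
def score_four_taxon(a, b, c, d):
    """Count informative sites that favor each of the three possible networks."""
    support = {"(A,B),(C,D)": 0, "(A,C),(B,D)": 0, "(A,D),(B,C)": 0}
    for w, x, y, z in zip(a, b, c, d):
        if w == x and y == z and w != y:
            support["(A,B),(C,D)"] += 1
        elif w == y and x == z and w != x:
            support["(A,C),(B,D)"] += 1
        elif w == z and x == y and w != x:
            support["(A,D),(B,C)"] += 1
        # Invariant sites and sites with a single unique base are uninformative
    return support


def network_lengths(support):
    """Changes required at informative sites: 1 if favored, 2 otherwise."""
    informative = sum(support.values())
    return {net: 2 * informative - n for net, n in support.items()}


# Hypothetical aligned sequences for taxa A, B, C, D
seq_a = "ACGTACGT"
seq_b = "ACGCACGA"
seq_c = "ATATACTA"
seq_d = "ATACGCTT"

support = score_four_taxon(seq_a, seq_b, seq_c, seq_d)
lengths = network_lengths(support)
print(support)
print(lengths)
print("Maximum Parsimony network:", min(lengths, key=lengths.get))
```

Uninformative sites add the same number of changes to all three networks, so leaving them out does not change which network is shortest.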

    Cladistic analyses can be weighted: objective criteria exist for DNA data
            Ex.: Count Tv:Ts as 3:1 => Tv are 3x as 'informative'
              or, count Tv only (Transversion parsimony) for "deep" analyses
              or, weight 1st & 2nd codon position substitutions >> 3rd: most are replacement substitutions
HOMEWORK: What triplets are exceptions & why?

Alternative search strategies for large numbers of taxa
    Why not write out all possible trees, identify shortest?
    Because: Computational effort linear wrt # nucleotides,
                                                     hyper-exponential wrt # taxa
       # networks mounts up:
             for t = 4, 5, 6, 7 taxa, # networks = 3, 15, 105, 945
       # bifurcating rooted trees for t taxa = $(2t-3)! \,/\, [2^{t-2}\,(t-2)!]$
             ex.: if t = 10, # rooted trees = 34,459,425 (# unrooted networks = 2,027,025)
                  if t = 21, # trees = $3.198 \times 10^{23}$: about half of Avogadro's Number
                                [HOMEWORK: calculate the exact number]
                  if t = 52, # trees > Eddington's Number, # of protons in observable Universe ($\sim 10^{80}$)
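
The growth in the number of trees can be checked directly. This is a small sketch using the formula (2t-3)! / [2^(t-2) (t-2)!] for rooted bifurcating trees, together with the fact that the number of unrooted networks for t taxa equals the number of rooted trees for t - 1 taxa. The function names are hypothetical.

```python
from math import factorial


def rooted_trees(t: int) -> int:
    """Number of rooted bifurcating trees for t taxa (t >= 2)."""
    if t < 2:
        raise ValueError("need at least two taxa")
    return factorial(2 * t - 3) // (2 ** (t - 2) * factorial(t - 2))


def unrooted_networks(t: int) -> int:
    """Number of unrooted bifurcating networks for t taxa (t >= 3)."""
    if t < 3:
        raise ValueError("need at least three taxa")
    return rooted_trees(t - 1)


for t in (4, 5, 6, 7, 10):
    print(f"t = {t:2d}: {unrooted_networks(t):>12,} networks, "
          f"{rooted_trees(t):>12,} rooted trees")

# Larger cases, shown in scientific notation
for t in (21, 52):
    print(f"t = {t}: about {float(rooted_trees(t)):.3e} rooted trees")
```

Python integers have arbitrary precision, so the exact counts are available even where the values far exceed floating-point range.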

  Heuristic methods seek approximate solutions
            for computationally difficult (impossible) problems
       Parable of Near-Sighted Mountain Climber

       Branch & Bound Search
       Branch-Swapping methods

Rooting a Tree:
Inferring direction of evolutionary change

  Evolutionary trees are networks with roots
With four taxa, network has four branches & one internode
      Root indicates relationship with common ancestor
          'root' can be placed on any branch or internode
          Thus five possible rooted trees (cladograms) for four-taxon network
              All equally parsimonious:
                not all place A & B as each other's closest relatives
                Some make shared SNPs symplesiomorphic

      Outgroup rooting
         Include taxon known to be less closely related
            to any ingroup taxon than they are to each other
            Call this an outgroup
                 Ex.: Use feliform as outgroup to caniform problem
                        Note cladistic tree has same topology as NJ phenogram
                 Ex.: Wolffish (Anarhichas): Johnstone et al. (2007)

      Midpoint rooting
         Place root halfway between two most divergent taxa
            Assumption: molecular evolution is clock-like

HOMEWORK: Practice four-taxon cladistic problems

Maximum Likelihood analysis

    Different approach to evolutionary trees, based on probability of data given tree & model (cf. Bayes' Theorem)
    Optimization criterion: How to choose 'correct' solution

        Phenetic methods look for shortest tree
        Cladistic methods look for minimum number of events
        Likelihood methods look for most probable tree ("least unlikely" = "maximally likely"),
                given a priori model of evolutionary events

        E.g., given estimates of all possible SNP rates among A, C, G, & T (n = 12)
             Calculate probability of simultaneous occurrence
                    of all events necessary to produce any particular tree
             Any particular tree is (extremely) unlikely,
                    but some tree is least unlikely ( = maximally likely)
                    Ratio of likelihoods expresses how much better wrt any other
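
The idea of a "least unlikely" solution and a ratio of likelihoods can be illustrated with the simplest possible case. This sketch makes two loudly stated simplifications: it uses the one-parameter Jukes-Cantor (JC69) model, in which all 12 substitution rates are equal, rather than a model with 12 separate rates, and it evaluates a single branch connecting two sequences rather than a full tree. The function names and the numbers (100 sites, 10 differences) are hypothetical.

```python
import math


def jc69_log_likelihood(n_sites, n_diff, d):
    """Log-likelihood of two aligned sequences separated by d substitutions
    per site, under the Jukes-Cantor model (requires d > 0)."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e    # probability a site shows the same base
    p_diff = 0.25 - 0.25 * e    # probability of one specific different base
    n_same = n_sites - n_diff
    # Each site: probability of the first base (1/4) times the probability
    # of the observed outcome; sites are treated as independent
    return (n_sites * math.log(0.25)
            + n_same * math.log(p_same)
            + n_diff * math.log(p_diff))


def jc69_distance(p):
    """Branch length that maximizes the likelihood for observed p-distance p."""
    return -0.75 * math.log(1 - 4 * p / 3)


n_sites, n_diff = 100, 10                      # hypothetical data
d_best = jc69_distance(n_diff / n_sites)
for d in (0.05, d_best, 0.20):
    print(f"d = {d:.4f}  log-likelihood = {jc69_log_likelihood(n_sites, n_diff, d):.3f}")

# Likelihood ratio: how much better the best branch length is than d = 0.20
ratio = math.exp(jc69_log_likelihood(n_sites, n_diff, d_best)
                 - jc69_log_likelihood(n_sites, n_diff, 0.20))
print(f"likelihood ratio (best vs 0.20) = {ratio:.2f}")
```

Every log-likelihood is a large negative number, meaning any one outcome is extremely unlikely. Only the comparison among alternatives is informative.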

        Heuristic example: Consider game of five-card stud poker with standard 52-card deck
                                         Consider game of Fizzbin with unknown deck

Comparative Results of three phylogenetic methods for Five-taxon Panda Problem: NJ, MP, & ML methods

Statistical tests determine confidence in branching order
          Bootstrap Analysis: a re-sampling technique
                statistical tests usually involve replication / repetition of experiment:
                this is (?) inconvenient with DNA data
            Suppose sample data set of n bases accurately estimates parametric data (complete genome)
                  re-sample n sites (with replacement) ~3,000 times
                        repeat phylogenetic analysis on each 'new' set:
                        among all of these sets,
                              how often do same clades / clusters appear?
                    "50% bootstrap support" identifies groups that occur more frequently than all others combined
                     95% criterion desirable, sometimes not obtained with smaller data sets
                            cf. 1,140bp vs 11,582bp data sets
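
The re-sampling procedure can be sketched as below. To keep the example short, the "phylogenetic analysis" repeated on each replicate is only a stand-in: it reports which two taxa are closest by number of differences, rather than building a full tree. The function names, the three short sequences, and the choice of 1,000 replicates are hypothetical.

```python
import random


def closest_pair(seqs):
    """Stand-in analysis: the pair of taxa with the fewest differences.
    Ties go to the first pair in alphabetical order."""
    names = sorted(seqs)
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            diff = sum(x != y for x, y in zip(seqs[a], seqs[b]))
            if best is None or diff < best[0]:
                best = (diff, (a, b))
    return best[1]


def bootstrap_support(seqs, pair, replicates=1000, seed=1):
    """Fraction of bootstrap replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)               # fixed seed for repeatability
    n = len(next(iter(seqs.values())))
    target = tuple(sorted(pair))
    hits = 0
    for _ in range(replicates):
        # Re-sample n sites (alignment columns) with replacement
        cols = [rng.randrange(n) for _ in range(n)]
        resampled = {name: "".join(s[c] for c in cols)
                     for name, s in seqs.items()}
        if closest_pair(resampled) == target:
            hits += 1
    return hits / replicates


# Hypothetical aligned sequences for three taxa
alignment = {
    "X": "ACGTACGTACGTACGTACGT",
    "Y": "ACGTACGAACGTACGTACCT",
    "Z": "ACTTACGAACGAACGTTCCT",
}
support = bootstrap_support(alignment, ("X", "Y"))
print(f"bootstrap support for (X, Y): {support:.0%}")
```

With a short alignment, only a few sites separate the alternatives, so support can fall below the 95% criterion even when the grouping is correct.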

Download & install free MEGA X [Molecular Evolutionary Genetics Analysis] software

Lab Exercise: Are Giant Pandas (Ailuropoda) and Red (Lesser) Pandas (Ailurus) each other's closest relatives?
        1,140 bp Cytochrome b data set (.meg format)
      15,582 bp mtDNA Coding Region data set (.meg format)  (ZIP file)

     15,600 bp mtDNA Coding Region, 12 families (.meg format) (annotation)

HOMEWORK: Results for the Panda Problem
                from UPGMA, Neighbor Joining, Maximum Parsimony, & Maximum Likelihood methods

Evolutionary genetic analysis of Newfoundland Caribou (Rangifer tarandus terranovae) (Wilkerson et al. 2018)
Phylogenetic analysis of codfish & relatives (Gadidae) (Coulson et al. 2006)
A molecular understanding of the evolutionary history of birds (Jarvis et al. 2014)
Applications to the evolution of SARS-CoV-2, the virus that causes COVID-19

Text material © 2021 by Steven M. Carr