This is a continuation of the post “The ultimate grad student guide to survive (and pass) qualifying exams”, in which you can find helpful advice collected from several grad students who were successful (or not so much) in their qualifying exams. As promised, here is the first sample of a quals question and answer under the format of the Ecology, Evolution and Systematics graduate program of the University of Missouri St Louis. The question and answer below are from my own qualifying exam, back in 2012, and it is supposed to be a question for a population biology major, but it could easily be systematics also. I didn’t edit anything from the version I sent to my committee, so if you find something wrong, that’s what my committee received! :P
What is meant by incomplete lineage sorting, and how does it affect assessments of relationship and species delimitation?
Figure 01: Hypothetical species tree and gene trees exemplifying a case of incongruent tree topology due to incomplete lineage sorting. The species tree for taxa A, B, C and D is shown at the top. Genes were sampled from species A, B, C and D, and are represented by the lines in the species tree. The respective gene trees are represented at the bottom. The gene represented by the continuous black line is a case of incomplete lineage sorting (first tree in the bottom row). If a tree is constructed based on the branching pattern of this gene, species B will share a common ancestor with species C more recently than with species A, which contradicts the species tree. The gene trees represented by the grey and the dashed lines have the same topology as the species tree, exemplifying cases of congruence among trees. Figure and legend adapted from Edwards (2009).
The branching pattern of a phylogeny tells the history of how species and genes evolved through time (Edwards 2009). This history can be reconstructed, for instance, by comparing morphological traits or DNA sequences. In the latter case, the number of nucleotide substitutions accumulated in the DNA gives an estimate of when the operational taxonomic units under comparison last shared a common ancestor. However, the histories of species and genes can differ from each other, generating incongruent trees (Pamilo and Nei 1988). Incongruence between trees can happen because species and genes may not have branched at the same time, or in other words, lineages may have failed to sort out by the time speciation happened, a process named incomplete lineage sorting (Maddison 1997). From the point of view of population genetics, when there is incomplete lineage sorting the coalescence time of genes and the speciation time are different. Such a difference means that if time is traced backwards along a branch of a phylogenetic tree, genes will not coalesce at the same time that speciation events happen. Since the time needed for DNA sequences to coalesce, that is, the time needed for such sequences to converge on a common ancestor (Charlesworth 2009), can be longer than the time elapsed since speciation, incomplete lineage sorting can also be referred to as deep coalescence (Maddison 1997). When trees are constructed from molecular sequences affected by incomplete lineage sorting, gene trees and species trees will present different topologies, showing distinct branching patterns and influencing the interpretation of species relationships and definitions. Figure 01 represents what a species tree and gene trees look like when incomplete lineage sorting is present.
There are two scenarios under which incomplete lineage sorting is more likely to happen: 1) large effective population size (Ne) (i.e. wide phylogenetic branches) and/or 2) few generations to divergence (i.e. short phylogenetic branches) (Maddison 1997). The effective population size (Ne) is the number of individuals in a population with equal probability of contributing gametes to the next generation (Wright 1931, Avise 2004). The concept of Ne was first developed to make predictions about the fate of genes in a population over time (Wright 1931), revealing how random sampling of alleles in a population (i.e. genetic drift) influences the rate of evolutionary change (Charlesworth 2009). Genetic drift can also be viewed as a matter of statistical sampling error of alleles in a population, which is inversely related to the sample size (i.e. Ne) (Avise 2004). The effects of genetic drift are strong when Ne is small, meaning that the chance of losing alleles at every generation is high. Therefore, incomplete lineage sorting is more likely to occur when ancestral populations have large Ne, since genetic drift will be weak, increasing the chance that alleles will not coalesce by the time speciation occurs (Nichols 2001). Thus, trees with narrow branches (i.e. small Ne) are less likely to present incomplete lineage sorting (Maddison 1997).
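To make the link between Ne, number of generations and lineage sorting more concrete, here is a minimal sketch. It assumes an idealized diploid Wright-Fisher population, in which two gene copies find a common ancestor in any given generation with probability 1/(2Ne). The function name and the example values of Ne and generations are illustrative assumptions of mine, not values taken from the cited works.

```python
import math


def prob_not_coalesced(ne: float, generations: int) -> float:
    """Probability that two sampled gene copies still have distinct
    ancestors after `generations` generations, assuming an idealized
    diploid Wright-Fisher population of effective size `ne`."""
    per_generation_coalescence = 1.0 / (2.0 * ne)
    return (1.0 - per_generation_coalescence) ** generations


if __name__ == "__main__":
    # Same number of generations, increasingly large populations:
    # the larger Ne is, the more likely both lineages persist unsorted.
    for ne in (100, 1_000, 10_000):
        p = prob_not_coalesced(ne, 1_000)
        print(f"Ne = {ne:>6}: P(two lineages persist for 1000 generations) = {p:.3f}")
```

Under these assumptions the persistence probability is close to exp(-t/(2Ne)), so it rises with Ne and falls with the number of generations t, which is the pattern described above.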
Figure 02: Probabilities (π) of survival of two or more founding lineages through time. Probability curves for populations of various sizes (N) are shown. Figure and legend adapted from Avise (2004).
Although lineage persistence is correlated with Ne, it is improbable that a lineage is able to persist for more than 4Ne generations (Nichols 2001, Avise 2004). Figure 02 shows the probability of survival of lineages through time, depending on Ne. Transposing Figure 02 to a phylogenetic tree, it is possible to interpret that the wider (i.e. larger Ne) and the shorter (i.e. fewer generations) the branches are, the higher the chances that lineages will fail to sort out before speciation events (Maddison 1997, Maddison and Knowles 2006). Thus, divergence time, the number of generations elapsed until speciation, is the second factor contributing to incomplete lineage sorting. Conceptually, gene trees and species trees are not the same (Pamilo and Nei 1988): even though both describe evolutionary histories, the former refers to orthologous genes (i.e. genes separated by speciation), while the latter refers to the evolutionary pathways of species, meaning that incongruence among these trees should not be considered odd (Pamilo and Nei 1988). The probability of congruence between a species tree and a gene tree (P) can be derived as a direct function of the internodal branch length (T) using the equation P = 1 - (2/3)e^(-T). In the equation, T is the length of the branch between the more ancient and the more recent divergences, given by the formula T = t/(2Ne), where t is the number of generations between the two divergences (Pamilo and Nei 1988, Rosenberg 2002, Figure 03). Large values of Ne and small values of t will reduce T, bringing the term (2/3)e^(-T) closer to its maximum of 2/3 and reducing the probability of congruence among topologies towards 1/3.
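As a worked illustration of the congruence formula, the sketch below evaluates P = 1 - (2/3)e^(-T) with T = t/(2Ne) for a three-species tree with one gene copy sampled per species. The function names and the example values of Ne and t are my own assumptions, chosen only to show how P responds to branch width and length.

```python
import math


def internodal_length(t_generations: float, ne: float) -> float:
    """Internodal branch length T = t / (2 * Ne)."""
    return t_generations / (2.0 * ne)


def prob_congruence(T: float) -> float:
    """Probability that a gene tree matches the species tree,
    P = 1 - (2/3) * exp(-T), for three species and one gene copy each."""
    return 1.0 - (2.0 / 3.0) * math.exp(-T)


if __name__ == "__main__":
    t = 10_000  # hypothetical number of generations between the two divergences
    for ne in (1_000, 10_000, 100_000):
        T = internodal_length(t, ne)
        print(f"Ne = {ne:>7}, t = {t}: T = {T:.2f}, P = {prob_congruence(T):.3f}")
```

When T approaches zero, P approaches 1/3, which is what one would expect if the three possible topologies were equally likely; for long or narrow branches, P approaches 1.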
Figure 03: Relationship between the probability of congruent topology between species tree and gene trees (P) and internodal branch length (T). Figure and legend adapted from Pamilo and Nei (1988).
If gene and species trees disagree due to incomplete lineage sorting, one can question what the consequences are for defining species and interpreting the relationships among them. The consequences are straightforward and fit into two broad scenarios: 1) gene trees retrieve erroneous species trees, with unrealistic representations of taxa relationships, and/or 2) absence of reciprocal monophyly, meaning that alleles sampled from a species will form paraphyletic rather than monophyletic clades (i.e. clades that contain a common ancestor and all its descendants) (Avise 2004). Uncertainty in species definitions caused by incomplete lineage sorting was investigated by Heckman et al. (2007), who tested the hypothesis that the mouse lemurs of Madagascar comprise eight species. Phylogenetic analysis of a single mitochondrial DNA (mtDNA) locus defined eight species for the genus of mouse lemurs, Microcebus, adding six new species to the group (Yoder et al. 2000). However, a multilocus analysis can provide stronger evidence for species divergence (Maddison 1997, Maddison and Knowles 2006, Zachos 2009). Applying a multilocus approach, Heckman et al. (2007) found incongruence when comparing trees obtained from mtDNA sequences and from segregating nuclear loci. Clades that were monophyletic in the mtDNA trees appeared polyphyletic (i.e. derived from at least two ancestors) in trees retrieved from nuclear DNA data (Heckman et al. 2007). The incongruence is rooted in the fact that mtDNA has a smaller Ne than nuclear DNA, and the latter is phylogenetically less informative than the former, due to lower mutation rates (Avise 2004). The authors attributed the incongruence to incomplete lineage sorting, since the species in the polyphyletic clade share polymorphisms at every nuclear locus analyzed, indicating that during Microcebus diversification, mtDNA haplotypes, but not nuclear alleles, sorted out before speciation (Heckman et al. 2007).
However, when the authors concatenated all gene sequences, they retrieved a tree with better support and resolution, pointing to one way of dealing with incomplete lineage sorting and obtaining more reliable phylogenetic trees, a topic discussed further in this essay (Heckman et al. 2007).
The influence of incomplete lineage sorting on the interpretation of species relationships was investigated when the genomes of humans and other primates were compared (Patterson et al. 2006). Genetic divergence between humans and chimpanzees varies from less than 84% to more than 147% of the genome-wide average, suggesting that incomplete lineage sorting might be the reason for lower divergence at some loci (Patterson et al. 2006). When the orangutan genome is added to the comparison, it reveals that incomplete lineage sorting happened approximately 1% of the time along the evolutionary history of these three species (Hobolth et al. 2011). More interestingly, in 0.8% of the genome, humans are closer to orangutans than they are to chimpanzees, and the latter are closer to orangutans in 0.6% of the genome (Hobolth et al. 2011). The occurrence of incomplete lineage sorting in the phylogeny of these species can be explained by the fairly large Ne of the human-chimpanzee ancestral population (Hobolth et al. 2007). Incomplete lineage sorting was also pointed out as the cause of incongruence when comparing the trees retrieved from the genomes of species composing the Drosophila melanogaster complex (Pollard et al. 2006). Even though the phylogenetic analysis with full genome data of the four species in the complex generated a tree with better support, widespread incongruence was observed among nucleotide and amino acid substitutions, insertions and deletions (i.e. indels), as well as gene trees (Pollard et al. 2006). It seems that species in the D. melanogaster complex underwent a rapid speciation event (i.e. low T value), which contributed to the maintenance of ancestral polymorphisms in the recently diverged species (Pollard et al. 2006). Although Pollard et al. (2006) successfully identify incomplete lineage sorting as the reason for incongruence among species and gene trees, the study does not attempt to control for or incorporate such information to better understand the phylogenetic relationships among species.
When testing phylogenetic hypotheses, especially for recently diverged taxa, it is recommended to use approaches that can overcome the problems of misinterpretation due to retention of polymorphisms from ancestral lineages. The use of many genes sampled from each species was one of the first approaches suggested to deal with the absence of reciprocal monophyly among gene and species trees (Takahata 1989, Sanderson and Shaffer 2002). Also attempting to consider the effects of incomplete lineage sorting when retrieving consistent phylogenies, Maddison and Knowles (2006) reconstructed species trees using simulated nucleotide sequences and their respective gene trees. They concluded that for shallow species trees (i.e. rapid species divergence) increasing the number of loci raises the chance of sampling various models of evolution, providing a more accurate species tree (Maddison and Knowles 2006). A systematic investigation of how multiple genes can improve phylogenetic inferences and solve problems of incongruence was conducted by Rokas et al. (2003), who analyzed trees recovered from 106 orthologous genes from eight yeast species of the genus Saccharomyces. A high probability of incongruence was widespread among the analyzed genes, regardless of whether trees were retrieved from single or concatenated genes (Rokas et al. 2003). However, trees generated from at least 20 concatenated genes had bootstrap support above 95%, overcoming the problems of inconsistency obtained with single genes (Rokas et al. 2003).
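The intuition that more loci help can be sketched with a small simulation. It is a deliberate simplification and not the concatenation or tree-reconstruction procedures used in the studies cited above: it assumes three species with true tree ((A,B),C), one gene copy per species, independent loci, and a simple plurality vote over gene tree topologies. The function names, the value of T and the numbers of loci are hypothetical choices of mine.

```python
import math
import random
from collections import Counter


def sample_gene_tree(T: float, rng: random.Random) -> str:
    """Draw one gene tree topology for the species tree ((A,B),C).

    The A and B lineages coalesce within the internal branch with
    probability 1 - exp(-T); otherwise all three lineages reach the root
    population, where each pair is equally likely to coalesce first.
    The returned label names the two sister species in the gene tree."""
    if rng.random() < 1.0 - math.exp(-T):
        return "AB"
    return rng.choice(["AB", "AC", "BC"])


def plurality_success_rate(T: float, n_loci: int, n_reps: int = 2000, seed: int = 1) -> float:
    """Fraction of replicates in which the most frequent gene tree
    topology across `n_loci` loci is the true one (ties count as failures)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        counts = Counter(sample_gene_tree(T, rng) for _ in range(n_loci))
        top = max(counts.values())
        winners = [topology for topology, c in counts.items() if c == top]
        if winners == ["AB"]:
            hits += 1
    return hits / n_reps


if __name__ == "__main__":
    T = 0.2  # hypothetical short internal branch, i.e. rapid divergence
    for n_loci in (1, 5, 20, 50):
        rate = plurality_success_rate(T, n_loci)
        print(f"{n_loci:>2} loci: true topology recovered in {rate:.2f} of replicates")
```

Because the congruent topology is the single most probable one whenever T is greater than zero, sampling more independent loci makes it increasingly likely to come out on top in this toy setting.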
It has been suggested that a phylogeny can be better described by a statistical distribution (Maddison 1997). Considering species phylogeny as a probabilistic event, maximum likelihood has also been applied to obtain the species tree that offers the highest probability of finding the observed gene trees (Maddison 1997, Carstens and Knowles 2007, Wu 2011). The phylogenetic relationships of species from the genus Melanoplus of montane grasshoppers were better described by estimating the species tree probabilistically from gene trees (Carstens and Knowles 2007). Four species of the genus, M. montanus, M. oregonensis, M. marshalli and M. triangularis, which radiated recently in the Pleistocene, present distinct morphology and distribution, but have unresolved species relationships (Carstens and Knowles 2007). Five loci, one mitochondrial and four nuclear, were sampled per species to generate gene trees using maximum likelihood. Trees were also generated considering the probability of incomplete lineage sorting, by applying a model of stochastic loss of lineages through genetic drift, elaborated as a function of Ne and the number of generations to divergence (t) (Carstens and Knowles 2007). In this study, the method for obtaining the species tree proved to be consistent when the same procedures were applied to simulated nucleotide sequences (Carstens and Knowles 2007). The best estimated species tree had high accuracy and support in comparison to previously obtained phylogenies (Carstens and Knowles 2007).
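The idea of choosing the species tree that makes the observed gene trees most probable can be illustrated with a toy version for three species. This is only a sketch under strong assumptions (one gene copy per species, independent loci, gene tree topologies known without error) and is not the implementation of Carstens and Knowles (2007) or Wu (2011). The gene tree counts in the example are hypothetical, and the function names are mine.

```python
import math

TOPOLOGIES = ("AB", "AC", "BC")  # each label names the pair of sister species


def log_likelihood(counts: dict, species_topology: str, T: float) -> float:
    """Log-likelihood of gene tree counts given a species tree topology and
    internodal length T (the constant multinomial coefficient is omitted)."""
    p_congruent = 1.0 - (2.0 / 3.0) * math.exp(-T)
    p_each_incongruent = (1.0 / 3.0) * math.exp(-T)
    total = 0.0
    for topology in TOPOLOGIES:
        k = counts.get(topology, 0)
        if k == 0:
            continue
        p = p_congruent if topology == species_topology else p_each_incongruent
        if p == 0.0:
            return -math.inf
        total += k * math.log(p)
    return total


def fit_species_tree(counts: dict):
    """Return (topology, log-likelihood, T estimate) for the species tree
    that maximizes the likelihood of the observed gene tree counts."""
    n = sum(counts.get(topology, 0) for topology in TOPOLOGIES)
    best = None
    for topology in TOPOLOGIES:
        # The congruent fraction cannot fall below 1/3 under the model (T >= 0).
        p_hat = max(counts.get(topology, 0) / n, 1.0 / 3.0)
        if p_hat >= 1.0:
            T_hat = math.inf
        else:
            T_hat = max(0.0, -math.log(1.5 * (1.0 - p_hat)))
        ll = log_likelihood(counts, topology, T_hat)
        if best is None or ll > best[1]:
            best = (topology, ll, T_hat)
    return best


if __name__ == "__main__":
    observed = {"AB": 60, "AC": 22, "BC": 18}  # hypothetical counts over 100 loci
    topology, ll, T_hat = fit_species_tree(observed)
    print(f"best species tree groups {topology}, logL = {ll:.2f}, T = {T_hat:.3f}")
```

For each candidate species tree, the estimate of T follows from inverting P = 1 - (2/3)e^(-T) at the observed fraction of congruent gene trees, so incongruent loci are treated as information about branch length rather than as noise.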
Incomplete lineage sorting is a widespread phenomenon and can provide useful insights into the population size of ancestors, the speed of species divergence, as well as comparative information on how different genes evolved through time, shedding light on how different selection pressures acted on genomes through evolutionary time (Nichols 2001). The failure of lineages to sort out along evolutionary history is associated with large Ne and rapid species divergence. The use of multiple loci of both mitochondrial and nuclear origin seems to provide enough evolutionary variability to reproduce consistent species phylogenies. Although incomplete lineage sorting can confound phylogenetic inferences, when the phenomenon is recognized and strategies that reduce problems of tree incongruence are incorporated, the evolutionary history of species can be revealed with more accuracy. Considering how incomplete lineage sorting, among other factors, can generate incongruent evolutionary histories, Maddison (1997) makes an insightful analogy between phylogenetic trees and electrons. In physics, there is a probability associated with the presence of electrons around the nucleus of an atom, meaning that electrons can be found in more than one place at once. So can phylogenies. Depending on the genes sampled, phylogenetic history can be found in different places at the same time. Thus, the same way electrons can be described as a probabilistic cloud of occurrence around an atom, a phylogeny can be viewed as a diffuse cloud of gene histories (Maddison 1997). Reconstructing the history of how species evolved through time, and appropriately testing hypotheses about species relationships, can only be achieved if the chance of incomplete lineage sorting is considered and properly incorporated into phylogenetic inferences.
Avise, J. 2004. Molecular markers, natural history, and evolution. Sinauer Associates, Sunderland. 684 pages, 2nd edition.
Carstens, B. C., and L. L. Knowles. 2007. Estimating species phylogeny from gene-tree probabilities despite incomplete lineage sorting: an example from Melanoplus grasshoppers. Systematic Biology 56:400–411.
Charlesworth, B. 2009. Fundamental concepts in genetics: Effective population size and patterns of molecular evolution and variation. Nature Reviews Genetics 10:195–205.
Edwards, S. V. 2009. Is a new and general theory of molecular systematics emerging? Evolution 63:1–19.
Heckman, K. L., C. L. Mariani, R. Rasoloarison, and A. D. Yoder. 2007. Multiple nuclear loci reveal patterns of incomplete lineage sorting and complex species history within western mouse lemurs (Microcebus). Molecular Phylogenetics and Evolution 43:353–367.
Hobolth, A., O. F. Christensen, T. Mailund, and M. H. Schierup. 2007. Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genetics 3:e7.
Hobolth, A., J. Y. Dutheil, J. Hawks, M. H. Schierup, and T. Mailund. 2011. Incomplete lineage sorting patterns among human, chimpanzee, and orangutan suggest recent orangutan speciation and widespread selection. Genome Research 21:349–356.
Maddison, W. P. 1997. Gene trees in species trees. Systematic Biology 46:523–536.
Maddison, W. P., and L. L. Knowles. 2006. Inferring phylogeny despite incomplete lineage sorting. Systematic Biology 55:21–30.
Nichols, R. 2001. Gene trees and species trees are not the same. Trends in Ecology & Evolution 16:358–364.
Pamilo, P., and M. Nei. 1988. Relationships between gene trees and species trees. Molecular Biology and Evolution 5:568–583.
Patterson, N., D. J. Richter, S. Gnerre, E. S. Lander, and D. Reich. 2006. Genetic evidence for complex speciation of humans and chimpanzees. Nature 441:1103–1108.
Pollard, D. A., V. N. Iyer, A. M. Moses, and M. B. Eisen. 2006. Widespread discordance of gene trees with species tree in Drosophila: evidence for incomplete lineage sorting. PLoS Genetics 2:e173.
Rokas, A., B. L. Williams, N. King, and S. B. Carroll. 2003. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798–804.
Rosenberg, N. A. 2002. The probability of topological concordance of gene trees and species trees. Theoretical Population Biology 61:225–247.
Sanderson, M. J., and H. B. Shaffer. 2002. Troubleshooting molecular phylogenetic analyses. Annual Review of Ecology and Systematics 33:49–72.
Takahata, N. 1989. Gene genealogy in three related populations: consistency probability between gene and population trees. Genetics 122:957–966.
Wright, S. 1931. Evolution in Mendelian populations. Genetics 16:97–159.
Wu, Y. 2011. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution 66:763–775.
Yoder, A. D., R. M. Rasoloarison, S. M. Goodman, J. A. Irwin, S. Atsalis, M. J. Ravosa, and J. U. Ganzhorn. 2000. Remarkable species diversity in Malagasy mouse lemurs (primates, Microcebus). Proceedings of the National Academy of Sciences of the United States of America 97:11325–11330.
Zachos, F. E. 2009. Gene trees and species trees–mutual influences and interdependences of population genetics and systematics. Journal of Zoological Systematics and Evolutionary Research 47:209–218.