No Arabic abstract
A number of methods have been developed to infer differential rates of species diversification through time and among clades using time-calibrated phylogenetic trees. However, we lack a general framework that can delineate and quantify heterogeneous mixtures of dynamic processes within single phylogenies. I developed a method that can identify arbitrary numbers of time-varying diversification processes on phylogenies without specifying their locations in advance. The method uses reversible-jump Markov Chain Monte Carlo to move between model subspaces that vary in the number of distinct diversification regimes. The model assumes that changes in evolutionary regimes occur across the branches of phylogenetic trees under a compound Poisson process and explicitly accounts for rate variation through time and among lineages. Using simulated datasets, I demonstrate that the method can be used to quantify complex mixtures of time-dependent, diversity-dependent, and constant-rate diversification processes. I compared the performance of the method to the MEDUSA model of rate variation among lineages. As an empirical example, I analyzed the history of speciation and extinction during the radiation of modern whales. The method described here will greatly facilitate the exploration of macroevolutionary dynamics across large phylogenetic trees, which may have been shaped by heterogeneous mixtures of distinct evolutionary processes.
The sequence of amino acids in a protein is believed to determine its native state structure, which in turn is related to the functionality of the protein. In addition, information pertaining to evolutionary relationships is contained in homologous sequences. One powerful method for inferring these sequence attributes is through comparison of a query sequence with reference sequences that contain significant homology and whose structure, function, and/or evolutionary relationships are already known. In spite of decades of concerted work, there is no simple framework for deducing structure, function, and evolutionary (SF&E) relationships directly from sequence information alone, especially when the pair-wise identity is less than a threshold figure ~25% [1,2]. However, recent research has shown that sequence identity as low as 8% is sufficient to yield common structure/function relationships and sequence identities as large as 88% may yet result in distinct structure and function [3,4]. Starting with a basic premise that protein sequence encodes information about SF&E, one might ask how one could tease out these measures in an unbiased manner. Here we present a unified framework for inferring SF&E from sequence information using a knowledge-based approach which generates phylogenetic profiles in an unbiased manner. We illustrate the power of phylogenetic profiles generated using the Gestalt Domain Detection Algorithm Basic Local Alignment Tool (GDDA-BLAST) to derive structural domains, functional annotation, and evolutionary relationships for a host of ion-channels and human proteins of unknown function. These data are in excellent accord with published data and new experiments. Our results suggest that there is a wealth of previously unexplored information in protein sequence.
Because biological processes can make different loci have different evolutionary histories, species tree estimation requires multiple loci from across the genome. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause for gene tree heterogeneity. Coalescent-based methods have been developed to estimate species trees, many of which operate by combining estimated gene trees, and so are called summary methods. Because summary methods are generally fast, they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have substantial gene tree estimation error, so that summary methods may not be highly accurate on biologically realistic conditions. Mirarab et al. (Science 2014) presented the statistical binning technique to improve gene tree estimation in multi-locus analyses, and showed that it improved the accuracy of MP-EST, one of the most popular coalescent-based summary methods. Statistical binning, which uses a simple statistical test for combinability and then uses the larger sets of genes to re-calculate gene trees, has good empirical performance, but using statistical binning within a phylogenomics pipeline does not have the desirable property of being statistically consistent. We show that weighting the recalculated gene trees by the bin sizes makes statistical binning statistically consistent under the multispecies coalescent, and maintains the good empirical performance. Thus, weighted statistical binning enables highly accurate genome-scale species tree estimation, and is also statistical consistent under the multi-species coalescent model.
An evolutionary tree is a cascade of bifurcations starting from a single common root, generating a growing set of daughter species as time goes by. Species here is a general denomination for biological species, spoken languages or any other entity evolving through heredity. From the N currently alive species within a clade, distances are measured through pairwise comparisons made by geneticists, linguists, etc. The larger is such a distance for a pair of species, the older is their last common ancestor. The aim is to reconstruct the past unknown bifurcations, i.e. the whole clade, from the knowledge of the N(N-1)/2 quoted distances taken for granted. A mechanical method is presented, and its applicability discussed.
Phylogenetic Diversity (PD) is a prominent quantitative measure of the biodiversity of a collection of present-day species (taxa). This measure is based on the evolutionary distance among the species in the collection. Loosely speaking, if $mathcal{T}$ is a rooted phylogenetic tree whose leaf set $X$ represents a set of species and whose edges have real-valued lengths (weights), then the PD score of a subset $S$ of $X$ is the sum of the weights of the edges of the minimal subtree of $mathcal{T}$ connecting the species in $S$. In this paper, we define several natural variants of the PD score for a subset of taxa which are related by a known rooted phylogenetic network. Under these variants, we explore, for a positive integer $k$, the computational complexity of determining the maximum PD score over all subsets of taxa of size $k$ when the input is restricted to different classes of rooted phylogenetic networks
In this paper, decision theory was used to derive Bayes and minimax decision rules to estimate allelic frequencies and to explore their admissibility. Decision rules with uniformly smallest risk usually do not exist and one approach to solve this problem is to use the Bayes principle and the minimax principle to find decision rules satisfying some general optimality criterion based on their risk functions. Two cases were considered, the simpler case of biallelic loci and the more complex case of multiallelic loci. For each locus, the sampling model was a multinomial distribution and the prior was a Beta (biallelic case) or a Dirichlet (multiallelic case) distribution. Three loss functions were considered: squared error loss (SEL), Kulback-Leibler loss (KLL) and quadratic error loss (QEL). Bayes estimators were derived under these three loss functions and were subsequently used to find minimax estimators using results from decision theory. The Bayes estimators obtained from SEL and KLL turned out to be the same. Under certain conditions, the Bayes estimator derived from QEL led to an admissible minimax estimator (which was also equal to the maximum likelihood estimator). The SEL also allowed finding admissible minimax estimators. Some estimators had uniformly smaller variance than the MLE and under suitable conditions the remaining estimators also satisfied this property. In addition to their statistical properties, the estimators derived here allow variation in allelic frequencies, which is closer to the reality of finite populations exposed to evolutionary forces.