
No substitute for real data: phylogenies from birth-death polytomy resolvers should not be used for many downstream comparative analyses

Published by Daniel Rabosky
Publication date: 2015
Research field: Biology
Language: English
Author: Daniel L. Rabosky





The statistical estimation of phylogenies is always associated with uncertainty, and accommodating this uncertainty is an important component of modern phylogenetic comparative analysis. The birth-death polytomy resolver is a method of accounting for phylogenetic uncertainty that places missing (unsampled) taxa onto phylogenetic trees, using taxonomic information alone. Recent studies of birds and mammals have used this approach to generate pseudo-posterior distributions of phylogenetic trees that are complete at the species level, even in the absence of genetic data for many species. Many researchers have used these distributions of phylogenies for downstream evolutionary analyses that involve inferences on phenotypic evolution, geography, and community assembly. I demonstrate that the use of phylogenies constructed in this fashion is inappropriate for many questions involving traits. Because species are placed on trees at random with respect to trait values, the birth-death polytomy resolver breaks down natural patterns of trait phylogenetic structure. Inferences based on these trees are predictably and often drastically biased in a direction that depends on the underlying (true) pattern of phylogenetic structure in traits. I illustrate the severity of the phenomenon for both continuous and discrete traits using examples from a global bird phylogeny.
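The bias described in the abstract can be illustrated with a toy simulation. The sketch below is not the birth-death polytomy resolver and does not use the bird phylogeny. It is a minimal illustration under stated assumptions: a genus made of sister pairs, a trait evolving by Brownian motion, and a hypothetical function name and parameters (`simulate_rate_bias`, `t_shallow`, `t_deep`) chosen only for this example. One species per pair is treated as unsampled and is attached either to its true sister or to a randomly chosen congener, and in both cases the tree is assumed to assign the same shallow divergence time.

```python
import random
import statistics


def simulate_rate_bias(n_pairs=200, sigma2=1.0, t_shallow=0.1, t_deep=1.0, seed=1):
    """Toy comparison of Brownian-motion rate estimates from sister contrasts.

    Assumptions (illustrative only, not from the paper):
    - each sister pair shares an ancestor whose trait value diverged from the
      other pairs over time t_deep;
    - each sister then evolves independently for time t_shallow;
    - a taxonomy-only placement attaches each unsampled species to a random
      sampled congener, ignoring trait values.
    """
    rng = random.Random(seed)
    sd_deep = (sigma2 * t_deep) ** 0.5
    sd_shallow = (sigma2 * t_shallow) ** 0.5

    sampled, unsampled = [], []
    for _ in range(n_pairs):
        ancestor = rng.gauss(0.0, sd_deep)
        sampled.append(ancestor + rng.gauss(0.0, sd_shallow))
        unsampled.append(ancestor + rng.gauss(0.0, sd_shallow))

    # True placement: each unsampled species sits next to its real sister.
    true_contrasts = [u - s for u, s in zip(unsampled, sampled)]
    # Random placement: the apparent sister is any sampled species in the genus.
    random_contrasts = [u - rng.choice(sampled) for u in unsampled]

    def rate(contrasts):
        # Under Brownian motion, Var(sister contrast) = 2 * sigma2 * t,
        # so the mean squared contrast divided by 2t estimates sigma2.
        return statistics.fmean([c * c for c in contrasts]) / (2.0 * t_shallow)

    return rate(true_contrasts), rate(random_contrasts)


if __name__ == "__main__":
    true_rate, random_rate = simulate_rate_bias()
    print(f"rate with true sisters:   {true_rate:.2f}")
    print(f"rate with random sisters: {random_rate:.2f}")
```

In this setup the trait has strong phylogenetic structure, so random placement makes apparent sisters look far more divergent than real ones and the estimated rate is inflated. With a different true pattern of structure the bias could point the other way, which matches the abstract's statement that the direction depends on the underlying pattern.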




Read also

The availability of a large number of assembled genomes opens the way to study the evolution of syntenic characters within a phylogenetic context. The DeCo algorithm, recently introduced by Bérard et al., allows the computation of parsimonious evolutionary scenarios for gene adjacencies from pairs of reconciled gene trees. Following the approach pioneered by Sturmfels and Pachter, we describe how to modify the DeCo dynamic programming algorithm to identify classes of cost schemes that generate similar parsimonious evolutionary scenarios for gene adjacencies, and to assess how robust the presence or absence of specific ancestral gene adjacencies is to changes in the cost scheme of evolutionary events. We apply our method to six thousand mammalian gene families, and show that computing the robustness to changes in cost schemes provides new and interesting insights into the evolution of gene adjacencies and the DeCo model.
1. Joint Species Distribution models (JSDMs) explain spatial variation in community composition by contributions of the environment, biotic associations, and possibly spatially structured residual covariance. They show great promise as a general analytical framework for community ecology and macroecology, but current JSDMs, even when approximated by latent variables, scale poorly on large datasets, limiting their usefulness for currently emerging big (e.g., metabarcoding and metagenomics) community datasets. 2. Here, we present a novel, more scalable JSDM (sjSDM) that circumvents the need to use latent variables by using a Monte-Carlo integration of the joint JSDM likelihood and allows flexible elastic net regularization on all model components. We implemented sjSDM in PyTorch, a modern machine learning framework that can make use of CPU and GPU calculations. Using simulated communities with known species-species associations and different numbers of species and sites, we compare sjSDM with state-of-the-art JSDM implementations to determine computational runtimes and accuracy of the inferred species-species and species-environmental associations. 3. We find that sjSDM is orders of magnitude faster than existing JSDM algorithms (even when run on the CPU) and can be scaled to very large datasets. Despite the dramatically improved speed, sjSDM produces more accurate estimates of species association structures than alternative JSDM implementations. We demonstrate the applicability of sjSDM to big community data using an eDNA case study with thousands of fungal operational taxonomic units (OTUs). 4. Our sjSDM approach makes it possible to apply JSDMs to large community datasets with hundreds or thousands of species, substantially extending the applicability of JSDMs in ecology. We provide our method in an R package to facilitate its applicability for practical data analysis.
Because biological processes can make different loci have different evolutionary histories, species tree estimation requires multiple loci from across the genome. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause of gene tree heterogeneity. Coalescent-based methods have been developed to estimate species trees, many of which operate by combining estimated gene trees, and so are called summary methods. Because summary methods are generally fast, they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have substantial gene tree estimation error, so that summary methods may not be highly accurate under biologically realistic conditions. Mirarab et al. (Science 2014) presented the statistical binning technique to improve gene tree estimation in multi-locus analyses, and showed that it improved the accuracy of MP-EST, one of the most popular coalescent-based summary methods. Statistical binning, which uses a simple statistical test for combinability and then uses the larger sets of genes to re-calculate gene trees, has good empirical performance, but using statistical binning within a phylogenomics pipeline does not have the desirable property of being statistically consistent. We show that weighting the recalculated gene trees by the bin sizes makes statistical binning statistically consistent under the multi-species coalescent, and maintains the good empirical performance. Thus, weighted statistical binning enables highly accurate genome-scale species tree estimation, and is also statistically consistent under the multi-species coalescent model.
The technology to generate Spatially Resolved Transcriptomics (SRT) data is rapidly being improved and applied to investigate a variety of biological tissues. The ability to interrogate how spatially localised gene expression can lend new insight to different tissue development is critical, but the appropriate tools to analyse this data are still emerging. This chapter reviews available packages and pipelines for the analysis of different SRT datasets with a focus on identifying spatially variable genes (SVGs) alongside other aims, while discussing the importance of and challenges in establishing a standardised ground truth in the biological data for benchmarking.
Francesca Bassi (2020)
During the current Covid-19 pandemic in Italy, official data are collected with medical swabs following a pure convenience criterion which, at least in an early phase, has privileged the examination of patients showing evident symptoms. However, there is evidence of a very high proportion of asymptomatic patients (e.g. Aguilar et al., 2020; Chugthai et al., 2020; Li et al., 2020; Mizumoto et al., 2020a, 2020b and Yelin et al., 2020). In this situation, in order to estimate the real number of infected (and to estimate the lethality rate), it would be necessary to run a properly designed sample survey through which it would be possible to calculate the probability of inclusion and hence draw sound probabilistic inference. Some researchers proposed estimates of the total prevalence based on various approaches, including epidemiologic models, time series and the analysis of data collected in countries that faced the epidemic earlier (Brogi et al., 2020). In this paper, we propose to estimate the prevalence of Covid-19 in Italy by reweighting the available official data published by the Istituto Superiore di Sanità so as to obtain a more representative sample of the Italian population. Reweighting is a procedure commonly used to artificially modify the sample composition so as to obtain a distribution which is more similar to the population (Valliant et al., 2018). We use post-stratification of the official data, with age and gender as post-stratification variables, to derive the weights needed for reweighting, thus obtaining more reliable estimates of prevalence and lethality.