Research papers, master's theses, and doctoral dissertations published by Peter F. Stadler

A Linear-Time Algorithm for the Common Refinement of Rooted Phylogenetic Trees on a Common Leaf Set

92 - David Schaller , Marc Hellmuth , Peter F. Stadler 2021

The problem of finding a common refinement of a set of rooted trees with common leaf set $L$ appears naturally in mathematical phylogenetics whenever poorly resolved information on the same taxa from different sources is to be reconciled. This constitutes a special case of the well-studied supertree problem, where the leaf sets of the input trees may differ. Algorithms that solve the rooted tree compatibility problem are of course applicable to this special case. However, they require sophisticated auxiliary data structures and have a running time of at least $O(k|L|\log^2(k|L|))$ for $k$ input trees. Here, we show that the problem can be solved in $O(k|L|)$ time using a simple bottom-up algorithm called LinCR. An implementation of LinCR in Python is freely available at https://github.com/david-schaller/tralda.
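
The existence question in this abstract can be illustrated with a naive check. The sketch below is not LinCR and is not linear-time; it relies on the standard fact that rooted trees on the same leaf set have a common refinement exactly when the union of their cluster sets is a hierarchy (every two clusters are nested or disjoint). The function names and the cluster-set encoding are assumptions made for this example, and singleton clusters are omitted for brevity.

```python
from itertools import combinations


def is_hierarchy(clusters):
    """Return True if every pair of clusters is nested or disjoint."""
    for a, b in combinations(clusters, 2):
        common = a & b
        # A proper overlap: the clusters share leaves but neither contains the other.
        if common and common != a and common != b:
            return False
    return True


def common_refinement(trees):
    """Return the merged cluster set, or None if no common refinement exists.

    Each tree is encoded (hypothetically) as a set of frozensets, one per
    cluster, all over the same leaf set. This pairwise check is quadratic in
    the number of clusters, unlike the linear-time algorithm in the paper.
    """
    merged = set().union(*trees)
    return merged if is_hierarchy(merged) else None


t1 = {frozenset("abcd"), frozenset("ab")}
t2 = {frozenset("abcd"), frozenset("abc")}
t3 = {frozenset("abcd"), frozenset("bc")}

print(common_refinement([t1, t2]) is not None)  # True: {a,b} is nested in {a,b,c}
print(common_refinement([t1, t3]) is not None)  # False: {a,b} and {b,c} overlap
```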

Data Structures and Algorithms, Computational Complexity, Combinatorics

Compatibility of Partitions, Hierarchies, and Split Systems

77 - Marc Hellmuth , David Schaller , Peter F. Stadler 2021

The question whether a partition $\mathcal{P}$ and a hierarchy $\mathcal{H}$ or a tree-like split system $\mathfrak{S}$ are compatible naturally arises in a wide range of classification problems. In the setting of phylogenetic trees, one asks whether the sets of $\mathcal{P}$ coincide with leaf sets of connected components obtained by deleting some edges from the tree $T$ that represents $\mathcal{H}$ or $\mathfrak{S}$, respectively. More generally, we ask whether a refinement $T^*$ of $T$ exists such that $T^*$ and $\mathcal{P}$ are compatible. We report several characterizations for (refinements of) hierarchies and split systems that are compatible with (sets of) partitions. In addition, we provide a linear-time algorithm to check whether refinements of trees and a given partition are compatible. The latter problem becomes NP-complete but fixed-parameter tractable if a set of partitions is considered instead of a single partition. We finally explore the close relationship of the concept of compatibility and so-called Fitch maps.

Discrete Mathematics, Combinatorics

From Modular Decomposition Trees to Rooted Median Graphs

83 - Carmen Bruckmann , Peter F. Stadler , Marc Hellmuth 2021

The modular decomposition of a symmetric map $\delta\colon X\times X \to \Upsilon$ (or, equivalently, a set of symmetric binary relations, a 2-structure, or an edge-colored undirected graph) is a natural construction to capture key features of $\delta$ in labeled trees. A map $\delta$ is explained by a vertex-labeled rooted tree $(T,t)$ if the label $\delta(x,y)$ coincides with the label of the last common ancestor of $x$ and $y$ in $T$, i.e., if $\delta(x,y)=t(\mathrm{lca}(x,y))$. Only maps whose modular decomposition does not contain prime nodes, i.e., the symbolic ultrametrics, can be explained in this manner. Here we consider rooted median graphs as a generalization of (modular decomposition) trees to explain symmetric maps. We first show that every symmetric map can be explained by extended hypercubes and half-grids. We then derive a linear-time algorithm that resolves prime vertices in the modular decomposition tree step by step to obtain a rooted and labeled median graph that explains a given symmetric map $\delta$. We argue that the resulting tree-like median graphs may be of use in phylogenetics as a model of evolutionary relationships.
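
The condition $\delta(x,y)=t(\mathrm{lca}(x,y))$ can be checked directly for a given tree. The following is a minimal sketch under assumed encodings: the tree is a hypothetical `parent` dictionary (the root maps to `None`), `label` holds $t$ on inner vertices, and `delta` lists one entry per unordered pair of leaves. It only verifies a given tree; it does not construct one and is unrelated to the median-graph algorithm of the paper.

```python
def lca(parent, x, y):
    """Last common ancestor of x and y in a rooted tree given by parent pointers."""
    ancestors = set()
    node = x
    while node is not None:          # collect x and all of its ancestors
        ancestors.add(node)
        node = parent.get(node)
    node = y
    while node not in ancestors:     # climb from y until the paths meet
        node = parent[node]
    return node


def explains(parent, label, delta):
    """True if delta(x, y) equals the label of lca(x, y) for every listed pair."""
    return all(label[lca(parent, x, y)] == value
               for (x, y), value in delta.items())


# Tree ((a, b)u, c)r with t(u) = 1 and t(r) = 0.
parent = {"a": "u", "b": "u", "u": "r", "c": "r", "r": None}
label = {"u": 1, "r": 0}

delta_ok = {("a", "b"): 1, ("a", "c"): 0, ("b", "c"): 0}
delta_bad = {("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 0}

print(explains(parent, label, delta_ok))   # True
print(explains(parent, label, delta_bad))  # False: lca(a, c) = r has label 0
```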

Combinatorics, Discrete Mathematics

Indirect Identification of Horizontal Gene Transfer

77 - David Schaller , Manuel Lafond , Peter F. Stadler 2020

Several implicit methods to infer Horizontal Gene Transfer (HGT) focus on pairs of genes that have diverged only after the divergence of the two species in which the genes reside. This situation defines the edge set of a graph, the later-divergence-time (LDT) graph, whose vertices correspond to genes colored by their species. We investigate these graphs in the setting of relaxed scenarios, i.e., evolutionary scenarios that encompass all commonly used variants of duplication-transfer-loss scenarios in the literature. We characterize LDT graphs as a subclass of properly vertex-colored cographs, and provide a polynomial-time recognition algorithm as well as an algorithm to construct a relaxed scenario that explains a given LDT. An edge in an LDT graph implies that the two corresponding genes are separated by at least one HGT event. The converse is not true, however. We show that the complete xenology relation is described by an rs-Fitch graph, i.e., a complete multipartite graph satisfying constraints on the vertex coloring. This class of vertex-colored graphs is also recognizable in polynomial time. We finally address the question how much information about all HGT events is contained in LDT graphs with the help of simulations of evolutionary scenarios with a wide range of duplication, loss, and HGT events. In particular, we show that a simple greedy graph editing scheme can be used to efficiently detect HGT events that are implicitly contained in LDT graphs.
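
The edge rule that defines an LDT graph can be written down in a few lines. This is only an illustrative sketch: it assumes that divergence ages are given as "time before present" (so a later divergence has a smaller age), and the function name and the dictionary encodings are hypothetical. The recognition and scenario-construction algorithms of the paper are not reproduced here.

```python
from itertools import combinations


def ldt_edges(species_of, gene_div_age, species_div_age):
    """Return the edges {x, y} whose genes diverged later than their species.

    species_of:       gene -> species (the vertex coloring)
    gene_div_age:     frozenset({x, y}) -> age of the gene divergence
    species_div_age:  frozenset({A, B}) -> age of the species divergence
    Ages are assumed to be measured backwards from the present.
    """
    edges = set()
    for x, y in combinations(sorted(species_of), 2):
        sx, sy = species_of[x], species_of[y]
        if sx == sy:
            continue  # genes of the same species are never adjacent
        if gene_div_age[frozenset((x, y))] < species_div_age[frozenset((sx, sy))]:
            edges.add(frozenset((x, y)))
    return edges


species_of = {"a1": "A", "b1": "B", "b2": "B"}
species_div_age = {frozenset(("A", "B")): 10.0}
gene_div_age = {
    frozenset(("a1", "b1")): 4.0,   # younger than the species split
    frozenset(("a1", "b2")): 10.0,  # coincides with the species split
    frozenset(("b1", "b2")): 12.0,  # same species, ignored
}

print(ldt_edges(species_of, gene_div_age, species_div_age))
```

In this toy input only the pair `a1`, `b1` yields an edge, which by the abstract would indicate at least one HGT event separating the two genes.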

Populations and Evolution, Discrete Mathematics, Data Structures and Algorithms

Complexity of modification problems for best match graphs

175 - David Schaller , Peter F. Stadler , Marc Hellmuth 2020

Best match graphs (BMGs) are vertex-colored directed graphs that were introduced to model the relationships of genes (vertices) from different species (colors) given an underlying evolutionary tree that is assumed to be unknown. In real-life applications, BMGs are estimated from sequence similarity data. Measurement noise and approximation errors usually result in empirically determined graphs that in general violate characteristic properties of BMGs. The arc modification problems for BMGs aim at correcting such violations and thus provide a means to improve the initial estimates of best match data. We show here that the arc deletion, arc completion and arc editing problems for BMGs are NP-complete and that they can be formulated and solved as integer linear programs. To this end, we provide a novel characterization of BMGs in terms of triples (binary trees on three leaves) and a characterization of BMGs with two colors in terms of forbidden subgraphs.
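
How a best match digraph might be estimated from similarity data can be sketched as follows. This is an assumption-laden illustration, not the authors' pipeline: a gene `y` is taken as a best match of `x` if no other gene of the same species as `y` has a higher similarity score to `x`, and ties are kept. The function name and input encoding are hypothetical, and the arc modification problems discussed in the abstract are not addressed.

```python
from collections import defaultdict


def best_match_arcs(color, similarity):
    """Return the set of arcs (x, y) such that y is a best match of x.

    color:       gene -> species
    similarity:  (x, y) -> score, given for ordered pairs of genes
    """
    # Highest score seen from gene x to any gene of a given other species.
    best = defaultdict(lambda: float("-inf"))
    for (x, y), score in similarity.items():
        if color[x] != color[y]:
            key = (x, color[y])
            best[key] = max(best[key], score)

    # Keep every arc that attains the maximum (ties yield several best matches).
    return {(x, y) for (x, y), score in similarity.items()
            if color[x] != color[y] and score == best[(x, color[y])]}


color = {"a1": "A", "b1": "B", "b2": "B"}
similarity = {
    ("a1", "b1"): 0.9, ("b1", "a1"): 0.9,
    ("a1", "b2"): 0.7, ("b2", "a1"): 0.7,
}

print(sorted(best_match_arcs(color, similarity)))
# [('a1', 'b1'), ('b1', 'a1'), ('b2', 'a1')]
```

The result is not symmetric: `a1` is the best match of `b2` in species `A`, but `b2` is not a best match of `a1`, because `b1` scores higher.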

Computational Complexity, Discrete Mathematics, Populations and Evolution

Complete Characterization of Incorrect Orthology Assignments in Best Match Graphs

99 - David Schaller , Manuela Geiß , Peter F. Stadler 2020

Genome-scale orthology assignments are usually based on reciprocal best matches. In the absence of horizontal gene transfer (HGT), every pair of orthologs forms a reciprocal best match. Incorrect orthology assignments therefore are always false positives in the reciprocal best match graph. We consider duplication/loss scenarios and characterize unambiguous false-positive (u-fp) orthology assignments, that is, edges in the best match graphs (BMGs) that cannot correspond to orthologs for any gene tree that explains the BMG. Moreover, we provide a polynomial-time algorithm to identify all u-fp orthology assignments in a BMG. Simulations show that at least $75\%$ of all incorrect orthology assignments can be detected in this manner. All results rely only on the structure of the BMGs and not on any a priori knowledge about underlying gene or species trees.

Populations and Evolution, Discrete Mathematics, Data Structures and Algorithms

A Simple Data-Adaptive Probabilistic Variant Calling Model

99 - Steve Hoffmann , Peter F. Stadler , Korbinian Strimmer 2014

Background: Several sources of noise obfuscate the identification of single nucleotide variation (SNV) in next generation sequencing data. For instance, errors may be introduced during library construction and sequencing steps. In addition, the reference genome and the algorithms used for the alignment of the reads are further critical factors determining the efficacy of variant calling methods. It is crucial to account for these factors in individual sequencing experiments. Results: We introduce a simple data-adaptive model for variant calling. This model automatically adjusts to specific factors such as alignment errors. To achieve this, several characteristics are sampled from sites with low mismatch rates, and these are used to estimate empirical log-likelihoods. These likelihoods are then combined into a score that typically gives rise to a mixture distribution. From these we determine a decision threshold to separate potentially variant sites from the noisy background. Conclusions: In simulations we show that our simple proposed model is competitive with frequently used, much more complex SNV calling algorithms in terms of sensitivity and specificity. It performs particularly well in cases with low allele frequencies. The application to next-generation sequencing data reveals stark differences of the score distributions, indicating a strong influence of data-specific sources of noise. The proposed model is specifically designed to adjust to these differences.

Genomics, Statistics Applications
