Computing the rearrangement distance of natural genomes

243 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Jens Stoye

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Leonard Bohnenkamper - Marilia D. V. Braga - Daniel Doerr

بنى وهياكل البيانات والخوارزميات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The computation of genomic distances has been a very active field of computational comparative genomics over the last 25 years. Substantial results include the polynomial-time computability of the inversion distance by Hannenhalli and Pevzner in 1995 and the introduction of the double-cut and join (DCJ) distance by Yancopoulos et al. in 2005. Both results, however, rely on the assumption that the genomes under comparison contain the same set of unique markers (syntenic genomic regions, sometimes also referred to as genes). In 2015, Shao, Lin and Moret relax this condition by allowing for duplicate markers in the analysis. This generalized version of the genomic distance problem is NP-hard, and they give an ILP solution that is efficient enough to be applied to real-world datasets. A restriction of their approach is that it can be applied only to balanced genomes, that have equal numbers of duplicates of any marker. Therefore it still needs a delicate preprocessing of the input data in which excessive copies of unbalanced markers have to be removed. In this paper we present an algorithm solving the genomic distance problem for natural genomes, in which any marker may occur an arbitrary number of times. Our method is based on a new graph data structure, the multi-relational diagram, that allows an elegant extension of the ILP by Shao, Lin and Moret to count runs of markers that are under- or over-represented in one genome with respect to the other and need to be inserted or deleted, respectively. With this extension, previous restrictions on the genome configurations are lifted, for the first time enabling an uncompromising rearrangement analysis. Any marker sequence can directly be used for the distance calculation. The evaluation of our approach shows that it can be used to analyze genomes with up to a few ten thousand markers, which we demonstrate on simulated and real data.

قيم البحث

اقرأ أيضاً

Computing the Inversion-Indel Distance

52 - Eyla Willing , Jens Stoye , Marilia D. V. Braga 2019

The inversion distance, that is the distance between two unichromosomal genomes with the same content allowing only

بنى وهياكل البيانات والخوارزميات

Natural family-free genomic distance

379 - Diego P. Rubert , Fabio V. Martinez , Marilia D. V. Braga 2020

A classical problem in comparative genomics is to compute the rearrangement distance, that is the minimum number of large-scale rearrangements required to transform a given genome into another given genome. While the most traditional approaches in this area are family-based, i.e., require the classification of DNA fragments into families, more recently an alternative family-free approach was proposed, and consists of studying the rearrangement distances without prior family assignment. On the one hand the computation of genomic distances in the family-free setting helps to match occurrences of duplicated genes and find homologies, but on the other hand this computation is NP-hard. In this paper, by letting structural rearrangements be represented by the generic double cut and join (DCJ) operation and also allowing insertions and deletions of DNA segments, we propose a new and more general family-free genomic distance, providing an efficient ILP formulation to solve it. Our experiments show that the ILP produces accurate results and can handle not only bacterial genomes, but also fungi and insects, or subsets of chromosomes of mammals and plants.

بنى وهياكل البيانات والخوارزميات

An $O(n log n)$ time Algorithm for computing the Path-length Distance between Trees

119 - David Bryant , Celine Scornavacca 2018

Tree comparison metrics have proven to be an invaluable aide in the reconstruction and analysis of phylogenetic (evolutionary) trees. The path-length distance between trees is a particularly attractive measure as it reflects differences in tree shape as well as differences between branch lengths. The distance equals the sum, over all pairs of taxa, of the squared differences between the lengths of the unique path connecting them in each tree. We describe an $O(n log n)$ time for computing this distance, making extensive use of tree decomposition techniques introduced by Brodal et al. (2004).

بنى وهياكل البيانات والخوارزميات

Computing the largest bond of a graph

68 - Gabriel L. Duarte , Daniel Lokshtanov , Lehilton L. C. Pedrosa 2019

A bond of a graph $G$ is an inclusion-wise minimal disconnecting set of $G$, i.e., bonds are cut-sets that determine cuts $[S,Vsetminus S]$ of $G$ such that $G[S]$ and $G[Vsetminus S]$ are both connected. Given $s,tin V(G)$, an $st$-bond of $G$ is a bond whose removal disconnects $s$ and $t$. Contrasting with the large number of studies related to maximum cuts, there are very few results regarding the largest bond of general graphs. In this paper, we aim to reduce this gap on the complexity of computing the largest bond and the largest $st$-bond of a graph. Although cuts and bonds are similar, we remark that computing the largest bond of a graph tends to be harder than computing its maximum cut. We show that {sc Largest Bond} remains NP-hard even for planar bipartite graphs, and it does not admit a constant-factor approximation algorithm, unless $P = NP$. We also show that {sc Largest Bond} and {sc Largest $st$-Bond} on graphs of clique-width $w$ cannot be solved in time $f(w)times n^{o(w)}$ unless the Exponential Time Hypothesis fails, but they can be solved in time $f(w)times n^{O(w)}$. In addition, we show that both problems are fixed-parameter tractable when parameterized by the size of the solution, but they do not admit polynomial kernels unless NP $subseteq$ coNP/poly.

بنى وهياكل البيانات والخوارزميات الرياضيات المتقطعة

On Computing the Hamiltonian Index of Graphs

70 - Geevarghese Philip , Rani M. R. , Subashini R 2019

The $r$-th iterated line graph $L^{r}(G)$ of a graph $G$ is defined by: (i) $L^{0}(G) = G$ and (ii) $L^{r}(G) = L(L^{(r- 1)}(G))$ for $r > 0$, where $L(G)$ denotes the line graph of $G$. The Hamiltonian Index $h(G)$ of $G$ is the smallest $r$ such th at $L^{r}(G)$ has a Hamiltonian cycle. Checking if $h(G) = k$ is NP-hard for any fixed integer $k geq 0$ even for subcubic graphs $G$. We study the parameterized complexity of this problem with the parameter treewidth, $tw(G)$, and show that we can find $h(G)$ in time $O*((1 + 2^{(omega + 3)})^{tw(G)})$ where $omega$ is the matrix multiplication exponent and the $O*$ notation hides polynomial factors in input size. The NP-hard Eulerian Steiner Subgraph problem takes as input a graph $G$ and a specified subset $K$ of terminal vertices of $G$ and asks if $G$ has an Eulerian (that is: connected, and with all vertices of even degree.) subgraph $H$ containing all the terminals. A second result (and a key ingredient of our algorithm for finding $h(G)$) in this work is an algorithm which solves Eulerian Steiner Subgraph in $O*((1 + 2^{(omega + 3)})^{tw(G)})$ time.

بنى وهياكل البيانات والخوارزميات الرياضيات المتقطعة

سجل دخول لتتمكن من نشر تعليقات