
The Hydrostructure: a Universal Framework for Safe and Complete Algorithms for Genome Assembly

Published by: Sebastian Schmidt
Publication date: 2020
Research field: Informatics Engineering
Paper language: English





Genome assembly is a fundamental problem in Bioinformatics: it requires reconstructing a source genome from an assembly graph built from a set of reads (short strings sequenced from the genome). One notion of a genome assembly solution is an arc-covering walk of the graph. Since assembly graphs admit many solutions, the goal is to find what is definitely present in all solutions, that is, what is safe. Most practical assemblers are based on heuristics that have unitigs at their core, namely paths whose internal nodes have unit in-degree and out-degree, and which are clearly safe. The long-standing open problem of finding all the safe parts of the solutions was recently solved by a major theoretical result [RECOMB16]. This safe and complete genome assembly algorithm was followed by other works that improved the time bounds and extended the results to different notions of assembly solution. However, it remained open whether one can also be complete for models of genome assembly of practical applicability. In this paper we present a universal framework for obtaining safe and complete algorithms, which unifies the previous results while also allowing for easy generalisations to assembly problems that include many practical aspects. It is based on a novel graph structure, called the hydrostructure of a walk, which highlights the reachability properties of the graph from the perspective of the walk. The hydrostructure allows for simple characterisations of the existing safe walks, and of their new practic
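To make the definition of a unitig concrete, here is a minimal Python sketch that enumerates the maximal unitigs of a small directed graph. It is only an illustration of the definition quoted in the abstract (internal nodes with unit in-degree and out-degree), not the hydrostructure algorithm of the paper; the function name `maximal_unitigs` and the arc-list input format are assumptions made for this example.

```python
from collections import defaultdict


def maximal_unitigs(arcs):
    """Return the maximal unitigs (as node lists) of a directed graph.

    `arcs` is a list of (u, v) pairs; this input format is an assumption
    of the sketch, not something prescribed by the paper.
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for u, v in arcs:
        succ[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    nodes = set(indeg) | set(outdeg)

    def is_internal(x):
        # Only nodes with unit in-degree and out-degree may lie strictly
        # inside a unitig.
        return indeg[x] == 1 and outdeg[x] == 1

    unitigs = []
    visited = set()
    # Unitigs that start at a node which cannot be internal.
    for u in sorted(nodes):
        if is_internal(u):
            continue
        for v in succ[u]:
            path = [u, v]
            while is_internal(v):
                visited.add(v)
                v = succ[v][0]
                path.append(v)
            unitigs.append(path)
    # Leftover cycles that consist only of unit-degree nodes.
    for u in sorted(nodes):
        if is_internal(u) and u not in visited:
            path = [u]
            visited.add(u)
            v = succ[u][0]
            while v != u:
                visited.add(v)
                path.append(v)
                v = succ[v][0]
            path.append(u)
            unitigs.append(path)
    return unitigs


example = maximal_unitigs([("a", "b"), ("b", "c"), ("c", "d"), ("c", "e")])
```

In the example graph, node `b` is the only one with unit in-degree and out-degree, so the arcs `a -> b -> c` merge into one unitig while the two arcs leaving `c` stay separate.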




Read also

We develop a framework for the rigorous analysis of focused stochastic local search algorithms. These are algorithms that search a state space by repeatedly selecting some constraint that is violated in the current state and moving to a random nearby state that addresses the violation, while hopefully not introducing many new ones. An important class of focused local search algorithms with provable performance guarantees has recently arisen from algorithmizations of the Lovász Local Lemma (LLL), a non-constructive tool for proving the existence of satisfying states by introducing a background measure on the state space. While powerful, the state transitions of algorithms in this class must be, in a precise sense, perfectly compatible with the background measure. In many applications this is a very restrictive requirement and one needs to step outside the class. Here we introduce the notion of measure distortion and develop a framework for analyzing arbitrary focused stochastic local search algorithms, recovering LLL algorithmizations as the special case of no distortion. Our framework takes as input an arbitrary such algorithm and an arbitrary probability measure and shows how to use the measure as a yardstick of algorithmic progress, even for algorithms designed independently of the measure.
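The loop described above (pick a violated constraint, move to a random nearby state that addresses it) can be sketched for CNF satisfiability as follows. This is a minimal illustration in the spirit of Moser-Tardos resampling, assuming a uniform background measure; it does not implement the measure-distortion analysis of the paper. The function name `focused_local_search`, the signed-integer clause encoding, and the `max_steps` cap are assumptions made for this example.

```python
import random


def focused_local_search(clauses, n_vars, max_steps=10_000, seed=0):
    """Resample the variables of a violated clause until none is violated.

    A clause is a tuple of non-zero integers: literal k > 0 means
    "variable k is true", k < 0 means "variable |k| is false"
    (an encoding assumed for this sketch).
    Returns a satisfying assignment as a list of booleans, or None if
    the step budget runs out.
    """
    rng = random.Random(seed)
    assign = [rng.random() < 0.5 for _ in range(n_vars)]

    def violated(clause):
        return not any(assign[abs(lit) - 1] == (lit > 0) for lit in clause)

    for _ in range(max_steps):
        bad = [c for c in clauses if violated(c)]
        if not bad:
            return assign
        # Focus on one violated clause and resample only its variables.
        for lit in bad[0]:
            assign[abs(lit) - 1] = rng.random() < 0.5
    return None


clauses = [(1, 2), (-1, 3), (-2, -3)]
result = focused_local_search(clauses, n_vars=3)
```

Selecting the first violated clause is an arbitrary choice here; any selection rule fits the same loop.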
Given two independent sets $I, J$ of a graph $G$, imagine that a token (coin) is placed at each vertex of $I$. The Sliding Token problem asks whether one can transform $I$ to $J$ via a sequence of elementary steps, where each step slides a token from one vertex to one of its neighbors so that the resulting set of vertices where tokens are placed remains independent. This problem is $\mathsf{PSPACE}$-complete even for planar graphs of maximum degree $3$ and bounded-treewidth. In this paper, we show that Sliding Token can be solved efficiently for cactus graphs and block graphs, and give upper bounds on the length of a transformation sequence between any two independent sets of these graph classes. Our algorithms are based on two main observations. First, all structures that forbid the existence of a sequence of token slidings between $I$ and $J$, if they exist, can be found in polynomial time. A sufficient condition for determining no-instances can be easily derived using this characterization. Second, without such forbidden structures, a sequence of token slidings between $I$ and $J$ does exist. In this case, one can indeed transform $I$ to $J$ (and vice versa) using a polynomial number of token-slides.
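The problem statement can be made concrete with a brute-force reachability check. The Python sketch below explores all token configurations by breadth-first search, so it takes exponential time in general and is not the polynomial-time algorithm for cactus and block graphs described above. The function name `sliding_token_reachable` and the adjacency-dictionary input format are assumptions made for this example.

```python
from collections import deque


def sliding_token_reachable(adj, start, target):
    """Decide by exhaustive BFS whether `start` can be slid into `target`.

    `adj` maps each vertex to the set of its neighbours (format assumed
    for this sketch). Both token sets must be independent in the graph.
    """
    start, target = frozenset(start), frozenset(target)

    def independent(tokens):
        # No token may have a neighbour that also carries a token.
        return all(adj[u].isdisjoint(tokens) for u in tokens)

    if len(start) != len(target):
        return False
    if not (independent(start) and independent(target)):
        return False

    seen = {start}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == target:
            return True
        for u in cur:
            for w in adj[u]:
                if w in cur:
                    continue
                nxt = (cur - {u}) | {w}  # slide the token on u along edge uw
                if nxt not in seen and independent(nxt):
                    seen.add(nxt)
                    queue.append(nxt)
    return False


path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
star = {"c": {"a", "b", "d"}, "a": {"c"}, "b": {"c"}, "d": {"c"}}
```

On the path `1-2-3-4`, tokens on `{1, 3}` can reach `{2, 4}` by first sliding `3` to `4` and then `1` to `2`. On the star, two tokens on leaves are frozen, because any slide would move a token onto the centre, which is adjacent to the other token.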
Current models for the folding of the human genome see a hierarchy stretching down from chromosome territories, through A/B compartments and TADs (topologically-associating domains), to contact domains stabilized by cohesin and CTCF. However, molecular mechanisms underlying this folding, and the way folding affects transcriptional activity, remain obscure. Here we review physical principles driving proteins bound to long polymers into clusters surrounded by loops, and present a parsimonious yet comprehensive model for the way the organization determines function. We argue that clusters of active RNA polymerases and their transcription factors are major architectural features; then, contact domains, TADs, and compartments just reflect one or more loops and clusters. We suggest tethering a gene close to a cluster containing appropriate factors -- a transcription factory -- increases the firing frequency, and offer solutions to many current puzzles concerning the actions of enhancers, super-enhancers, boundaries, and eQTLs (expression quantitative trait loci). As a result, the activity of any gene is directly influenced by the activity of other transcription units around it in 3D space, and this is supported by Brownian-dynamics simulations of transcription factors binding to cognate sites on long polymers.
An edge-coloring of a graph $G$ with colors $1,2,\ldots,t$ is an interval $t$-coloring if all colors are used, and the colors of edges incident to each vertex of $G$ are distinct and form an interval of integers. A graph $G$ is interval colorable if it has an interval $t$-coloring for some positive integer $t$. For an interval colorable graph $G$, $W(G)$ denotes the greatest value of $t$ for which $G$ has an interval $t$-coloring. It is known that the complete graph is interval colorable if and only if the number of its vertices is even. However, the exact value of $W(K_{2n})$ is known only for $n \leq 4$. The second author showed that if $n = p2^q$, where $p$ is odd and $q$ is nonnegative, then $W(K_{2n}) \geq 4n-2-p-q$. Later, he conjectured that if $n \in \mathbb{N}$, then $W(K_{2n}) = 4n - 2 - \left\lfloor \log_2{n} \right\rfloor - \left| n_2 \right|$, where $\left| n_2 \right|$ is the number of $1$s in the binary representation of $n$. In this paper we introduce a new technique to construct interval colorings of complete graphs based on their 1-factorizations, which is used to disprove the conjecture, improve lower and upper bounds on $W(K_{2n})$ and determine its exact values for $n \leq 12$.
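The definition of an interval $t$-coloring can be checked mechanically. The Python sketch below is a plain verifier for the definition stated above, not the construction technique of the paper; the function name `is_interval_coloring` and the edge-to-color dictionary format are assumptions made for this example.

```python
from collections import defaultdict


def is_interval_coloring(edge_colors, t):
    """Check whether `edge_colors` is an interval t-coloring.

    `edge_colors` maps each edge (u, v) to its color, an integer in
    1..t (input format assumed for this sketch).
    """
    # Every color 1..t must be used, and no other color may appear.
    if set(edge_colors.values()) != set(range(1, t + 1)):
        return False
    at_vertex = defaultdict(list)
    for (u, v), color in edge_colors.items():
        at_vertex[u].append(color)
        at_vertex[v].append(color)
    for colors in at_vertex.values():
        # Colors at a vertex must be pairwise distinct ...
        if len(set(colors)) != len(colors):
            return False
        # ... and must form a set of consecutive integers.
        if max(colors) - min(colors) + 1 != len(colors):
            return False
    return True


# K_4 colored by its three perfect matchings (a 1-factorization).
k4 = {(0, 1): 1, (2, 3): 1, (0, 2): 2, (1, 3): 2, (0, 3): 3, (1, 2): 3}
# A triangle with three distinct colors.
k3 = {(0, 1): 1, (1, 2): 2, (0, 2): 3}
```

The coloring of `k4` passes because every vertex sees exactly the colors 1, 2, 3. The triangle fails because vertex `0` sees colors 1 and 3 only, which is consistent with the fact quoted above that complete graphs on an odd number of vertices are not interval colorable.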
In this short note, we show two NP-completeness results regarding the simultaneous representation problem, introduced by Lubiw and Jampani. The simultaneous representation problem for a given class of intersection graphs asks if some $k$ graphs can be represented so that every vertex is represented by the same interval in each representation. We prove that it is NP-complete to decide this for the class of interval and circular-arc graphs in the case when $k$ is a part of the input and graphs are not in a sunflower position.