ترغب بنشر مسار تعليمي؟ اضغط هنا

Efficient repeat finding via suffix arrays

93   0   0.0 ( 0 )
 نشر من قبل Ver\\'onica Becher
 تاريخ النشر 2013
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

We solve the problem of finding interspersed maximal repeats using a suffix array construction. As it is well known, all the functionality of suffix trees can be handled by suffix arrays, gaining practicality. Our solution improves the suffix tree based approaches for the repeat finding problem, being particularly well suited for very large inputs. We prove the corrrectness and complexity of the algorithms.



قيم البحث

اقرأ أيضاً

Given a string $T$, it is known that its suffix tree can be represented using the compact directed acyclic word graph (CDAWG) with $e_T$ arcs, taking overall $O(e_T+e_{{overline{T}}})$ words of space, where ${overline{T}}$ is the reverse of $T$, and supporting some key operations in time between $O(1)$ and $O(log{log{n}})$ in the worst case. This representation is especially appealing for highly repetitive strings, like collections of similar genomes or of version-controlled documents, in which $e_T$ grows sublinearly in the length of $T$ in practice. In this paper we augment such representation, supporting a number of additional queries in worst-case time between $O(1)$ and $O(log{n})$ in the RAM model, without increasing space complexity asymptotically. Our technique, based on a heavy path decomposition of the suffix tree, enables also a representation of the suffix array, of the inverse suffix array, and of $T$ itself, that takes $O(e_T)$ words of space, and that supports random access in $O(log{n})$ time. Furthermore, we establish a connection between the reversed CDAWG of $T$ and a context-free grammar that produces $T$ and only $T$, which might have independent interest.
175 - Jack Wang 2012
Let $mathscr O$ be a set of $n$ disjoint obstacles in $mathbb{R}^2$, $mathscr M$ be a moving object. Let $s$ and $l$ denote the starting point and maximum path length of the moving object $mathscr M$, respectively. Given a point $p$ in ${R}^2$, we sa y the point $p$ is achievable for $mathscr M$ such that $pi(s,p)leq l$, where $pi(cdot)$ denotes the shortest path length in the presence of obstacles. One is to find a region $mathscr R$ such that, for any point $pin mathbb{R}^2$, if it is achievable for $mathscr M$, then $pin mathscr R$; otherwise, $p otin mathscr R$. In this paper, we restrict our attention to the case of line-segment obstacles. To tackle this problem, we develop three algorithms. We first present a simpler-version algorithm for the sake of intuition. Its basic idea is to reduce our problem to computing the union of a set of circular visibility regions (CVRs). This algorithm takes $O(n^3)$ time. By analysing its dominant steps, we break through its bottleneck by using the short path map (SPM) technique to obtain those circles (unavailable beforehand), yielding an $O(n^2log n)$ algorithm. Owing to the finding above, the third algorithm also uses the SPM technique. It however, does not continue to construct the CVRs. Instead, it directly traverses each region of the SPM to trace the boundaries, the final algorithm obtains $O(nlog n)$ complexity.
46 - Tomasz Kociumaka 2016
For a text given in advance, the substring minimal suffix queries ask to determine the lexicographically minimal non-empty suffix of a substring specified by the location of its occurrence in the text. We develop a data structure answering such queri es optimally: in constant time after linear-time preprocessing. This improves upon the results of Babenko et al. (CPM 2014), whose trade-off solution is characterized by $Theta(nlog n)$ product of these time complexities. Next, we extend our queries to support concatenations of $O(1)$ substrings, for which the construction and query time is preserved. We apply these generalized queries to compute lexicographically minimal and maximal rotations of a given substring in constant time after linear-time preprocessing. Our data structures mainly rely on properties of Lyndon words and Lyndon factorizations. We combine them with further algorithmic and combinatorial tools, such as fusion trees and the notion of order isomorphism of strings.
A recent series of papers by Andoni, Naor, Nikolov, Razenshteyn, and Waingarten (STOC 2018, FOCS 2018) has given approximate near neighbour search (NNS) data structures for a wide class of distance metrics, including all norms. In particular, these d ata structures achieve approximation on the order of $p$ for $ell_p^d$ norms with space complexity nearly linear in the dataset size $n$ and polynomial in the dimension $d$, and query time sub-linear in $n$ and polynomial in $d$. The main shortcoming is the exponential in $d$ pre-processing time required for their construction. In this paper, we describe a more direct framework for constructing NNS data structures for general norms. More specifically, we show via an algorithmic reduction that an efficient NNS data structure for a given metric is implied by an efficient average distortion embedding of it into $ell_1$ or into Euclidean space. In particular, the resulting data structures require only polynomial pre-processing time, as long as the embedding can be computed in polynomial time. As a concrete instantiation of this framework, we give an NNS data structure for $ell_p$ with efficient pre-processing that matches the approximation factor, space and query complexity of the aforementioned data structure of Andoni et al. On the way, we resolve a question of Naor (Analysis and Geometry in Metric Spaces, 2014) and provide an explicit, efficiently computable embedding of $ell_p$, for $p ge 2$, into $ell_2$ with (quadratic) average distortion on the order of $p$. We expect our approach to pave the way for constructing efficient NNS data structures for all norms.
Prediction suffix trees (PST) provide an effective tool for sequence modelling and prediction. Current prediction techniques for PSTs rely on exact matching between the suffix of the current sequence and the previously observed sequence. We present a provably correct algorithm for learning a PST with approximate suffix matching by relaxing the exact matching condition. We then present a self-bounded enhancement of our algorithm where the depth of suffix tree grows automatically in response to the model performance on a training sequence. Through experiments on synthetic datasets as well as three real-world datasets, we show that the approximate matching PST results in better predictive performance than the other variants of PST.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا