
Efficient repeat finding via suffix arrays

Added by Verónica Becher
Publication date: 2013
Research language: English





We solve the problem of finding interspersed maximal repeats using a suffix array construction. As is well known, all the functionality of suffix trees can be handled by suffix arrays, gaining practicality. Our solution improves on the suffix-tree-based approaches to the repeat finding problem and is particularly well suited for very large inputs. We prove the correctness and complexity of the algorithms.
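
For intuition, here is a minimal, non-optimized Python sketch of the suffix-array route to the problem: it builds a suffix array and an LCP array (Kasai's algorithm), collects candidate repeats from adjacent suffixes, and filters them down to maximal repeats by brute force. It illustrates the objects the paper works with, not its efficient algorithms, and all helper names are ours.

def suffix_array(s):
    # Naive O(n^2 log n) construction; fine for a small demo string.
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    # Kasai's algorithm: lcp[i] = longest common prefix of suffixes sa[i-1] and sa[i].
    n = len(s)
    rank = [0] * n
    for i, suf in enumerate(sa):
        rank[suf] = i
    lcp, h = [0] * n, 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1
        else:
            h = 0
    return lcp

def maximal_repeats(s):
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    # Candidate repeats: longest common prefixes of adjacent suffixes in the SA.
    candidates = {s[sa[i]:sa[i] + lcp[i]] for i in range(1, len(s)) if lcp[i] > 0}
    result = []
    for w in candidates:
        occ = [i for i in range(len(s) - len(w) + 1) if s.startswith(w, i)]
        left = {s[i - 1] if i > 0 else None for i in occ}
        right = {s[i + len(w)] if i + len(w) < len(s) else None for i in occ}
        # Maximal: at least two occurrences, extendable neither to the left nor to the right.
        if len(occ) >= 2 and len(left) > 1 and len(right) > 1:
            result.append((w, occ))
    return result

print(maximal_repeats("abracadabra"))   # [('a', [0, 3, 5, 7, 10]), ('abra', [0, 7])] in some order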



Related research

Given a string $T$, it is known that its suffix tree can be represented using the compact directed acyclic word graph (CDAWG) with $e_T$ arcs, taking overall $O(e_T + e_{\overline{T}})$ words of space, where $\overline{T}$ is the reverse of $T$, and supporting some key operations in time between $O(1)$ and $O(\log\log n)$ in the worst case. This representation is especially appealing for highly repetitive strings, like collections of similar genomes or of version-controlled documents, in which $e_T$ grows sublinearly in the length of $T$ in practice. In this paper we augment such a representation, supporting a number of additional queries in worst-case time between $O(1)$ and $O(\log n)$ in the RAM model, without increasing space complexity asymptotically. Our technique, based on a heavy path decomposition of the suffix tree, also enables a representation of the suffix array, of the inverse suffix array, and of $T$ itself, that takes $O(e_T)$ words of space and supports random access in $O(\log n)$ time. Furthermore, we establish a connection between the reversed CDAWG of $T$ and a context-free grammar that produces $T$ and only $T$, which might be of independent interest.
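
The heavy path decomposition underlying the technique can be illustrated on an ordinary rooted tree; the toy sketch below (a plain dictionary tree, not a suffix tree inside a CDAWG, with all names ours) marks each node's child with the largest subtree as heavy and cuts the tree into heavy paths, so that any root-to-leaf walk crosses only $O(\log n)$ of them.

def heavy_path_decomposition(children, root=0):
    # children: dict mapping each internal node to its list of children.
    size = {}
    def subtree_size(u):
        size[u] = 1 + sum(subtree_size(c) for c in children.get(u, []))
        return size[u]
    subtree_size(root)

    paths = []
    def build(u):
        path = [u]
        while children.get(u):
            heavy = max(children[u], key=lambda c: size[c])
            for c in children[u]:
                if c != heavy:
                    build(c)          # each light child starts a new path
            path.append(heavy)
            u = heavy
        paths.append(path)
    build(root)
    return paths

tree = {0: [1, 2], 1: [3, 4, 5], 2: [6], 4: [7, 8]}
print(heavy_path_decomposition(tree))   # [[2, 6], [3], [5], [8], [0, 1, 4, 7]]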
171 - Jack Wang 2012
Let $\mathscr{O}$ be a set of $n$ disjoint obstacles in $\mathbb{R}^2$ and let $\mathscr{M}$ be a moving object. Let $s$ and $l$ denote the starting point and the maximum path length of the moving object $\mathscr{M}$, respectively. Given a point $p$ in $\mathbb{R}^2$, we say that $p$ is achievable for $\mathscr{M}$ if $\pi(s,p) \leq l$, where $\pi(\cdot)$ denotes the shortest path length in the presence of obstacles. The problem is to find a region $\mathscr{R}$ such that, for any point $p \in \mathbb{R}^2$, if $p$ is achievable for $\mathscr{M}$ then $p \in \mathscr{R}$, and otherwise $p \notin \mathscr{R}$. In this paper, we restrict our attention to the case of line-segment obstacles. To tackle this problem, we develop three algorithms. We first present a simpler-version algorithm for the sake of intuition. Its basic idea is to reduce our problem to computing the union of a set of circular visibility regions (CVRs). This algorithm takes $O(n^3)$ time. By analysing its dominant steps, we break through its bottleneck by using the shortest path map (SPM) technique to obtain those circles (which are unavailable beforehand), yielding an $O(n^2 \log n)$ algorithm. Building on this finding, the third algorithm also uses the SPM technique. It does not, however, construct the CVRs; instead, it directly traverses each region of the SPM to trace the boundaries, yielding an $O(n \log n)$ algorithm.
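
As a rough illustration of the primitive $\pi(s,p)$ used above, the sketch below computes shortest obstacle-avoiding path lengths among line-segment obstacles with a visibility graph and Dijkstra, and then tests whether a point is achievable within the budget $l$. Degenerate (collinear) configurations are ignored, all names are illustrative, and this is in no way the paper's $O(n \log n)$ method.

import heapq
from math import dist

def crosses(p, q, a, b):
    # True iff open segments pq and ab properly cross (degeneracies ignored).
    def orient(o, u, v):
        val = (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])
        return (val > 0) - (val < 0)
    o1, o2, o3, o4 = orient(p, q, a), orient(p, q, b), orient(a, b, p), orient(a, b, q)
    return 0 not in (o1, o2, o3, o4) and o1 != o2 and o3 != o4

def visible(u, v, obstacles):
    return all(not crosses(u, v, a, b) for a, b in obstacles)

def shortest_path_length(s, p, obstacles):
    # Visibility-graph vertices: s, p and every obstacle endpoint.
    nodes = [s, p] + [pt for seg in obstacles for pt in seg]
    graph = {u: [] for u in nodes}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if visible(u, v, obstacles):
                w = dist(u, v)
                graph[u].append((v, w))
                graph[v].append((u, w))
    # Dijkstra from s to p.
    best, heap = {s: 0.0}, [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == p:
            return d
        if d > best.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            if d + w < best.get(v, float("inf")):
                best[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return float("inf")

obstacles = [((1.0, -1.0), (1.0, 1.0))]              # a single vertical segment
s, p, l = (0.0, 0.0), (2.0, 0.0), 3.0
print(shortest_path_length(s, p, obstacles) <= l)    # True: the detour around an endpoint fits in l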
46 - Tomasz Kociumaka 2016
For a text given in advance, substring minimal suffix queries ask to determine the lexicographically minimal non-empty suffix of a substring specified by the location of its occurrence in the text. We develop a data structure answering such queries optimally: in constant time after linear-time preprocessing. This improves upon the results of Babenko et al. (CPM 2014), whose trade-off solution is characterized by a $\Theta(n \log n)$ product of these time complexities. Next, we extend our queries to support concatenations of $O(1)$ substrings, for which the construction and query time is preserved. We apply these generalized queries to compute lexicographically minimal and maximal rotations of a given substring in constant time after linear-time preprocessing. Our data structures mainly rely on properties of Lyndon words and Lyndon factorizations. We combine them with further algorithmic and combinatorial tools, such as fusion trees and the notion of order isomorphism of strings.
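
As background for the tools mentioned, the sketch below implements Duval's linear-time Lyndon factorization together with a brute-force minimal-suffix query for comparison; the paper's data structure answers such queries in constant time after linear preprocessing, which this toy (with names of our choosing) does not attempt.

def lyndon_factorization(s):
    # Duval's algorithm: splits s into Lyndon words w1 >= w2 >= ... in O(n) time.
    i, n, factors = 0, len(s), []
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

def minimal_suffix(text, start, end):
    # Brute-force lexicographically minimal non-empty suffix of text[start:end].
    sub = text[start:end]
    return min(sub[i:] for i in range(len(sub)))

print(lyndon_factorization("banana"))        # ['b', 'an', 'an', 'a']
print(minimal_suffix("mississippi", 0, 4))   # 'iss' (minimal suffix of 'miss')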
A recent series of papers by Andoni, Naor, Nikolov, Razenshteyn, and Waingarten (STOC 2018, FOCS 2018) has given approximate near neighbour search (NNS) data structures for a wide class of distance metrics, including all norms. In particular, these data structures achieve approximation on the order of $p$ for $\ell_p^d$ norms with space complexity nearly linear in the dataset size $n$ and polynomial in the dimension $d$, and query time sub-linear in $n$ and polynomial in $d$. The main shortcoming is the exponential in $d$ pre-processing time required for their construction. In this paper, we describe a more direct framework for constructing NNS data structures for general norms. More specifically, we show via an algorithmic reduction that an efficient NNS data structure for a given metric is implied by an efficient average distortion embedding of it into $\ell_1$ or into Euclidean space. In particular, the resulting data structures require only polynomial pre-processing time, as long as the embedding can be computed in polynomial time. As a concrete instantiation of this framework, we give an NNS data structure for $\ell_p$ with efficient pre-processing that matches the approximation factor, space and query complexity of the aforementioned data structure of Andoni et al. On the way, we resolve a question of Naor (Analysis and Geometry in Metric Spaces, 2014) and provide an explicit, efficiently computable embedding of $\ell_p$, for $p \ge 2$, into $\ell_2$ with (quadratic) average distortion on the order of $p$. We expect our approach to pave the way for constructing efficient NNS data structures for all norms.
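
To make the reduction pattern concrete, here is a hedged sketch: embed points of $\ell_p$ into Euclidean space and answer nearest-neighbour queries in the embedded space. The embedding shown is the classical Mazur map, used purely for illustration and not claimed to be the paper's construction, and the brute-force scan stands in for an actual Euclidean NNS data structure.

import numpy as np

def mazur_map(x, p):
    # Coordinate-wise x -> sign(x) * |x|^(p/2), a classical map from l_p towards l_2.
    return np.sign(x) * np.abs(x) ** (p / 2.0)

def nearest_neighbour_via_embedding(points, q, p):
    # Embed the dataset and the query, then do a brute-force Euclidean scan.
    emb = mazur_map(points, p)
    return int(np.argmin(np.linalg.norm(emb - mazur_map(q, p), axis=1)))

rng = np.random.default_rng(0)
P = rng.standard_normal((1000, 32))
q = rng.standard_normal(32)
p = 4.0
i_emb = nearest_neighbour_via_embedding(P, q, p)               # answer via the embedding
i_true = int(np.argmin(np.sum(np.abs(P - q) ** p, axis=1)))    # exact l_p nearest neighbour
print(i_emb, i_true)                                           # compare the two answers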
Prediction suffix trees (PSTs) provide an effective tool for sequence modelling and prediction. Current prediction techniques for PSTs rely on exact matching between the suffix of the current sequence and the previously observed sequences. We present a provably correct algorithm for learning a PST with approximate suffix matching by relaxing the exact matching condition. We then present a self-bounded enhancement of our algorithm where the depth of the suffix tree grows automatically in response to the model's performance on a training sequence. Through experiments on synthetic datasets as well as three real-world datasets, we show that the approximate-matching PST achieves better predictive performance than the other PST variants.
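
For reference, a minimal exact-matching baseline in the spirit of a PST (a bounded-order Markov predictor over suffix contexts, with names of our choosing) is sketched below; the paper's contributions, approximate suffix matching and self-bounded depth growth, are deliberately not implemented here.

from collections import defaultdict, Counter

class ExactPST:
    def __init__(self, max_depth=3):
        self.max_depth = max_depth
        self.counts = defaultdict(Counter)   # context (tuple of symbols) -> next-symbol counts

    def fit(self, seq):
        # Count the symbol following every context of length 0..max_depth.
        for t, sym in enumerate(seq):
            for d in range(self.max_depth + 1):
                if t - d < 0:
                    break
                self.counts[tuple(seq[t - d:t])][sym] += 1

    def predict(self, history):
        # Back off from the longest stored suffix of the history (exact matching).
        for d in range(min(self.max_depth, len(history)), -1, -1):
            ctx = tuple(history[len(history) - d:])
            if ctx in self.counts:
                return self.counts[ctx].most_common(1)[0][0]
        return None

model = ExactPST(max_depth=2)
model.fit("abracadabra")
print(model.predict("abr"))   # 'a': the most frequent symbol after the context 'br'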
