
Gapped Indexing for Consecutive Occurrences

Posted by: Teresa Anna Steiner
Publication date: 2021
Research field: Informatics Engineering
Paper language: English





The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant of string indexing, where the goal is to compactly represent the string such that, given two patterns P1 and P2 and a gap range [alpha, beta], we can quickly find the consecutive occurrences of P1 and P2 with distance in [alpha, beta], i.e., pairs of occurrences immediately following each other and with distance within the range. We present data structures that use ~O(n) space and query time ~O(|P1|+|P2|+n^(2/3)) for existence and counting and ~O(|P1|+|P2|+n^(2/3)*occ^(1/3)) for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using ~O(n) space must use ~Omega(|P1|+|P2|+sqrt(n)) query time. To obtain our results we develop new techniques and ideas of independent interest, including a new suffix tree decomposition and hardness of a variant of the set intersection problem.
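To make the query type concrete, here is a minimal brute-force baseline in Python. It is not the paper's ~O(n)-space data structure: it lists the occurrences of P1 and P2 by naive scanning, interprets a consecutive occurrence as a pair (i, j) where j is the closest occurrence of P2 after i with no occurrence of P1 in between (an assumption about the exact definition), and keeps pairs whose distance lies in [alpha, beta]. All function and variable names are illustrative.

```python
from bisect import bisect_left

def occurrences(S, P):
    """Sorted starting positions of P in S, found by naive scanning."""
    return [i for i in range(len(S) - len(P) + 1) if S[i:i + len(P)] == P]

def gapped_consecutive(S, P1, P2, alpha, beta):
    """Pairs (i, j): i an occurrence of P1, j the closest occurrence of P2
    after i, no occurrence of P1 strictly between them, and
    alpha <= j - i <= beta."""
    occ1, occ2 = occurrences(S, P1), occurrences(S, P2)
    out = []
    for i in occ1:
        k = bisect_left(occ2, i + 1)          # first P2 occurrence after i
        if k == len(occ2):
            continue
        j = occ2[k]
        nxt = bisect_left(occ1, i + 1)        # next P1 occurrence after i
        if nxt < len(occ1) and occ1[nxt] < j:
            continue                          # not consecutive: P1 reoccurs first
        if alpha <= j - i <= beta:
            out.append((i, j))
    return out

# Example: S = "abbaabcba", P1 = "ab", P2 = "ba", gap range [1, 3]
print(gapped_consecutive("abbaabcba", "ab", "ba", 1, 3))  # [(0, 2), (4, 7)]
```

This baseline spends time proportional to n * (|P1| + |P2|) per query, far from the stated ~O(|P1|+|P2|+n^(2/3)) bounds, which is what the paper's suffix tree decomposition is designed to avoid.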




Read also

Given a string $S$ of length $n$, the classic string indexing problem is to preprocess $S$ into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv compression scheme. Along the way we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern.
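For intuition about what "compressed form" means here, the following is a minimal Python sketch of a greedy, self-referencing LZ77 factorization. It only illustrates the compressed representation of a pattern and is not the paper's index; the function name and the phrase encoding (a literal character, or a (start, length) copy from the processed prefix) are assumptions made for illustration.

```python
def lz77_factorize(s):
    """Greedy self-referencing LZ77: return a list of phrases, each either a
    single literal character or a (start, length) copy from the prefix
    already processed."""
    factors, i = [], 0
    while i < len(s):
        best_len, best_start = 0, 0
        for start in range(i):                      # try every earlier start
            l = 0
            while i + l < len(s) and s[start + l] == s[i + l]:
                l += 1
            if l > best_len:
                best_len, best_start = l, start
        if best_len == 0:
            factors.append(s[i])                    # fresh symbol: literal phrase
            i += 1
        else:
            factors.append((best_start, best_len))  # copy phrase
            i += best_len
    return factors

print(lz77_factorize("ababbababa"))  # ['a', 'b', (0, 2), (1, 3), (0, 3)]
```

A pattern with z such phrases can be much shorter than its plain length, which is why query time measured in the compressed size matters in the client-server setting described above.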
In the 3SUM-Indexing problem the goal is to preprocess two lists of elements from $U$, $A=(a_1,a_2,\ldots,a_n)$ and $B=(b_1,b_2,\ldots,b_n)$, such that given an element $c \in U$ one can quickly determine whether there exists a pair $(a,b) \in A \times B$ where $a+b=c$. Goldstein et al. [WADS 2017] conjectured that there is no algorithm for 3SUM-Indexing which uses $n^{2-\Omega(1)}$ space and $n^{1-\Omega(1)}$ query time. We show that the conjecture is false by reducing the 3SUM-Indexing problem to the problem of inverting functions, and then applying an algorithm of Fiat and Naor [SICOMP 1999] for inverting functions.
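As context for the conjecture, here is a small Python sketch of the two trivial trade-offs for 3SUM-Indexing: quadratic space with constant query time versus linear space with linear query time. This is not the Fiat-Naor function-inversion approach from the abstract; all names are illustrative.

```python
def preprocess_quadratic(A, B):
    """O(n^2) space: store every pairwise sum in a hash set."""
    return {a + b for a in A for b in B}

def query_quadratic(sums, c):
    """O(1) expected query time against the precomputed sums."""
    return c in sums

def query_linear(A, B, c):
    """O(n) space (just the sorted inputs), O(n) query via two pointers."""
    A, B = sorted(A), sorted(B)
    i, j = 0, len(B) - 1
    while i < len(A) and j >= 0:
        s = A[i] + B[j]
        if s == c:
            return True
        if s < c:
            i += 1
        else:
            j -= 1
    return False

A, B = [1, 5, 9], [2, 4, 8]
sums = preprocess_quadratic(A, B)
print(query_quadratic(sums, 13), query_linear(A, B, 13))  # True True  (5 + 8)
print(query_quadratic(sums, 2), query_linear(A, B, 2))    # False False
```

The conjecture asserted that nothing substantially better than these extremes is possible; the reduction to function inversion in the abstract refutes it.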
We aim to study the set of color sets of continuous regions of an image given as a matrix of $m$ rows over $n \geq m$ columns where each element in the matrix is an integer from $[1,\sigma]$ named a \emph{color}. The set of distinct colors in a region is called its fingerprint. We aim to compute, index and query the fingerprints of all rectangular regions named rectangles. The set of all such fingerprints is denoted by $\mathcal{F}$. A rectangle is \emph{maximal} if it is not contained in a greater rectangle with the same fingerprint. The set of all locations of maximal rectangles is denoted by $\mathcal{L}$. We first explain how to determine all the $|\mathcal{L}|$ maximal locations with their fingerprints in expected time $O(nm^2\sigma)$ using a Monte Carlo algorithm (with polynomially small probability of error) or within deterministic $O(nm^2\sigma\log(\frac{|\mathcal{L}|}{nm^2}+2))$ time. We then show how to build a data structure which occupies $O(nm\log n+|\mathcal{L}|)$ space such that a query which asks for all the maximal locations with a given fingerprint $f$ can be answered in time $O(|f|+\log\log n+k)$, where $k$ is the number of maximal locations with fingerprint $f$. If the query asks only for the presence of the fingerprint, then the space usage becomes $O(nm\log n+|\mathcal{F}|)$ while the query time becomes $O(|f|+\log\log n)$. We eventually consider the special case of squared regions (squares).
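To pin down the notion of a fingerprint, here is a purely illustrative brute-force Python sketch that enumerates every rectangle of a small color matrix and groups rectangle locations by their fingerprint (set of distinct colors). It ignores maximality and is far from the $O(nm^2\sigma)$ algorithm cited; the names are illustrative only.

```python
from collections import defaultdict

def fingerprints_of_all_rectangles(M):
    """Map each fingerprint (frozenset of colors) to the list of rectangles
    (r1, c1, r2, c2) of the m x n matrix M that have that fingerprint."""
    m, n = len(M), len(M[0])
    by_fingerprint = defaultdict(list)
    for r1 in range(m):
        for r2 in range(r1, m):
            for c1 in range(n):
                for c2 in range(c1, n):
                    colors = frozenset(M[r][c]
                                       for r in range(r1, r2 + 1)
                                       for c in range(c1, c2 + 1))
                    by_fingerprint[colors].append((r1, c1, r2, c2))
    return by_fingerprint

M = [[1, 2],
     [2, 3]]
for f, locs in fingerprints_of_all_rectangles(M).items():
    print(sorted(f), locs)
```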
In this paper we describe algorithms for computing the BWT and for building (compressed) indexes in external memory. The innovative feature of our algorithms is that they are lightweight in the sense that, for an input of size $n$, they use only $n$ bits of disk working space while all previous approaches use $\Theta(n\log n)$ bits of disk working space. Moreover, our algorithms access disk data only via sequential scans, thus they take full advantage of modern disk features that make sequential disk accesses much faster than random accesses. We also present a scan-based algorithm for inverting the BWT that uses $\Theta(n)$ bits of working space, and a lightweight \emph{internal-memory} algorithm for computing the BWT which is the fastest in the literature when the available working space is $o(n)$ bits. Finally, we prove \emph{lower} bounds on the complexity of computing and inverting the BWT via sequential scans in terms of the classic product: internal-memory space $\times$ number of passes over the disk data.
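For reference, here is a minimal in-memory Python sketch of the BWT and its inversion via naive rotation sorting; nothing here is external-memory or scan-based as in the abstract, and the sentinel character and function names are assumptions for illustration.

```python
def bwt(s, sentinel="$"):
    """Burrows-Wheeler transform via naive sorting of all rotations.
    Assumes the sentinel does not occur in s."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(L, sentinel="$"):
    """Invert the BWT by repeatedly prepending L to the sorted table (naive)."""
    table = [""] * len(L)
    for _ in range(len(L)):
        table = sorted(L[i] + table[i] for i in range(len(L)))
    row = next(r for r in table if r.endswith(sentinel))
    return row.rstrip(sentinel)

text = "banana"
b = bwt(text)
print(b)               # annb$aa
print(inverse_bwt(b))  # banana
```

Both routines use quadratic space, which is exactly what the lightweight external-memory algorithms in the abstract avoid.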
The field of succinct data structures has flourished over the last 16 years. Starting from the compressed suffix array (CSA) by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. Efficient string indexing and analysis in small space lies also at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis. We show that the BWT of a string $T \in \{1,\ldots,\sigma\}^n$ can be built in deterministic $O(n)$ time using just $O(n\log\sigma)$ bits of space, where $\sigma \leq n$. Within the same time and space budget, we can build an index based on the BWT that allows one to enumerate all the internal nodes of the suffix tree of $T$. Many fundamental string analysis problems can be mapped to such enumeration, and can thus be solved in deterministic $O(n)$ time and in $O(n\log\sigma)$ bits of space from the input string. We also show how to build many of the existing indexes based on the BWT, such as the CSA, the compressed suffix tree (CST), and the bidirectional BWT index, in randomized $O(n)$ time and in $O(n\log\sigma)$ bits of space. The previously fastest construction algorithms for BWT, CSA and CST, which used $O(n\log\sigma)$ bits of space, took $O(n\log\log\sigma)$ time for the first two structures, and $O(n\log^{\epsilon}n)$ time for the third, where $\epsilon$ is any positive constant. Contrary to the state of the art, our bidirectional BWT index supports every operation in constant time per element in its output.
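As a concrete example of the kind of BWT-based index discussed, here is a minimal Python sketch of FM-index-style backward search for counting pattern occurrences. It uses plain strings and dictionaries instead of the succinct $O(n\log\sigma)$-bit structures the abstract is about, and the helper names are illustrative.

```python
def build_fm_index(text, sentinel="$"):
    """Build the BWT and the C table for text (assumes sentinel not in text)."""
    s = text + sentinel
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # suffix array (naive)
    # BWT[i] is the character preceding suffix sa[i]; s[-1] handles sa[i] == 0.
    bwt = "".join(s[i - 1] for i in sa)
    # C[c] = number of characters in s strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(s)):
        C[c] = total
        total += s.count(c)
    return bwt, C

def rank(bwt, c, i):
    """Occurrences of c in bwt[0:i] (linear-time stand-in for a wavelet tree)."""
    return bwt[:i].count(c)

def count_occurrences(bwt, C, pattern):
    """Backward search: number of occurrences of pattern in the indexed text."""
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + rank(bwt, c, lo)
        hi = C[c] + rank(bwt, c, hi)
        if lo >= hi:
            return 0
    return hi - lo

bwt, C = build_fm_index("mississippi")
print(count_occurrences(bwt, C, "issi"))  # 2
print(count_occurrences(bwt, C, "iss"))   # 2
print(count_occurrences(bwt, C, "x"))     # 0
```

Replacing the linear-time rank helper with a succinct rank structure is what gives FM-index-style indexes their small space and fast queries.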