
String Indexing with Compressed Patterns

Posted by: Teresa Anna Steiner
Publication date: 2019
Research field: Informatics Engineering
Paper language: English





Given a string $S$ of length $n$, the classic string indexing problem is to preprocess $S$ into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv compression scheme. Along the way we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern.
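To make the client-server setting concrete, here is a minimal, illustrative sketch (not the paper's data structure) of how a pattern could be compressed on the client side with a greedy LZ77-style parse before being sent as a query. The function name lz77_parse and the quadratic-time parse are assumptions made for illustration only; the paper's contribution is how the server processes such a compressed query without decompressing it.

def lz77_parse(pattern):
    """Greedy LZ77 parse of a pattern: each phrase is either a single
    literal character or ('copy', start, length) referring to an earlier
    occurrence in the pattern itself. Quadratic-time sketch."""
    phrases = []
    i, n = 0, len(pattern)
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):                      # candidate earlier start
            l = 0
            while i + l < n and pattern[j + l] == pattern[i + l]:
                l += 1
            if l > best_len:
                best_len, best_src = l, j
        if best_len == 0:
            phrases.append(('lit', pattern[i]))  # fresh character
            i += 1
        else:
            phrases.append(('copy', best_src, best_len))
            i += best_len
    return phrases

# The compressed query sent to the server is the phrase list, which can be
# much shorter than the pattern itself.
print(lz77_parse("abababababc"))  # [('lit','a'), ('lit','b'), ('copy',0,8), ('lit','c')]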


Read also

The field of succinct data structures has flourished over the last 16 years. Starting from the compressed suffix array (CSA) by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. Efficient string indexing and analysis in small space lies also at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis. We show that the BWT of a string $T \in \{1,\ldots,\sigma\}^n$ can be built in deterministic $O(n)$ time using just $O(n\log\sigma)$ bits of space, where $\sigma \leq n$. Within the same time and space budget, we can build an index based on the BWT that allows one to enumerate all the internal nodes of the suffix tree of $T$. Many fundamental string analysis problems can be mapped to such enumeration, and can thus be solved in deterministic $O(n)$ time and in $O(n\log\sigma)$ bits of space from the input string. We also show how to build many of the existing indexes based on the BWT, such as the CSA, the compressed suffix tree (CST), and the bidirectional BWT index, in randomized $O(n)$ time and in $O(n\log\sigma)$ bits of space. The previously fastest construction algorithms for BWT, CSA and CST, which used $O(n\log\sigma)$ bits of space, took $O(n\log\log\sigma)$ time for the first two structures, and $O(n\log^{\epsilon}n)$ time for the third, where $\epsilon$ is any positive constant. Contrary to the state of the art, our bidirectional BWT index supports every operation in constant time per element in its output.
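For readers unfamiliar with the transform underlying these indexes, the following is a naive sketch of the Burrows-Wheeler transform itself, built by sorting all rotations. This toy construction uses quadratic space and is only meant to show what the BWT is; it does not attempt the linear-time, $O(n\log\sigma)$-bit construction reported above.

def bwt(text, sentinel='$'):
    """Naive Burrows-Wheeler transform via sorting all rotations of
    text + sentinel. For illustration only; not the small-space,
    linear-time construction discussed in the abstract."""
    s = text + sentinel                           # unique end marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(rot[-1] for rot in rotations)  # last column

print(bwt("mississippi"))   # -> 'ipssm$pissii'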
The recent introduction of learned indexes has shaken the foundations of the decades-old field of indexing data structures. Combining, or even replacing, classic design elements such as B-tree nodes with machine learning models has proven to give outstanding improvements in the space footprint and time efficiency of data systems. However, these novel approaches are based on heuristics, thus they lack any guarantees both in their time and space requirements. We propose the Piecewise Geometric Model index (shortly, PGM-index), which achieves guaranteed I/O-optimality in query operations, learns an optimal number of linear models, and its peculiar recursive construction makes it a purely learned data structure, rather than a hybrid of traditional and learned indexes (such as RMI and FITing-tree). We show that the PGM-index improves the space of the FITing-tree by 63.3% and of the B-tree by more than four orders of magnitude, while achieving their same or even better query time efficiency. We complement this result by proposing three variants of the PGM-index. First, we design a compressed PGM-index that further reduces its space footprint by exploiting the repetitiveness at the level of the learned linear models it is composed of. Second, we design a PGM-index that adapts itself to the distribution of the queries, thus resulting in the first known distribution-aware learned index to date. Finally, given its flexibility in the offered space-time trade-offs, we propose the multicriteria PGM-index that efficiently auto-tunes itself in a few seconds over hundreds of millions of keys to the possibly evolving space-time constraints imposed by the application of use. We remark to the reader that this paper is an extended and improved version of our previous paper titled Superseding traditional indexes by orchestrating learning and geometry (arXiv:1903.00507).
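As a rough illustration of the learned-index idea behind the PGM-index (not its actual construction), the sketch below fits a single linear model to a sorted key array, records its maximum prediction error, and answers lookups by binary searching only inside the error window around the prediction. The real PGM-index instead computes an optimal piecewise linear epsilon-approximation and applies the idea recursively; everything here is a simplified assumption for exposition.

import bisect

def build_linear_model(keys):
    """Fit one line position ~ slope*key + intercept through the first and
    last key and record the maximum prediction error. A toy stand-in for a
    single segment of a learned index."""
    n = len(keys)
    slope = (n - 1) / (keys[-1] - keys[0]) if keys[-1] != keys[0] else 0.0
    intercept = -slope * keys[0]
    err = max(abs((slope * k + intercept) - i) for i, k in enumerate(keys))
    return slope, intercept, int(err) + 1

def lookup(keys, model, key):
    """Predict the position, then binary search only inside the
    error window around the prediction."""
    slope, intercept, err = model
    guess = int(slope * key + intercept)
    lo = max(0, guess - err)
    hi = min(len(keys), guess + err + 1)
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else -1

keys = [2, 3, 5, 8, 13, 21, 34, 55, 89, 144]
model = build_linear_model(keys)
print(lookup(keys, model, 34))   # -> 6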
The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant of string indexing, where the goal is to compactly represent the string such that given two patterns $P_1$ and $P_2$ and a gap range $[\alpha,\beta]$ we can quickly find the consecutive occurrences of $P_1$ and $P_2$ with distance in $[\alpha,\beta]$, i.e., pairs of occurrences immediately following each other and with distance within the range. We present data structures that use $\tilde{O}(n)$ space and query time $\tilde{O}(|P_1|+|P_2|+n^{2/3})$ for existence and counting and $\tilde{O}(|P_1|+|P_2|+n^{2/3}\,\mathrm{occ}^{1/3})$ for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using $\tilde{O}(n)$ space must use $\tilde{\Omega}(|P_1|+|P_2|+\sqrt{n})$ query time. To obtain our results we develop new techniques and ideas of independent interest including a new suffix tree decomposition and hardness of a variant of the set intersection problem.
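The following naive baseline only serves to make the query semantics concrete: it scans the text for both patterns and reports consecutive occurrence pairs whose distance lies in the given range. It uses no index at all, and it assumes one reading of "consecutive" (no occurrence of either pattern strictly in between); the data structures above answer the same query in the stated sublinear bounds after preprocessing.

def find_occurrences(s, p):
    """All starting positions of pattern p in string s (naive scan)."""
    return [i for i in range(len(s) - len(p) + 1) if s[i:i + len(p)] == p]

def consecutive_occurrences(s, p1, p2, alpha, beta):
    """Pairs (i, j): i an occurrence of p1, j an occurrence of p2,
    no occurrence of either pattern strictly between them,
    and distance j - i within [alpha, beta]."""
    events = sorted([(i, 1) for i in find_occurrences(s, p1)] +
                    [(j, 2) for j in find_occurrences(s, p2)])
    return [(i, j) for (i, a), (j, b) in zip(events, events[1:])
            if a == 1 and b == 2 and alpha <= j - i <= beta]

print(consecutive_occurrences("abcxxabzzzc", "ab", "c", 1, 5))  # [(0, 2), (5, 10)]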
In the 3SUM-Indexing problem the goal is to preprocess two lists of elements from $U$, $A=(a_1,a_2,\ldots,a_n)$ and $B=(b_1,b_2,\ldots,b_n)$, such that given an element $c\in U$ one can quickly determine whether there exists a pair $(a,b)\in A\times B$ where $a+b=c$. Goldstein et al. [WADS 2017] conjectured that there is no algorithm for 3SUM-Indexing which uses $n^{2-\Omega(1)}$ space and $n^{1-\Omega(1)}$ query time. We show that the conjecture is false by reducing the 3SUM-Indexing problem to the problem of inverting functions, and then applying an algorithm of Fiat and Naor [SICOMP 1999] for inverting functions.
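For contrast with the function-inversion approach of Fiat and Naor (which is not sketched here), this is the brute-force corner of the 3SUM-Indexing trade-off: precompute all pairwise sums in a dictionary, using $\Theta(n^2)$ space for constant expected query time. It is exactly the kind of trade-off the result above improves on.

def preprocess(A, B):
    """Brute-force 3SUM-Indexing baseline: store every pairwise sum with a
    witness pair. Theta(n^2) space, O(1) expected query time."""
    table = {}
    for a in A:
        for b in B:
            table.setdefault(a + b, (a, b))
    return table

def query(table, c):
    """Return a witness pair (a, b) with a + b = c, or None."""
    return table.get(c)

A = [1, 4, 9, 16]
B = [2, 3, 5, 7]
T = preprocess(A, B)
print(query(T, 21))   # -> (16, 5)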
We aim to study the set of color sets of continuous regions of an image given as a matrix of $m$ rows over $n \geq m$ columns where each element in the matrix is an integer from $[1,\sigma]$ named a \emph{color}. The set of distinct colors in a region is called its fingerprint. We aim to compute, index and query the fingerprints of all rectangular regions named rectangles. The set of all such fingerprints is denoted by $\mathcal{F}$. A rectangle is \emph{maximal} if it is not contained in a greater rectangle with the same fingerprint. The set of all locations of maximal rectangles is denoted by $\mathcal{L}$. We first explain how to determine all the $|\mathcal{L}|$ maximal locations with their fingerprints in expected time $O(nm^2\sigma)$ using a Monte Carlo algorithm (with polynomially small probability of error) or within deterministic $O(nm^2\sigma\log(\frac{|\mathcal{L}|}{nm^2}+2))$ time. We then show how to build a data structure which occupies $O(nm\log n+|\mathcal{L}|)$ space such that a query which asks for all the maximal locations with a given fingerprint $f$ can be answered in time $O(|f|+\log\log n+k)$, where $k$ is the number of maximal locations with fingerprint $f$. If the query asks only for the presence of the fingerprint, then the space usage becomes $O(nm\log n+|\mathcal{F}|)$ while the query time becomes $O(|f|+\log\log n)$. We eventually consider the special case of squared regions (squares).
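To pin down the terminology, the sketch below computes the fingerprint (set of distinct colors) of a given rectangle by a naive scan and checks maximality directly from the definition; one-cell extensions suffice because fingerprints are monotone under containment. It is purely illustrative and nowhere near the stated bounds.

def fingerprint(M, top, left, bottom, right):
    """Set of distinct colors in M[top..bottom][left..right] (inclusive)."""
    return {M[r][c] for r in range(top, bottom + 1)
                    for c in range(left, right + 1)}

def is_maximal(M, top, left, bottom, right):
    """A rectangle is maximal if no one-cell extension keeps the same
    fingerprint (equivalent to the containment definition above)."""
    f = fingerprint(M, top, left, bottom, right)
    rows, cols = len(M), len(M[0])
    extensions = [(top - 1, left, bottom, right), (top, left - 1, bottom, right),
                  (top, left, bottom + 1, right), (top, left, bottom, right + 1)]
    for t, l, b, r in extensions:
        if 0 <= t and 0 <= l and b < rows and r < cols:
            if fingerprint(M, t, l, b, r) == f:
                return False
    return True

M = [[1, 2, 1],
     [2, 1, 3],
     [1, 3, 3]]
print(fingerprint(M, 0, 0, 1, 1))   # {1, 2}
print(is_maximal(M, 0, 0, 1, 1))    # True: any extension adds color 3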