Internal Pattern Matching Queries in a Text and Applications


Published by: Tomasz Kociumaka

Publication date: 2013

Research field: Informatics Engineering

Language: English

Authors: Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter

Data Structures and Algorithms


We consider several types of internal queries: questions about subwords of a text. As the main tool we develop an optimal data structure for the problem called here internal pattern matching. This data structure provides constant-time answers to queries about occurrences of one subword $x$ in another subword $y$ of a given text, assuming that $|y|=\mathcal{O}(|x|)$, which allows for a constant-space representation of all occurrences. This problem can be viewed as a natural extension of the well-studied pattern matching problem. The data structure has linear size and admits a linear-time construction algorithm. Using the solution to the internal pattern matching problem, we obtain very efficient data structures answering queries about: primitivity of subwords, periods of subwords, general substring compression, and cyclic equivalence of two subwords. All these results improve upon the best previously known counterparts. The linear construction time of our data structure also allows us to improve the algorithm for finding $\delta$-subrepetitions in a text (a more general version of maximal repetitions, also called runs). For any fixed $\delta$ we obtain the first linear-time algorithm, which matches the linear time complexity of the algorithm computing runs. Our data structure has already been used as a part of efficient solutions for subword suffix rank & selection, as well as substring compression using the Burrows-Wheeler transform composed with run-length encoding.
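To make the query in the abstract concrete, the following is a brute-force sketch, not the paper's data structure: it answers an internal pattern matching query by direct comparison and then shows the compact encoding of the answer. The function names `occurrences` and `as_progression`, the 0-based half-open ranges, and the sample text are all assumptions chosen for illustration. The sketch relies on the standard fact that when $|y| < 2|x|$ the occurrences of $x$ in $y$ form a single arithmetic progression, which is what makes a constant-space answer possible.

```python
def occurrences(text, i, j, a, b):
    """Start positions (in text) of x = text[i:j] inside y = text[a:b].

    Brute force: compares x against every window of y. The paper's
    structure answers this in constant time; this sketch does not.
    """
    x = text[i:j]
    m = len(x)
    return [s for s in range(a, b - m + 1) if text[s:s + m] == x]


def as_progression(positions):
    """Encode a sorted position list as (first, difference, count).

    Returns None if the list is not an arithmetic progression.
    """
    if not positions:
        return (None, 0, 0)
    if len(positions) == 1:
        return (positions[0], 0, 1)
    d = positions[1] - positions[0]
    if any(q - p != d for p, q in zip(positions, positions[1:])):
        return None
    return (positions[0], d, len(positions))


text = "abababab"
# x = text[0:4] = "abab", y = text[0:7] = "abababa", so |y| < 2|x|.
hits = occurrences(text, 0, 4, 0, 7)
print(hits, as_progression(hits))
```

Three integers describe the whole answer regardless of how many occurrences there are, which is the sense in which the representation takes constant space.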


Read also

Pattern Matching in Multiple Streams

Raphael Clifford, Markus Jalsenius, Ely Porat (2012)

We investigate the problem of deterministic pattern matching in multiple streams. In this model, one symbol arrives at a time and is associated with one of s streaming texts. The task at each time step is to report if there is a new match between a fixed pattern of length m and a newly updated stream. As is usual in the streaming context, the goal is to use as little space as possible while still reporting matches quickly. We give almost matching upper and lower space bounds for three distinct pattern matching problems. For exact matching we show that the problem can be solved in constant time per arriving symbol and O(m+s) words of space. For the k-mismatch and k-difference problems we give O(k) time solutions that require O(m+ks) words of space. In all three cases we also give space lower bounds which show our methods are optimal up to a single logarithmic factor. Finally we set out a number of open problems related to this new model for pattern matching.

Data Structures and Algorithms

Internal Shortest Absent Word Queries in Constant Time and Linear Space

Golnaz Badkobeh, Panagiotis Charalampopoulos, Dmitry Kosolobov (2021)

Given a string $T$ of length $n$ over an alphabet $\Sigma\subset \{1,2,\ldots,n^{O(1)}\}$ of size $\sigma$, we are to preprocess $T$ so that given a range $[i,j]$, we can return a representation of a shortest string over $\Sigma$ that is absent in the fragment $T[i]\cdots T[j]$ of $T$. We present an $O(n)$-space data structure that answers such queries in constant time and can be constructed in $O(n\log_\sigma n)$ time.
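The query defined above can be illustrated with a brute-force sketch. This is not the constant-time structure from the paper: it simply enumerates candidate words by increasing length. The function name `shortest_absent_word`, the 0-based inclusive range, and the choice to return the lexicographically first shortest absent word are assumptions made for this example; a nonempty alphabet is also assumed.

```python
from itertools import product


def shortest_absent_word(T, i, j, alphabet):
    """Return a shortest word over `alphabet` that does not occur in T[i..j].

    Brute force over lengths k = 1, 2, ...; the range [i, j] is 0-based
    and inclusive (an assumption of this sketch). `alphabet` must be
    nonempty, otherwise the loop never terminates.
    """
    frag = T[i:j + 1]
    k = 1
    while True:
        # All distinct substrings of length k in the fragment.
        present = {frag[s:s + k] for s in range(len(frag) - k + 1)}
        # Candidates in lexicographic order; the first missing one is returned.
        for letters in product(sorted(alphabet), repeat=k):
            w = "".join(letters)
            if w not in present:
                return w
        k += 1


print(shortest_absent_word("abaab", 0, 4, "ab"))
```

The loop always terminates: a fragment of length $\ell$ has no substring of length $\ell+1$, so some word of length at most $\ell+1$ is absent.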

Data Structures and Algorithms

Pattern Matching under Polynomial Transformation

Ayelet Butman, Peter Clifford, Raphael Clifford (2011)

We consider a class of pattern matching problems where a normalising transformation is applied at every alignment. Normalised pattern matching plays a key role in fields as diverse as image processing and musical information processing where application-specific transformations are often applied to the input. By considering the class of polynomial transformations of the input, we provide fast algorithms and the first lower bounds for both new and old problems. Given a pattern of length m and a longer text of length n where both are assumed to contain integer values only, we first show O(n log m) time algorithms for pattern matching under linear transformations even when wildcard symbols can occur in the input. We then show how to extend the technique to polynomial transformations of arbitrary degree. Next we consider the problem of finding the minimum Hamming distance under polynomial transformation. We show that, for any epsilon>0, there cannot exist an O(n m^(1-epsilon)) time algorithm for additive and linear transformations conditional on the hardness of the classic 3SUM problem. Finally, we consider a version of the Hamming distance problem under additive transformations with a bound k on the maximum distance that need be reported. We give a deterministic O(nk log k) time solution which we then improve by careful use of randomisation to O(n sqrt(k log k) log n) time for sufficiently small k. Our randomised solution outputs the correct answer at every position with high probability.

Data Structures and Algorithms

Stochastic Matching with Few Queries: New Algorithms and Tools

Soheil Behnezhad, Alireza Farhadi, MohammadTaghi Hajiaghayi, Nima Reyhani (2018)

We consider the following stochastic matching problem on both weighted and unweighted graphs: A graph $G(V, E)$ along with a parameter $p \in (0, 1)$ is given in the input. Each edge of $G$ is realized independently with probability $p$. The goal is to select a degree bounded (dependent only on $p$) subgraph $H$ of $G$ such that the expected size/weight of maximum realized matching of $H$ is close to that of $G$. This model of stochastic matching has attracted significant attention in recent years due to its various applications. The most fundamental open question is the best approximation factor achievable for such algorithms that, in the literature, are referred to as non-adaptive algorithms. Prior work has identified breaking (near) half-approximation as a barrier for both weighted and unweighted graphs. Our main results are as follows: -- We analyze a simple and clean algorithm and show that for unweighted graphs, it finds an (almost) $4\sqrt{2}-5$ ($\approx 0.6568$) approximation by querying $O(\frac{\log (1/p)}{p})$ edges per vertex. This improves over the state-of-the-art $0.5001$ approximate algorithm of Assadi et al. [EC17]. -- We show that the same algorithm achieves a $0.501$ approximation for weighted graphs by querying $O(\frac{\log (1/p)}{p})$ edges per vertex. This is the first algorithm to break the $0.5$ approximation barrier for weighted graphs. It also improves the per-vertex queries of the state-of-the-art by Yamaguchi and Maehara [SODA18] and Behnezhad and Reyhani [EC18]. Our algorithms are fundamentally different from prior works, yet are very simple and natural. For the analysis, we introduce a number of procedures that construct heavy fractional matchings. We consider the new algorithms and our analytical tools to be the main contributions of this paper.

Data Structures and Algorithms

Optimal Space and Time for Streaming Pattern Matching

Tung Mai, Anup Rao, Ryan A. Rossi (2021)

In this work, we study longest common substring, pattern matching, and wildcard pattern matching in the asymmetric streaming model. In this streaming model, we have random access to one string and streaming access to the other one. We present streaming algorithms with provable guarantees for these three fundamental problems. In particular, our algorithms for pattern matching improve the upper bound and beat the unconditional lower bounds on the memory of randomized and deterministic streaming algorithms. In addition to this, we present algorithms for wildcard pattern matching in the asymmetric streaming model that have optimal space and time.

Data Structures and Algorithms
