Optimal Space and Time for Streaming Pattern Matching

141 0 0.0 ( 0 )

Download Cite

Added by Ryan Rossi

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Tung Mai - Anup Rao - Ryan A. Rossi

Data Structures and Algorithms

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

In this work, we study longest common substring, pattern matching, and wildcard pattern matching in the asymmetric streaming model. In this streaming model, we have random access to one string and streaming access to the other one. We present streaming algorithms with provable guarantees for these three fundamental problems. In particular, our algorithms for pattern matching improve the upper bound and beat the unconditional lower bounds on the memory of randomized and deterministic streaming algorithms. In addition to this, we present algorithms for wildcard pattern matching in the asymmetric streaming model that have optimal space and time.

rate research

Streaming Pattern Matching with d Wildcards

72 - Shay Golan , Tsvi Kopelowitz , Ely Porat 2017

In the pattern matching with $d$ wildcards problem one is given a text $T$ of length $n$ and a pattern $P$ of length $m$ that contains $d$ wildcard characters, each denoted by a special symbol $?$. A wildcard character matches any other character. The goal is to establish for each $m$-length substring of $T$ whether it matches $P$. In the streaming model variant of the pattern matching with $d$ wildcards problem the text $T$ arrives one character at a time and the goal is to report, before the next character arrives, if the last $m$ characters match $P$ while using only $o(m)$ words of space. In this paper we introduce two new algorithms for the $d$ wildcard pattern matching problem in the streaming model. The first is a randomized Monte Carlo algorithm that is parameterized by a constant $0leq delta leq 1$. This algorithm uses $tilde{O}(d^{1-delta})$ amortized time per character and $tilde{O}(d^{1+delta})$ words of space. The second algorithm, which is used as a black box in the first algorithm, is a randomized Monte Carlo algorithm which uses $O(d+log m)$ worst-case time per character and $O(dlog m)$ words of space.

Data Structures and Algorithms

Time-Optimal Sublinear Algorithms for Matching and Vertex Cover

164 - Soheil Behnezhad 2021

We present a near-tight analysis of the average query complexity -- `a la Nguyen and Onak [FOCS08] -- of the randomized greedy maximal matching algorithm, improving over the bound of Yoshida, Yamamoto and Ito [STOC09]. For any $n$-vertex graph of average degree $bar{d}$, this leads to the following sublinear-time algorithms for estimating the size of maximum matching and minimum vertex cover, all of which are provably time-optimal up to logarithmic factors: $bullet$ A multiplicative $(2+epsilon)$-approximation in $widetilde{O}(n/epsilon^2)$ time using adjacency list queries. This (nearly) matches an $Omega(n)$ time lower bound for any multiplicative approximation and is, notably, the first $O(1)$-approximation that runs in $o(n^{1.5})$ time. $bullet$ A $(2, epsilon n)$-approximation in $widetilde{O}((bar{d} + 1)/epsilon^2)$ time using adjacency list queries. This (nearly) matches an $Omega(bar{d}+1)$ lower bound of Parnas and Ron [TCS07] which holds for any $(O(1), epsilon n)$-approximation, and improves over the bounds of [Yoshida et al. STOC09; Onak et al. SODA12] and [Kapralov et al. SODA20]: The former two take at least quadratic time in the degree which can be as large as $Omega(n^2)$ and the latter obtains a much larger approximation. $bullet$ A $(2, epsilon n)$-approximation in $widetilde{O}(n/epsilon^3)$ time using adjacency matrix queries. This (nearly) matches an $Omega(n)$ time lower bound in this model and improves over the $widetilde{O}(nsqrt{n})$-time $(2, epsilon n)$-approximate algorithm of [Chen, Kannan, and Khanna ICALP20]. It also turns out that any non-trivial multiplicative approximation in the adjacency matrix model requires $Omega(n^2)$ time, so the additive $epsilon n$ error is necessary too. As immediate corollaries, we get improved sublinear time estimators for (variants of) TSP and an improved AMPC algorithm for maximal matching.

Data Structures and Algorithms Distributed Parallel and Cluster Computing

Approximate Online Pattern Matching in Sub-linear Time

90 - Diptarka Chakraborty , Debarati Das , Michal Koucky 2018

We consider the approximate pattern matching problem under edit distance. In this problem we are given a pattern $P$ of length $w$ and a text $T$ of length $n$ over some alphabet $Sigma$, and a positive integer $k$. The goal is to find all the positions $j$ in $T$ such that there is a substring of $T$ ending at $j$ which has edit distance at most $k$ from the pattern $P$. Recall, the edit distance between two strings is the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. For a position $t$ in ${1,...,n}$, let $k_t$ be the smallest edit distance between $P$ and any substring of $T$ ending at $t$. In this paper we give a constant factor approximation to the sequence $k_1,k_2,...,k_{n}$. We consider both offline and online settings. In the offline setting, where both $P$ and $T$ are available, we present an algorithm that for all $t$ in ${1,...,n}$, computes the value of $k_t$ approximately within a constant factor. The worst case running time of our algorithm is $O(n w^{3/4})$. As a consequence we break the $O(nw)$-time barrier for this problem. In the online setting, we are given $P$ and then $T$ arrives one symbol at a time. We design an algorithm that upon arrival of the $t$-th symbol of $T$ computes $k_t$ approximately within $O(1)$-multiplicative factor and $w^{8/9}$-additive error. Our algorithm takes $O(w^{1-(7/54)})$ amortized time per symbol arrival and takes $O(w^{1-(1/54)})$ additional space apart from storing the pattern $P$. Both of our algorithms are randomized and produce correct answer with high probability. To the best of our knowledge this is the first worst-case sub-linear (in the length of the pattern) time and sub-linear succinct space algorithm for online approximate pattern matching problem.

Data Structures and Algorithms

The Streaming k-Mismatch Problem: Tradeoffs between Space and Total Time

79 - Shay Golan , Tomasz Kociumaka , Tsvi Kopelowitz 2020

We revisit the $k$-mismatch problem in the streaming model on a pattern of length $m$ and a streaming text of length $n$, both over a size-$sigma$ alphabet. The current state-of-the-art algorithm for the streaming $k$-mismatch problem, by Clifford et al. [SODA 2019], uses $tilde O(k)$ space and $tilde Obig(sqrt kbig)$ worst-case time per character. The space complexity is known to be (unconditionally) optimal, and the worst-case time per character matches a conditional lower bound. However, there is a gap between the total time cost of the algorithm, which is $tilde O(nsqrt k)$, and the fastest known offline algorithm, which costs $tilde Obig(n + minbig(frac{nk}{sqrt m},sigma nbig)big)$ time. Moreover, it is not known whether improvements over the $tilde O(nsqrt k)$ total time are possible when using more than $O(k)$ space. We address these gaps by designing a randomized streaming algorithm for the $k$-mismatch problem that, given an integer parameter $kle s le m$, uses $tilde O(s)$ space and costs $tilde Obig(n+minbig(frac {nk^2}m,frac{nk}{sqrt s},frac{sigma nm}sbig)big)$ total time. For $s=m$, the total runtime becomes $tilde Obig(n + minbig(frac{nk}{sqrt m},sigma nbig)big)$, which matches the time cost of the fastest offline algorithm. Moreover, the worst-case time cost per character is still $tilde Obig(sqrt kbig)$.

Data Structures and Algorithms

Beating Two-Thirds For Random-Order Streaming Matching

163 - Sepehr Assadi , Soheil Behnezhad 2021

We study the maximum matching problem in the random-order semi-streaming setting. In this problem, the edges of an arbitrary $n$-vertex graph $G=(V, E)$ arrive in a stream one by one and in a random order. The goal is to have a single pass over the stream, use $n cdot poly(log n)$ space, and output a large matching of $G$. We prove that for an absolute constant $epsilon_0 > 0$, one can find a $(2/3 + epsilon_0)$-approximate maximum matching of $G$ using $O(n log n)$ space with high probability. This breaks the natural boundary of $2/3$ for this problem prevalent in the prior work and resolves an open problem of Bernstein [ICALP20] on whether a $(2/3 + Omega(1))$-approximation is achievable.

Data Structures and Algorithms