Approximate Online Pattern Matching in Sub-linear Time

91 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Diptarka Chakraborty

تاريخ النشر 2018

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Diptarka Chakraborty - Debarati Das - Michal Koucky

بنى وهياكل البيانات والخوارزميات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We consider the approximate pattern matching problem under edit distance. In this problem we are given a pattern $P$ of length $w$ and a text $T$ of length $n$ over some alphabet $Sigma$, and a positive integer $k$. The goal is to find all the positions $j$ in $T$ such that there is a substring of $T$ ending at $j$ which has edit distance at most $k$ from the pattern $P$. Recall, the edit distance between two strings is the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. For a position $t$ in ${1,...,n}$, let $k_t$ be the smallest edit distance between $P$ and any substring of $T$ ending at $t$. In this paper we give a constant factor approximation to the sequence $k_1,k_2,...,k_{n}$. We consider both offline and online settings. In the offline setting, where both $P$ and $T$ are available, we present an algorithm that for all $t$ in ${1,...,n}$, computes the value of $k_t$ approximately within a constant factor. The worst case running time of our algorithm is $O(n w^{3/4})$. As a consequence we break the $O(nw)$-time barrier for this problem. In the online setting, we are given $P$ and then $T$ arrives one symbol at a time. We design an algorithm that upon arrival of the $t$-th symbol of $T$ computes $k_t$ approximately within $O(1)$-multiplicative factor and $w^{8/9}$-additive error. Our algorithm takes $O(w^{1-(7/54)})$ amortized time per symbol arrival and takes $O(w^{1-(1/54)})$ additional space apart from storing the pattern $P$. Both of our algorithms are randomized and produce correct answer with high probability. To the best of our knowledge this is the first worst-case sub-linear (in the length of the pattern) time and sub-linear succinct space algorithm for online approximate pattern matching problem.

قيم البحث

اقرأ أيضاً

Set Cover in Sub-linear Time

172 - Piotr Indyk , Sepideh Mahabadi , Ronitt Rubinfeld 2019

We study the classic set cover problem from the perspective of sub-linear algorithms. Given access to a collection of $m$ sets over $n$ elements in the query model, we show that sub-linear algorithms derived from existing techniques have almost tight query complexities. On one hand, first we show an adaptation of the streaming algorithm presented in Har-Peled et al. [2016] to the sub-linear query model, that returns an $alpha$-approximate cover using $tilde{O}(m(n/k)^{1/(alpha-1)} + nk)$ queries to the input, where $k$ denotes the value of a minimum set cover. We then complement this upper bound by proving that for lower values of $k$, the required number of queries is $tilde{Omega}(m(n/k)^{1/(2alpha)})$, even for estimating the optimal cover size. Moreover, we prove that even checking whether a given collection of sets covers all the elements would require $Omega(nk)$ queries. These two lower bounds provide strong evidence that the upper bound is almost tight for certain values of the parameter $k$. On the other hand, we show that this bound is not optimal for larger values of the parameter $k$, as there exists a $(1+varepsilon)$-approximation algorithm with $tilde{O}(mn/kvarepsilon^2)$ queries. We show that this bound is essentially tight for sufficiently small constant $varepsilon$, by establishing a lower bound of $tilde{Omega}(mn/k)$ query complexity.

بنى وهياكل البيانات والخوارزميات الرياضيات المتقطعة

Optimal Space and Time for Streaming Pattern Matching

140 - Tung Mai , Anup Rao , Ryan A. Rossi 2021

In this work, we study longest common substring, pattern matching, and wildcard pattern matching in the asymmetric streaming model. In this streaming model, we have random access to one string and streaming access to the other one. We present streami ng algorithms with provable guarantees for these three fundamental problems. In particular, our algorithms for pattern matching improve the upper bound and beat the unconditional lower bounds on the memory of randomized and deterministic streaming algorithms. In addition to this, we present algorithms for wildcard pattern matching in the asymmetric streaming model that have optimal space and time.

بنى وهياكل البيانات والخوارزميات

Cell-Probe Bounds for Online Edit Distance and Other Pattern Matching Problems

153 - Raphael Clifford , Markus Jalsenius , Benjamin Sach 2014

We give cell-probe bounds for the computation of edit distance, Hamming distance, convolution and longest common subsequence in a stream. In this model, a fixed string of $n$ symbols is given and one $delta$-bit symbol arrives at a time in a stream. After each symbol arrives, the distance between the fixed string and a suffix of most recent symbols of the stream is reported. The cell-probe model is perhaps the strongest model of computation for showing data structure lower bounds, subsuming in particular the popular word-RAM model. * We first give an $Omega((delta log n)/(w+loglog n))$ lower bound for the time to give each output for both online Hamming distance and convolution, where $w$ is the word size. This bound relies on a new encoding scheme and for the first time holds even when $w$ is as small as a single bit. * We then consider the online edit distance and longest common subsequence problems in the bit-probe model ($w=1$) with a constant sized input alphabet. We give a lower bound of $Omega(sqrt{log n}/(loglog n)^{3/2})$ which applies for both problems. This second set of results relies both on our new encoding scheme as well as a carefully constructed hard distribution. * Finally, for the online edit distance problem we show that there is an $O((log n)^2/w)$ upper bound in the cell-probe model. This bound gives a contrast to our new lower bound and also establishes an exponential gap between the known cell-probe and RAM model complexities.

بنى وهياكل البيانات والخوارزميات

Pattern Matching in Multiple Streams

179 - Raphael Clifford , Markus Jalsenius , Ely Porat 2012

We investigate the problem of deterministic pattern matching in multiple streams. In this model, one symbol arrives at a time and is associated with one of s streaming texts. The task at each time step is to report if there is a new match between a f ixed pattern of length m and a newly updated stream. As is usual in the streaming context, the goal is to use as little space as possible while still reporting matches quickly. We give almost matching upper and lower space bounds for three distinct pattern matching problems. For exact matching we show that the problem can be solved in constant time per arriving symbol and O(m+s) words of space. For the k-mismatch and k-difference problems we give O(k) time solutions that require O(m+ks) words of space. In all three cases we also give space lower bounds which show our methods are optimal up to a single logarithmic factor. Finally we set out a number of open problems related to this new model for pattern matching.

بنى وهياكل البيانات والخوارزميات

Bipartite Matching in Nearly-linear Time on Moderately Dense Graphs

83 - Jan van den Brand , Yin-Tat Lee , Danupon Nanongkai 2020

We present an $tilde O(m+n^{1.5})$-time randomized algorithm for maximum cardinality bipartite matching and related problems (e.g. transshipment, negative-weight shortest paths, and optimal transport) on $m$-edge, $n$-node graphs. For maximum cardina lity bipartite matching on moderately dense graphs, i.e. $m = Omega(n^{1.5})$, our algorithm runs in time nearly linear in the input size and constitutes the first improvement over the classic $O(msqrt{n})$-time [Dinic 1970; Hopcroft-Karp 1971; Karzanov 1973] and $tilde O(n^omega)$-time algorithms [Ibarra-Moran 1981] (where currently $omegaapprox 2.373$). On sparser graphs, i.e. when $m = n^{9/8 + delta}$ for any constant $delta>0$, our result improves upon the recent advances of [Madry 2013] and [Liu-Sidford 2020b, 2020a] which achieve an $tilde O(m^{4/3+o(1)})$ runtime. We obtain these results by combining and advancing recent lines of research in interior point methods (IPMs) and dynamic graph algorithms. First, we simplify and improve the IPM of [v.d.Brand-Lee-Sidford-Song 2020], providing a general primal-dual IPM framework and new sampling-based techniques for handling infeasibility induced by approximate linear system solvers. Second, we provide a simple sublinear-time algorithm for detecting and sampling high-energy edges in electric flows on expanders and show that when combined with recent advances in dynamic expander decompositions, this yields efficient data structures for maintaining the iterates of both [v.d.Brand et al.] and our new IPMs. Combining this general machinery yields a simpler $tilde O(n sqrt{m})$ time algorithm for matching based on the logarithmic barrier function, and our state-of-the-art $tilde O(m+n^{1.5})$ time algorithm for matching based on the [Lee-Sidford 2014] barrier (as regularized in [v.d.Brand et al.]).

بنى وهياكل البيانات والخوارزميات التحسين والتحكم

سجل دخول لتتمكن من نشر تعليقات