Pattern Matching in Multiple Streams

463 0 0.0 ( 0 )

Download Cite

Added by Markus Jalsenius

Publication date 2012

fields Informatics Engineering

and research's language is English

Authors Raphael Clifford - Markus Jalsenius - Ely Porat

Data Structures and Algorithms

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We investigate the problem of deterministic pattern matching in multiple streams. In this model, one symbol arrives at a time and is associated with one of s streaming texts. The task at each time step is to report if there is a new match between a fixed pattern of length m and a newly updated stream. As is usual in the streaming context, the goal is to use as little space as possible while still reporting matches quickly. We give almost matching upper and lower space bounds for three distinct pattern matching problems. For exact matching we show that the problem can be solved in constant time per arriving symbol and O(m+s) words of space. For the k-mismatch and k-difference problems we give O(k) time solutions that require O(m+ks) words of space. In all three cases we also give space lower bounds which show our methods are optimal up to a single logarithmic factor. Finally we set out a number of open problems related to this new model for pattern matching.

rate research

Improved Bound for Matching in Random-Order Streams

60 - Aaron Bernstein 2020

We study the problem of computing an approximate maximum cardinality matching in the semi-streaming model when edges arrive in a emph{random} order. In the semi-streaming model, the edges of the input graph G = (V,E) are given as a stream e_1, ..., e_m, and the algorithm is allowed to make a single pass over this stream while using $O(n textrm{polylog}(n))$ space ($m = |E|$ and $n = |V|$). If the order of edges is adversarial, a simple single-pass greedy algorithm yields a $1/2$-approximation in $O(n)$ space; achieving a better approximation in adversarial streams remains an elusive open question. A line of recent work shows that one can improve upon the $1/2$-approximation if the edges of the stream arrive in a random order. The state of the art for this model is two-fold: Assadi et al. [SODA 2019] show how to compute a $2/3(sim.66)$-approximate matching, but the space requirement is $O(n^{1.5} textrm{polylog}(n))$. Very recently, Farhadi et al. [SODA 2020] presented an algorithm with the desired space usage of $O(n textrm{polylog}(n))$, but a worse approximation ratio of $6/11(sim.545)$, or $3/5(=.6)$ in bipartite graphs. In this paper, we present an algorithm that computes a $2/3(sim.66)$-approximate matching using only $O(n log(n))$ space, improving upon both results above. We also note that for adversarial streams, a lower bound of Kapralov [SODA 2013] shows that any algorithm that achieves a $1-1/e(sim.63)$-approximation requires $(n^{1+Omega(1/loglog(n))})$ space. Our result for random-order streams is the first to go beyond the adversarial-order lower bound, thus establishing that computing a maximum matching is provably easier in random-order streams.

Data Structures and Algorithms

Pattern Matching under Polynomial Transformation

668 - Ayelet Butman , Peter Clifford , Raphael Clifford 2011

We consider a class of pattern matching problems where a normalising transformation is applied at every alignment. Normalised pattern matching plays a key role in fields as diverse as image processing and musical information processing where application specific transformations are often applied to the input. By considering the class of polynomial transformations of the input, we provide fast algorithms and the first lower bounds for both new and old problems. Given a pattern of length m and a longer text of length n where both are assumed to contain integer values only, we first show O(n log m) time algorithms for pattern matching under linear transformations even when wildcard symbols can occur in the input. We then show how to extend the technique to polynomial transformations of arbitrary degree. Next we consider the problem of finding the minimum Hamming distance under polynomial transformation. We show that, for any epsilon>0, there cannot exist an O(n m^(1-epsilon)) time algorithm for additive and linear transformations conditional on the hardness of the classic 3SUM problem. Finally, we consider a version of the Hamming distance problem under additive transformations with a bound k on the maximum distance that need be reported. We give a deterministic O(nk log k) time solution which we then improve by careful use of randomisation to O(n sqrt(k log k) log n) time for sufficiently small k. Our randomised solution outputs the correct answer at every position with high probability.

Data Structures and Algorithms

Order-Preserving Pattern Matching Indeterminate Strings

300 - Diogo Costa , Luis M. S. Russo , Rui Henriques 2019

Given an indeterminate string pattern $p$ and an indeterminate string text $t$, the problem of order-preserving pattern matching with character uncertainties ($mu$OPPM) is to find all substrings of $t$ that satisfy one of the possible orderings defined by $p$. When the text and pattern are determinate strings, we are in the presence of the well-studied exact order-preserving pattern matching (OPPM) problem with diverse applications on time series analysis. Despite its relevance, the exact OPPM problem suffers from two major drawbacks: 1) the inability to deal with indetermination in the text, thus preventing the analysis of noisy time series; and 2) the inability to deal with indetermination in the pattern, thus imposing the strict satisfaction of the orders among all pattern positions. This paper provides the first polynomial algorithm to answer the $mu$OPPM problem when indetermination is observed on the pattern or text. Given two strings with length $m$ and $O(r)$ uncertain characters per string position, we show that the $mu$OPPM problem can be solved in $O(mrlg r)$ time when one string is indeterminate and $rinmathbb{N}^+$. Mappings into satisfiability problems are provided when indetermination is observed on both the pattern and the text, and results concerning the general problem complexity are presented as well, with $mu$OPPM problem proved to be NP-hard in general.

Data Structures and Algorithms

Streaming Pattern Matching with d Wildcards

72 - Shay Golan , Tsvi Kopelowitz , Ely Porat 2017

In the pattern matching with $d$ wildcards problem one is given a text $T$ of length $n$ and a pattern $P$ of length $m$ that contains $d$ wildcard characters, each denoted by a special symbol $?$. A wildcard character matches any other character. The goal is to establish for each $m$-length substring of $T$ whether it matches $P$. In the streaming model variant of the pattern matching with $d$ wildcards problem the text $T$ arrives one character at a time and the goal is to report, before the next character arrives, if the last $m$ characters match $P$ while using only $o(m)$ words of space. In this paper we introduce two new algorithms for the $d$ wildcard pattern matching problem in the streaming model. The first is a randomized Monte Carlo algorithm that is parameterized by a constant $0leq delta leq 1$. This algorithm uses $tilde{O}(d^{1-delta})$ amortized time per character and $tilde{O}(d^{1+delta})$ words of space. The second algorithm, which is used as a black box in the first algorithm, is a randomized Monte Carlo algorithm which uses $O(d+log m)$ worst-case time per character and $O(dlog m)$ words of space.

Data Structures and Algorithms

Approximate Online Pattern Matching in Sub-linear Time

90 - Diptarka Chakraborty , Debarati Das , Michal Koucky 2018

We consider the approximate pattern matching problem under edit distance. In this problem we are given a pattern $P$ of length $w$ and a text $T$ of length $n$ over some alphabet $Sigma$, and a positive integer $k$. The goal is to find all the positions $j$ in $T$ such that there is a substring of $T$ ending at $j$ which has edit distance at most $k$ from the pattern $P$. Recall, the edit distance between two strings is the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. For a position $t$ in ${1,...,n}$, let $k_t$ be the smallest edit distance between $P$ and any substring of $T$ ending at $t$. In this paper we give a constant factor approximation to the sequence $k_1,k_2,...,k_{n}$. We consider both offline and online settings. In the offline setting, where both $P$ and $T$ are available, we present an algorithm that for all $t$ in ${1,...,n}$, computes the value of $k_t$ approximately within a constant factor. The worst case running time of our algorithm is $O(n w^{3/4})$. As a consequence we break the $O(nw)$-time barrier for this problem. In the online setting, we are given $P$ and then $T$ arrives one symbol at a time. We design an algorithm that upon arrival of the $t$-th symbol of $T$ computes $k_t$ approximately within $O(1)$-multiplicative factor and $w^{8/9}$-additive error. Our algorithm takes $O(w^{1-(7/54)})$ amortized time per symbol arrival and takes $O(w^{1-(1/54)})$ additional space apart from storing the pattern $P$. Both of our algorithms are randomized and produce correct answer with high probability. To the best of our knowledge this is the first worst-case sub-linear (in the length of the pattern) time and sub-linear succinct space algorithm for online approximate pattern matching problem.

Data Structures and Algorithms