Order-Preserving Pattern Matching Indeterminate Strings


Added by Alexandre P Francisco

Publication date: 2019

Field: Informatics Engineering

Language: English

Authors: Diogo Costa, Luis M. S. Russo, Rui Henriques

Data Structures and Algorithms


Given an indeterminate string pattern $p$ and an indeterminate string text $t$, the problem of order-preserving pattern matching with character uncertainties ($\mu$OPPM) is to find all substrings of $t$ that satisfy one of the possible orderings defined by $p$. When the text and pattern are determinate strings, we are in the presence of the well-studied exact order-preserving pattern matching (OPPM) problem, with diverse applications in time series analysis. Despite its relevance, the exact OPPM problem suffers from two major drawbacks: 1) the inability to deal with indetermination in the text, thus preventing the analysis of noisy time series; and 2) the inability to deal with indetermination in the pattern, thus imposing the strict satisfaction of the orders among all pattern positions. This paper provides the first polynomial-time algorithm to answer the $\mu$OPPM problem when indetermination is observed on the pattern or text. Given two strings with length $m$ and $O(r)$ uncertain characters per string position, we show that the $\mu$OPPM problem can be solved in $O(mr \lg r)$ time when one string is indeterminate and $r \in \mathbb{N}^+$. Mappings into satisfiability problems are provided when indetermination is observed on both the pattern and the text, and results concerning the general problem complexity are presented as well, with the $\mu$OPPM problem proved to be NP-hard in general.
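To make the problem statement concrete, the following is a minimal brute-force sketch, not the algorithm from the paper. It assumes a determinate pattern and an indeterminate text given as one tuple of candidate values per position, and it tries every combination of candidates in each window, so it is exponential in the window length. The function names `order_isomorphic`, `oppm`, and `mu_oppm_naive` are hypothetical and chosen only for illustration.

```python
from itertools import product


def order_isomorphic(a, b):
    """Return True if sequences a and b have the same relative order.

    Every pair of positions must compare the same way (<, ==, >) in both.
    """
    if len(a) != len(b):
        return False
    n = len(a)
    return all(
        (a[i] < a[j]) == (b[i] < b[j]) and (a[i] == a[j]) == (b[i] == b[j])
        for i in range(n)
        for j in range(i + 1, n)
    )


def oppm(pattern, text):
    """Exact OPPM on determinate strings: start indices of matching windows."""
    m = len(pattern)
    return [
        i for i in range(len(text) - m + 1)
        if order_isomorphic(pattern, text[i:i + m])
    ]


def mu_oppm_naive(pattern, text):
    """Naive OPPM with an indeterminate text (assumed representation).

    text[i] is a tuple of candidate values for position i. A window matches
    if at least one choice of candidates is order-isomorphic to the pattern.
    """
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        if any(order_isomorphic(pattern, choice) for choice in product(*window)):
            hits.append(i)
    return hits
```

For example, `oppm([1, 3, 2], [10, 20, 15, 30, 25, 5])` reports the windows starting at 0 and 2, since `[10, 20, 15]` and `[15, 30, 25]` both follow the low, high, middle shape of the pattern.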


Covering Problems for Partial Words and for Indeterminate Strings

Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka (2014)

We consider the problem of computing a shortest solid cover of an indeterminate string. An indeterminate string may contain non-solid symbols, each of which specifies a subset of the alphabet that could be present at the corresponding position. We also consider covering partial words, which are a special case of indeterminate strings where each non-solid symbol is a don't-care symbol. We prove that the indeterminate string covering problem and the partial word covering problem are NP-complete for a binary alphabet and show that both problems are fixed-parameter tractable with respect to $k$, the number of non-solid symbols. For the indeterminate string covering problem we obtain a $2^{O(k \log k)} + n k^{O(1)}$-time algorithm. For the partial word covering problem we obtain a $2^{O(\sqrt{k} \log k)} + n k^{O(1)}$-time algorithm. We prove that, unless the Exponential Time Hypothesis is false, no $2^{o(\sqrt{k})} n^{O(1)}$-time solution exists for either problem, which shows that our algorithm for this case is close to optimal. We also present an algorithm for both problems which is feasible in practice.

Data Structures and Algorithms

Pattern Discovery in Colored Strings

Zsuzsanna Liptak, Simon J. Puglisi, Massimiliano Rossi (2020)

In this paper, we consider the problem of identifying patterns of interest in colored strings. A colored string is a string where each position is assigned one of a finite set of colors. Our task is to find substrings of the colored string that always occur followed by the same color at the same distance. The problem is motivated by applications in embedded systems verification, in particular, assertion mining. The goal there is to automatically find properties of the embedded system from the analysis of its simulation traces. We show that, in our setting, the number of patterns of interest is upper-bounded by $\mathcal{O}(n^2)$, where $n$ is the length of the string. We introduce a baseline algorithm, running in $\mathcal{O}(n^2)$ time, which identifies all patterns of interest satisfying certain minimality conditions, for all colors in the string. For the case where one is interested in patterns related to one color only, we also provide a second algorithm which runs in $\mathcal{O}(n^2 \log n)$ time in the worst case but is faster than the baseline algorithm in practice. Both solutions use suffix trees, and the second algorithm also uses an appropriately defined priority queue, which allows us to reduce the number of computations. We performed an experimental evaluation of the proposed approaches over both synthetic and real-world datasets, and found that the second algorithm outperforms the first algorithm on all simulated data, while on the real-world data, the performance varies between a slight slowdown (on half of the datasets) and a speedup by a factor of up to 11.

Data Structures and Algorithms

Pattern Matching in Multiple Streams

Raphael Clifford, Markus Jalsenius, Ely Porat (2012)

We investigate the problem of deterministic pattern matching in multiple streams. In this model, one symbol arrives at a time and is associated with one of s streaming texts. The task at each time step is to report if there is a new match between a fixed pattern of length m and a newly updated stream. As is usual in the streaming context, the goal is to use as little space as possible while still reporting matches quickly. We give almost matching upper and lower space bounds for three distinct pattern matching problems. For exact matching we show that the problem can be solved in constant time per arriving symbol and O(m+s) words of space. For the k-mismatch and k-difference problems we give O(k) time solutions that require O(m+ks) words of space. In all three cases we also give space lower bounds which show our methods are optimal up to a single logarithmic factor. Finally we set out a number of open problems related to this new model for pattern matching.
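The multi-stream model for exact matching can be illustrated with a minimal sketch built on the classic Knuth-Morris-Pratt automaton. This is an assumption made for illustration, not the construction from the paper: it keeps one failure table of size m plus one integer state per stream, so it uses O(m+s) words, but its per-symbol cost is only amortized constant rather than worst-case constant. The names `failure_table`, `MultiStreamMatcher`, and `feed` are hypothetical, and the pattern is assumed to be non-empty.

```python
def failure_table(pattern):
    """KMP failure function: fail[i] is the length of the longest proper
    border (prefix that is also a suffix) of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


class MultiStreamMatcher:
    """Tracks matches of one fixed, non-empty pattern across several streams."""

    def __init__(self, pattern, num_streams):
        self.pattern = pattern
        self.fail = failure_table(pattern)   # shared by all streams: O(m)
        self.state = [0] * num_streams       # one integer per stream: O(s)

    def feed(self, stream_id, symbol):
        """Consume one symbol for one stream; return True on a new match."""
        k = self.state[stream_id]
        if k == len(self.pattern):           # previous symbol completed a match
            k = self.fail[k - 1]
        while k and symbol != self.pattern[k]:
            k = self.fail[k - 1]
        if symbol == self.pattern[k]:
            k += 1
        self.state[stream_id] = k
        return k == len(self.pattern)
```

Each call to `feed` touches only the state of the stream that received the symbol, which is why interleaving symbols from different streams does not disturb partial matches in the others.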

Data Structures and Algorithms

Pattern Matching under Polynomial Transformation

Ayelet Butman, Peter Clifford, Raphael Clifford (2011)

We consider a class of pattern matching problems where a normalising transformation is applied at every alignment. Normalised pattern matching plays a key role in fields as diverse as image processing and musical information processing where application specific transformations are often applied to the input. By considering the class of polynomial transformations of the input, we provide fast algorithms and the first lower bounds for both new and old problems. Given a pattern of length m and a longer text of length n where both are assumed to contain integer values only, we first show O(n log m) time algorithms for pattern matching under linear transformations even when wildcard symbols can occur in the input. We then show how to extend the technique to polynomial transformations of arbitrary degree. Next we consider the problem of finding the minimum Hamming distance under polynomial transformation. We show that, for any epsilon>0, there cannot exist an O(n m^(1-epsilon)) time algorithm for additive and linear transformations conditional on the hardness of the classic 3SUM problem. Finally, we consider a version of the Hamming distance problem under additive transformations with a bound k on the maximum distance that need be reported. We give a deterministic O(nk log k) time solution which we then improve by careful use of randomisation to O(n sqrt(k log k) log n) time for sufficiently small k. Our randomised solution outputs the correct answer at every position with high probability.
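As a rough illustration of matching under a linear transformation, the sketch below checks each alignment directly in O(nm) total time; it is not the O(n log m) algorithm described in the abstract and it ignores wildcards. It assumes that a window matches when some rational pair (a, b) satisfies window[j] = a * pattern[j] + b at every position, with a = 0 allowed; that reading of the problem, and the names `linear_match` and `linear_matches`, are assumptions made for this example.

```python
def linear_match(pattern, window):
    """True if window[j] == a * pattern[j] + b for some rational a and b.

    Uses integer cross-multiplication, so no floating point is involved.
    Assumes both sequences are non-empty and have equal length.
    """
    # Find a position whose pattern value differs from the first one.
    j1 = next((j for j in range(1, len(pattern)) if pattern[j] != pattern[0]), None)
    if j1 is None:
        # Constant pattern: a * c + b is constant, so the window must be too.
        return all(w == window[0] for w in window)
    dp = pattern[j1] - pattern[0]   # non-zero by construction
    dw = window[j1] - window[0]
    # The slope dw / dp must explain every position relative to position 0.
    return all(
        (w - window[0]) * dp == dw * (p - pattern[0])
        for p, w in zip(pattern, window)
    )


def linear_matches(pattern, text):
    """Start indices of windows that match the pattern under some a, b."""
    m = len(pattern)
    return [
        i for i in range(len(text) - m + 1)
        if linear_match(pattern, text[i:i + m])
    ]
```

With pattern `[1, 2, 4]` and text `[3, 5, 9, 17, 0]`, the first window corresponds to a = 2, b = 1 and the second to a = 4, b = 1, while the third window has no consistent slope.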

Data Structures and Algorithms

Streaming Pattern Matching with d Wildcards

Shay Golan, Tsvi Kopelowitz, Ely Porat (2017)

In the pattern matching with $d$ wildcards problem one is given a text $T$ of length $n$ and a pattern $P$ of length $m$ that contains $d$ wildcard characters, each denoted by a special symbol $?$. A wildcard character matches any other character. The goal is to establish for each $m$-length substring of $T$ whether it matches $P$. In the streaming model variant of the pattern matching with $d$ wildcards problem the text $T$ arrives one character at a time and the goal is to report, before the next character arrives, if the last $m$ characters match $P$ while using only $o(m)$ words of space. In this paper we introduce two new algorithms for the $d$ wildcard pattern matching problem in the streaming model. The first is a randomized Monte Carlo algorithm that is parameterized by a constant $0 \leq \delta \leq 1$. This algorithm uses $\tilde{O}(d^{1-\delta})$ amortized time per character and $\tilde{O}(d^{1+\delta})$ words of space. The second algorithm, which is used as a black box in the first algorithm, is a randomized Monte Carlo algorithm which uses $O(d + \log m)$ worst-case time per character and $O(d \log m)$ words of space.
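The matching rule itself (a wildcard in the pattern matches any text character) can be written as a short offline check. This is a naive O(nm) sketch that stores the whole text, so it says nothing about the streaming algorithms or their space bounds; the function name `matches_with_wildcards` and its `wildcard` parameter are hypothetical.

```python
def matches_with_wildcards(pattern, text, wildcard="?"):
    """Return start indices i where text[i:i + m] matches pattern.

    A pattern character equal to `wildcard` matches any text character;
    every other pattern character must match exactly.
    """
    m = len(pattern)
    return [
        i for i in range(len(text) - m + 1)
        if all(pc == wildcard or pc == tc
               for pc, tc in zip(pattern, text[i:i + m]))
    ]
```

For instance, the pattern `a?c` matches the text `abcaxcac` at positions 0 and 3.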

Data Structures and Algorithms