ترغب بنشر مسار تعليمي؟ اضغط هنا

Pattern Discovery in Colored Strings

202   0   0.0 ( 0 )
 نشر من قبل Massimiliano Rossi
 تاريخ النشر 2020
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

In this paper, we consider the problem of identifying patterns of interest in colored strings. A colored string is a string where each position is assigned one of a finite set of colors. Our task is to find substrings of the colored string that always occur followed by the same color at the same distance. The problem is motivated by applications in embedded systems verification, in particular, assertion mining. The goal there is to automatically find properties of the embedded system from the analysis of its simulation traces. We show that, in our setting, the number of patterns of interest is upper-bounded by $mathcal{O}(n^2)$, where $n$ is the length of the string. We introduce a baseline algorithm, running in $mathcal{O}(n^2)$ time, which identifies all patterns of interest satisfying certain minimality conditions, for all colors in the string. For the case where one is interested in patterns related to one color only, we also provide a second algorithm which runs in $mathcal{O}(n^2log n)$ time in the worst case but is faster than the baseline algorithm in practice. Both solutions use suffix trees, and the second algorithm also uses an appropriately defined priority queue, which allows us to reduce the number of computations. We performed an experimental evaluation of the proposed approaches over both synthetic and real-world datasets, and found that the second algorithm outperforms the first algorithm on all simulated data, while on the real-world data, the performance varies between a slight slowdown (on half of the datasets) and a speedup by a factor of up to 11.

قيم البحث

اقرأ أيضاً

Given an indeterminate string pattern $p$ and an indeterminate string text $t$, the problem of order-preserving pattern matching with character uncertainties ($mu$OPPM) is to find all substrings of $t$ that satisfy one of the possible orderings defin ed by $p$. When the text and pattern are determinate strings, we are in the presence of the well-studied exact order-preserving pattern matching (OPPM) problem with diverse applications on time series analysis. Despite its relevance, the exact OPPM problem suffers from two major drawbacks: 1) the inability to deal with indetermination in the text, thus preventing the analysis of noisy time series; and 2) the inability to deal with indetermination in the pattern, thus imposing the strict satisfaction of the orders among all pattern positions. This paper provides the first polynomial algorithm to answer the $mu$OPPM problem when indetermination is observed on the pattern or text. Given two strings with length $m$ and $O(r)$ uncertain characters per string position, we show that the $mu$OPPM problem can be solved in $O(mrlg r)$ time when one string is indeterminate and $rinmathbb{N}^+$. Mappings into satisfiability problems are provided when indetermination is observed on both the pattern and the text, and results concerning the general problem complexity are presented as well, with $mu$OPPM problem proved to be NP-hard in general.
The Karp-Rabin fingerprint of a string is a type of hash value that due to its strong properties has been used in many string algorithms. In this paper we show how to construct a data structure for a string $S$ of size $N$ compressed by a context-fre e grammar of size $n$ that answers fingerprint queries. That is, given indices $i$ and $j$, the answer to a query is the fingerprint of the substring $S[i,j]$. We present the first O(n) space data structures that answer fingerprint queries without decompressing any characters. For Straight Line Programs (SLP) we get $O(log N)$ query time, and for Linear SLPs (an SLP derivative that captures LZ78 compression and its variations) we get $O(log log N)$ query time. Hence, our data structures has the same time and space complexity as for random access in SLPs. We utilize the fingerprint data structures to solve the longest common extension problem in query time $O(log N log lce)$ and $O(log lce loglog lce + loglog N)$ for SLPs and Linear SLPs, respectively. Here, $lce$ denotes the length of the LCE.
We investigate the computational complexity of the following problem. We are given a graph in which each vertex has an initial and a target color. Each pair of adjacent vertices can swap their current colors. Our goal is to perform the minimum number of swaps so that the current and target colors agree at each vertex. When the colors are chosen from {1,2,...,c}, we call this problem c-Colored Token Swapping since the current color of a vertex can be seen as a colored token placed on the vertex. We show that c-Colored Token Swapping is NP-complete for c = 3 even if input graphs are restricted to connected planar bipartite graphs of maximum degree 3. We then show that 2-Colored Token Swapping can be solved in polynomial time for general graphs and in linear time for trees. Besides, we show that, the problem for complete graphs is fixed-parameter tractable when parameterized by the number of colors, while it is known to be NP-complete when the number of colors is unbounded.
We present a number of new results about range searching for colored (or categorical) data: 1. For a set of $n$ colored points in three dimensions, we describe randomized data structures with $O(nmathop{rm polylog}n)$ space that can report the dist inct colors in any query orthogonal range (axis-aligned box) in $O(kmathop{rm polyloglog} n)$ expected time, where $k$ is the number of distinct colors in the range, assuming that coordinates are in ${1,ldots,n}$. Previous data structures require $O(frac{log n}{loglog n} + k)$ query time. Our result also implies improvements in higher constant dimensions. 2. Our data structures can be adapted to halfspace ranges in three dimensions (or circular ranges in two dimensions), achieving $O(klog n)$ expected query time. Previous data structures require $O(klog^2n)$ query time. 3. For a set of $n$ colored points in two dimensions, we describe a data structure with $O(nmathop{rm polylog}n)$ space that can answer colored type-2 range counting queries: report the number of occurrences of every distinct color in a query orthogonal range. The query time is $O(frac{log n}{loglog n} + kloglog n)$, where $k$ is the number of distinct colors in the range. Naively performing $k$ uncolored range counting queries would require $O(kfrac{log n}{loglog n})$ time. Our data structures are designed using a variety of techniques, including colored variants of randomized incremental construction (which may be of independent interest), colored variants of shallow cuttings, and bit-packing tricks.
A string $S[1,n]$ is a power (or tandem repeat) of order $k$ and period $n/k$ if it can decomposed into $k$ consecutive equal-length blocks of letters. Powers and periods are fundamental to string processing, and algorithms for their efficient comput ation have wide application and are heavily studied. Recently, Fici et al. (Proc. ICALP 2016) defined an {em anti-power} of order $k$ to be a string composed of $k$ pairwise-distinct blocks of the same length ($n/k$, called {em anti-period}). Anti-powers are a natural converse to powers, and are objects of combinatorial interest in their own right. In this paper we initiate the algorithmic study of anti-powers. Given a string $S$, we describe an optimal algorithm for locating all substrings of $S$ that are anti-powers of a specified order. The optimality of the algorithm follows form a combinatorial lemma that provides a lower bound on the number of distinct anti-powers of a given order: we prove that a string of length $n$ can contain $Theta(n^2/k)$ distinct anti-powers of order $k$.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا