Approximating Text-to-Pattern Hamming Distances

Published by: Tomasz Kociumaka

Publication date: 2020

Research field: Informatics Engineering

Paper language: English

Authors: Timothy M. Chan, Shay Golan, Tomasz Kociumaka

Data Structures and Algorithms

We revisit a fundamental problem in string matching: given a pattern of length $m$ and a text of length $n$, both over an alphabet of size $\sigma$, compute the Hamming distance between the pattern and the text at every location. Several $(1+\epsilon)$-approximation algorithms have been proposed in the literature, with running time of the form $O(\epsilon^{-O(1)} n \log n \log m)$, all using fast Fourier transform (FFT). We describe a simple $(1+\epsilon)$-approximation algorithm that is faster and does not need FFT. Combining our approach with additional ideas leads to numerous new results:
- We obtain the first linear-time approximation algorithm; the running time is $O(\epsilon^{-2} n)$.
- We obtain a faster exact algorithm computing all Hamming distances up to a given threshold $k$; its running time improves previous results by logarithmic factors and is linear if $k \le \sqrt{m}$.
- We obtain approximation algorithms with better $\epsilon$-dependence using rectangular matrix multiplication. The time bound is $\tilde{O}(n)$ when the pattern is sufficiently long: $m \ge \epsilon^{-28}$. Previous algorithms require $\tilde{O}(\epsilon^{-1} n)$ time.
- When $k$ is not too small, we obtain a truly sublinear-time algorithm to find all locations with Hamming distance approximately (up to a constant factor) less than $k$, in $O((n/k^{\Omega(1)} + occ)\, n^{o(1)})$ time, where $occ$ is the output size. The algorithm leads to a property tester, returning true if an exact match exists and false if the Hamming distance is more than $\delta m$ at every location, running in $\tilde{O}(\delta^{-1/3} n^{2/3} + \delta^{-1} n/m)$ time.
- We obtain a streaming algorithm to report all locations with Hamming distance approximately less than $k$, using $\tilde{O}(\epsilon^{-2} \sqrt{k})$ space. Previously, streaming algorithms were known for the exact problem with $\tilde{O}(k)$ space or for the approximate problem with $\tilde{O}(\epsilon^{-O(1)} \sqrt{m})$ space.
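To make the problem statement concrete, here is a minimal Python sketch of two baselines. It is not the algorithm from the paper: `hamming_distances` is the naive exact $O(nm)$ computation, and `sampled_hamming_distances` is a plain uniform-sampling estimator included only to illustrate what "approximating" the distances can look like. Both function names and the `samples` and `seed` parameters are hypothetical. Plain sampling of this kind does not by itself give a $(1+\epsilon)$ guarantee when the true distance is small.

```python
import random


def hamming_distances(text, pattern):
    """Exact baseline: distance between pattern and every length-m window of text.

    Runs in O(n * m) time; returns a list of n - m + 1 integers.
    """
    n, m = len(text), len(pattern)
    return [
        sum(1 for j in range(m) if text[i + j] != pattern[j])
        for i in range(n - m + 1)
    ]


def sampled_hamming_distances(text, pattern, samples, seed=0):
    """Illustrative estimator (an assumption, not the paper's method).

    Picks `samples` pattern positions uniformly at random (with replacement),
    counts mismatches only at those positions, and rescales by m / samples.
    """
    rng = random.Random(seed)
    n, m = len(text), len(pattern)
    positions = [rng.randrange(m) for _ in range(samples)]
    scale = m / samples
    return [
        scale * sum(1 for j in positions if text[i + j] != pattern[j])
        for i in range(n - m + 1)
    ]


if __name__ == "__main__":
    print(hamming_distances("abracadabra", "abra"))
    print(sampled_hamming_distances("abracadabra", "abra", samples=8))
```

The estimator is exact at the two extremes (a window that matches everywhere, or mismatches everywhere) and noisy in between, which is why the approximation algorithms discussed in the abstract need more careful constructions.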

Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter (2013)

We consider several types of internal queries: questions about subwords of a text. As the main tool we develop an optimal data structure for the problem called here internal pattern matching. This data structure provides constant-time answers to queries about occurrences of one subword $x$ in another subword $y$ of a given text, assuming that $|y|=\mathcal{O}(|x|)$, which allows for a constant-space representation of all occurrences. This problem can be viewed as a natural extension of the well-studied pattern matching problem. The data structure has linear size and admits a linear-time construction algorithm. Using the solution to the internal pattern matching problem, we obtain very efficient data structures answering queries about: primitivity of subwords, periods of subwords, general substring compression, and cyclic equivalence of two subwords. All these results improve upon the best previously known counterparts. The linear construction time of our data structure also allows us to improve the algorithm for finding $\delta$-subrepetitions in a text (a more general version of maximal repetitions, also called runs). For any fixed $\delta$ we obtain the first linear-time algorithm, which matches the linear time complexity of the algorithm computing runs. Our data structure has already been used as a part of the efficient solutions for subword suffix rank & selection, as well as substring compression using Burrows-Wheeler transform composed with run-length encoding.

Data Structures and Algorithms

Approximating the Largest Root and Applications to Interlacing Families

Nima Anari, Shayan Oveis Gharan, Amin Saberi (2017)

We study the problem of approximating the largest root of a real-rooted polynomial of degree $n$ using its top $k$ coefficients and give nearly matching upper and lower bounds. We present algorithms with running time polynomial in $k$ that use the top $k$ coefficients to approximate the maximum root within a factor of $n^{1/k}$ and $1+O(\tfrac{\log n}{k})^2$ when $k\leq \log n$ and $k>\log n$ respectively. We also prove corresponding information-theoretic lower bounds of $n^{\Omega(1/k)}$ and $1+\Omega\left(\frac{\log \frac{2n}{k}}{k}\right)^2$, and show strong lower bounds for the noisy version of the problem in which one is given access to approximate coefficients. This problem has applications in the context of the method of interlacing families of polynomials, which was used for proving the existence of Ramanujan graphs of all degrees, the solution of the Kadison-Singer problem, and bounding the integrality gap of the asymmetric traveling salesman problem. All of these involve computing the maximum root of certain real-rooted polynomials for which the top few coefficients are accessible in subexponential time. Our results yield an algorithm with the running time of $2^{\tilde O(\sqrt[3]{n})}$ for all of them.

Data Structures and Algorithms; Combinatorics

Approximate Hamming distance in a stream

Raphael Clifford, Tatiana Starikovskaya (2016)

We consider the problem of computing a $(1+\epsilon)$-approximation of the Hamming distance between a pattern of length $n$ and successive substrings of a stream. We first look at the one-way randomised communication complexity of this problem, giving Alice the first half of the stream and Bob the second half. We show the following: (1) If Alice and Bob both share the pattern then there is an $O(\epsilon^{-4} \log^2 n)$ bit randomised one-way communication protocol. (2) If only Alice has the pattern then there is an $O(\epsilon^{-2}\sqrt{n}\log n)$ bit randomised one-way communication protocol. We then go on to develop small space streaming algorithms for $(1+\epsilon)$-approximate Hamming distance which give worst case running time guarantees per arriving symbol. (1) For binary input alphabets there is an $O(\epsilon^{-3} \sqrt{n} \log^{2} n)$ space and $O(\epsilon^{-2} \log n)$ time streaming $(1+\epsilon)$-approximate Hamming distance algorithm. (2) For general input alphabets there is an $O(\epsilon^{-5} \sqrt{n} \log^{4} n)$ space and $O(\epsilon^{-4} \log^3 n)$ time streaming $(1+\epsilon)$-approximate Hamming distance algorithm.

Data Structures and Algorithms

A New Test for Hamming-Weight Dependencies

David Blackman, Sebastiano Vigna (2021)

We describe a new statistical test for pseudorandom number generators (PRNGs). Our test can find bias induced by dependencies among the Hamming weights of the outputs of a PRNG, even for PRNGs that pass state-of-the-art tests of the same kind from the literature, and in particular for generators based on $\mathbf{F}_2$-linear transformations such as the dSFMT, xoroshiro128+, and WELL512.

Data Structures and Algorithms; Mathematical Software

Tight Cell-Probe Bounds for Online Hamming Distance Computation

Raphael Clifford, Markus Jalsenius, Benjamin Sach (2012)

We show tight bounds for online Hamming distance computation in the cell-probe model with word size $w$. The task is to output the Hamming distance between a fixed string of length $n$ and the last $n$ symbols of a stream. We give a lower bound of $\Omega((d/w) \log n)$ time on average per output, where $d$ is the number of bits needed to represent an input symbol. We argue that this bound is tight within the model. The lower bound holds under randomisation and amortisation.
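The online task described above can be sketched in a few lines of Python. This is only a naive illustration of the input/output behaviour, spending $O(n)$ time per arriving symbol; it is not the cell-probe construction the abstract analyses, and the function name `online_hamming` is hypothetical.

```python
from collections import deque


def online_hamming(fixed, stream):
    """Yield the Hamming distance between `fixed` and the last n stream symbols.

    Nothing is yielded until n = len(fixed) symbols have arrived. Each output
    is recomputed from scratch, so this costs O(n) time per symbol.
    """
    n = len(fixed)
    window = deque(maxlen=n)  # keeps only the most recent n symbols
    for symbol in stream:
        window.append(symbol)
        if len(window) == n:
            yield sum(1 for a, b in zip(fixed, window) if a != b)


if __name__ == "__main__":
    print(list(online_hamming("ab", "abba")))
```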

Data Structures and Algorithms; Computational Complexity
