Improved Circular $k$-Mismatch Sketches

117 0 0.0 ( 0 )

Download Cite

Added by Przemys{\\l}aw Uzna\\'nski

Publication date 2020

fields Informatics Engineering

and research's language is English

Authors Shay Golan - Tomasz Kociumaka - Tsvi Kopelowitz

Data Structures and Algorithms

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

The shift distance $mathsf{sh}(S_1,S_2)$ between two strings $S_1$ and $S_2$ of the same length is defined as the minimum Hamming distance between $S_1$ and any rotation (cyclic shift) of $S_2$. We study the problem of sketching the shift distance, which is the following communication complexity problem: Strings $S_1$ and $S_2$ of length $n$ are given to two identical players (encoders), who independently compute sketches (summaries) $mathtt{sk}(S_1)$ and $mathtt{sk}(S_2)$, respectively, so that upon receiving the two sketches, a third player (decoder) is able to compute (or approximate) $mathsf{sh}(S_1,S_2)$ with high probability. This paper primarily focuses on the more general $k$-mismatch version of the problem, where the decoder is allowed to declare a failure if $mathsf{sh}(S_1,S_2)>k$, where $k$ is a parameter known to all parties. Andoni et al. (STOC13) introduced exact circular $k$-mismatch sketches of size $widetilde{O}(k+D(n))$, where $D(n)$ is the number of divisors of $n$. Andoni et al. also showed that their sketch size is optimal in the class of linear homomorphic sketches. We circumvent this lower bound by designing a (non-linear) exact circular $k$-mismatch sketch of size $widetilde{O}(k)$; this size matches communication-complexity lower bounds. We also design $(1pm varepsilon)$-approximate circular $k$-mismatch sketch of size $widetilde{O}(min(varepsilon^{-2}sqrt{k}, varepsilon^{-1.5}sqrt{n}))$, which improves upon an $widetilde{O}(varepsilon^{-2}sqrt{n})$-size sketch of Crouch and McGregor (APPROX11).

rate research

The streaming $k$-mismatch problem

117 - Raphael Clifford , Tomasz Kociumaka , Ely Porat 2017

We consider the streaming complexity of a fundamental task in approximate pattern matching: the $k$-mismatch problem. It asks to compute Hamming distances between a pattern of length $n$ and all length-$n$ substrings of a text for which the Hamming distance does not exceed a given threshold $k$. In our problem formulation, we report not only the Hamming distance but also, on demand, the full emph{mismatch information}, that is the list of mismatched pairs of symbols and their indices. The twin challenges of streaming pattern matching derive from the need both to achieve small working space and also to guarantee that every arriving input symbol is processed quickly. We present a streaming algorithm for the $k$-mismatch problem which uses $O(klog{n}logfrac{n}{k})$ bits of space and spends ourcomplexity time on each symbol of the input stream, which consists of the pattern followed by the text. The running time almost matches the classic offline solution and the space usage is within a logarithmic factor of optimal. Our new algorithm therefore effectively resolves and also extends an open problem first posed in FOCS09. En route to this solution, we also give a deterministic $O( k (log frac{n}{k} + log |Sigma|) )$-bit encoding of all the alignments with Hamming distance at most $k$ of a length-$n$ pattern within a text of length $O(n)$. This secondary result provides an optimal solution to a natural communication complexity problem which may be of independent interest.

Data Structures and Algorithms

The Streaming k-Mismatch Problem: Tradeoffs between Space and Total Time

79 - Shay Golan , Tomasz Kociumaka , Tsvi Kopelowitz 2020

We revisit the $k$-mismatch problem in the streaming model on a pattern of length $m$ and a streaming text of length $n$, both over a size-$sigma$ alphabet. The current state-of-the-art algorithm for the streaming $k$-mismatch problem, by Clifford et al. [SODA 2019], uses $tilde O(k)$ space and $tilde Obig(sqrt kbig)$ worst-case time per character. The space complexity is known to be (unconditionally) optimal, and the worst-case time per character matches a conditional lower bound. However, there is a gap between the total time cost of the algorithm, which is $tilde O(nsqrt k)$, and the fastest known offline algorithm, which costs $tilde Obig(n + minbig(frac{nk}{sqrt m},sigma nbig)big)$ time. Moreover, it is not known whether improvements over the $tilde O(nsqrt k)$ total time are possible when using more than $O(k)$ space. We address these gaps by designing a randomized streaming algorithm for the $k$-mismatch problem that, given an integer parameter $kle s le m$, uses $tilde O(s)$ space and costs $tilde Obig(n+minbig(frac {nk^2}m,frac{nk}{sqrt s},frac{sigma nm}sbig)big)$ total time. For $s=m$, the total runtime becomes $tilde Obig(n + minbig(frac{nk}{sqrt m},sigma nbig)big)$, which matches the time cost of the fastest offline algorithm. Moreover, the worst-case time cost per character is still $tilde Obig(sqrt kbig)$.

Data Structures and Algorithms

Massively Parallel Approximate Distance Sketches

108 - Michael Dinitz , Yasamin Nazari 2018

Data structures that allow efficient distance estimation (distance oracles, distance sketches, etc.) have been extensively studied, and are particularly well studied in centralized models and classical distributed models such as CONGEST. We initiate their study in newer (and arguably more realistic) models of distributed computation: the Congested Clique model and the Massively Parallel Computation (MPC) model. We provide efficient constructions in both of these models, but our core results are for MPC. In MPC we give two main results: an algorithm that constructs stretch/space optimal distance sketches but takes a (small) polynomial number of rounds, and an algorithm that constructs distance sketches with worse stretch but that only takes polylogarithmic rounds. Along the way, we show that other useful combinatorial structures can also be computed in MPC. In particular, one key component we use to construct distance sketches are an MPC construction of the hopsets of Elkin and Neiman (2016). This result has additional applications such as the first polylogarithmic time algorithm for constant approximate single-source shortest paths for weighted graphs in the low memory MPC setting.

Data Structures and Algorithms Distributed Parallel and Cluster Computing

Density Sketches for Sampling and Estimation

134 - Aditya Desai , Benjamin Coleman , Anshumali Shrivastava 2021

We introduce Density sketches (DS): a succinct online summary of the data distribution. DS can accurately estimate point wise probability density. Interestingly, DS also provides a capability to sample unseen novel data from the underlying data distribution. Thus, analogous to popular generative models, DS allows us to succinctly replace the real-data in almost all machine learning pipelines with synthetic examples drawn from the same distribution as the original data. However, unlike generative models, which do not have any statistical guarantees, DS leads to theoretically sound asymptotically converging consistent estimators of the underlying density function. Density sketches also have many appealing properties making them ideal for large-scale distributed applications. DS construction is an online algorithm. The sketches are additive, i.e., the sum of two sketches is the sketch of the combined data. These properties allow data to be collected from distributed sources, compressed into a density sketch, efficiently transmitted in the sketch form to a central server, merged, and re-sampled into a synthetic database for modeling applications. Thus, density sketches can potentially revolutionize how we store, communicate, and distribute data.

Data Structures and Algorithms Machine Learning Machine Learning

Improved Bounds for Perfect Sampling of $k$-Colorings in Graphs

82 - Siddharth Bhandari , Sayantan Chakraborty 2019

We present a randomized algorithm that takes as input an undirected $n$-vertex graph $G$ with maximum degree $Delta$ and an integer $k > 3Delta$, and returns a random proper $k$-coloring of $G$. The distribution of the coloring is emph{perfectly} uniform over the set of all proper $k$-colorings; the expected running time of the algorithm is $mathrm{poly}(k,n)=widetilde{O}(nDelta^2cdot log(k))$. This improves upon a result of Huber~(STOC 1998) who obtained a polynomial time perfect sampling algorithm for $k>Delta^2+2Delta$. Prior to our work, no algorithm with expected running time $mathrm{poly}(k,n)$ was known to guarantee perfectly sampling with sub-quadratic number of colors in general. Our algorithm (like several other perfect sampling algorithms including Hubers) is based on the Coupling from the Past method. Inspired by the emph{bounding chain} approach, pioneered independently by Huber~(STOC 1998) and Haggstrom & Nelander~(Scand.{} J.{} Statist., 1999), we employ a novel bounding chain to derive our result for the graph coloring problem.

Data Structures and Algorithms Discrete Mathematics