ترغب بنشر مسار تعليمي؟ اضغط هنا

Improved Circular $k$-Mismatch Sketches

117   0   0.0 ( 0 )
 نشر من قبل Przemys{\\l}aw Uzna\\'nski
 تاريخ النشر 2020
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

The shift distance $mathsf{sh}(S_1,S_2)$ between two strings $S_1$ and $S_2$ of the same length is defined as the minimum Hamming distance between $S_1$ and any rotation (cyclic shift) of $S_2$. We study the problem of sketching the shift distance, which is the following communication complexity problem: Strings $S_1$ and $S_2$ of length $n$ are given to two identical players (encoders), who independently compute sketches (summaries) $mathtt{sk}(S_1)$ and $mathtt{sk}(S_2)$, respectively, so that upon receiving the two sketches, a third player (decoder) is able to compute (or approximate) $mathsf{sh}(S_1,S_2)$ with high probability. This paper primarily focuses on the more general $k$-mismatch version of the problem, where the decoder is allowed to declare a failure if $mathsf{sh}(S_1,S_2)>k$, where $k$ is a parameter known to all parties. Andoni et al. (STOC13) introduced exact circular $k$-mismatch sketches of size $widetilde{O}(k+D(n))$, where $D(n)$ is the number of divisors of $n$. Andoni et al. also showed that their sketch size is optimal in the class of linear homomorphic sketches. We circumvent this lower bound by designing a (non-linear) exact circular $k$-mismatch sketch of size $widetilde{O}(k)$; this size matches communication-complexity lower bounds. We also design $(1pm varepsilon)$-approximate circular $k$-mismatch sketch of size $widetilde{O}(min(varepsilon^{-2}sqrt{k}, varepsilon^{-1.5}sqrt{n}))$, which improves upon an $widetilde{O}(varepsilon^{-2}sqrt{n})$-size sketch of Crouch and McGregor (APPROX11).



قيم البحث

اقرأ أيضاً

We consider the streaming complexity of a fundamental task in approximate pattern matching: the $k$-mismatch problem. It asks to compute Hamming distances between a pattern of length $n$ and all length-$n$ substrings of a text for which the Hamming d istance does not exceed a given threshold $k$. In our problem formulation, we report not only the Hamming distance but also, on demand, the full emph{mismatch information}, that is the list of mismatched pairs of symbols and their indices. The twin challenges of streaming pattern matching derive from the need both to achieve small working space and also to guarantee that every arriving input symbol is processed quickly. We present a streaming algorithm for the $k$-mismatch problem which uses $O(klog{n}logfrac{n}{k})$ bits of space and spends ourcomplexity time on each symbol of the input stream, which consists of the pattern followed by the text. The running time almost matches the classic offline solution and the space usage is within a logarithmic factor of optimal. Our new algorithm therefore effectively resolves and also extends an open problem first posed in FOCS09. En route to this solution, we also give a deterministic $O( k (log frac{n}{k} + log |Sigma|) )$-bit encoding of all the alignments with Hamming distance at most $k$ of a length-$n$ pattern within a text of length $O(n)$. This secondary result provides an optimal solution to a natural communication complexity problem which may be of independent interest.
We revisit the $k$-mismatch problem in the streaming model on a pattern of length $m$ and a streaming text of length $n$, both over a size-$sigma$ alphabet. The current state-of-the-art algorithm for the streaming $k$-mismatch problem, by Clifford et al. [SODA 2019], uses $tilde O(k)$ space and $tilde Obig(sqrt kbig)$ worst-case time per character. The space complexity is known to be (unconditionally) optimal, and the worst-case time per character matches a conditional lower bound. However, there is a gap between the total time cost of the algorithm, which is $tilde O(nsqrt k)$, and the fastest known offline algorithm, which costs $tilde Obig(n + minbig(frac{nk}{sqrt m},sigma nbig)big)$ time. Moreover, it is not known whether improvements over the $tilde O(nsqrt k)$ total time are possible when using more than $O(k)$ space. We address these gaps by designing a randomized streaming algorithm for the $k$-mismatch problem that, given an integer parameter $kle s le m$, uses $tilde O(s)$ space and costs $tilde Obig(n+minbig(frac {nk^2}m,frac{nk}{sqrt s},frac{sigma nm}sbig)big)$ total time. For $s=m$, the total runtime becomes $tilde Obig(n + minbig(frac{nk}{sqrt m},sigma nbig)big)$, which matches the time cost of the fastest offline algorithm. Moreover, the worst-case time cost per character is still $tilde Obig(sqrt kbig)$.
Data structures that allow efficient distance estimation (distance oracles, distance sketches, etc.) have been extensively studied, and are particularly well studied in centralized models and classical distributed models such as CONGEST. We initiate their study in newer (and arguably more realistic) models of distributed computation: the Congested Clique model and the Massively Parallel Computation (MPC) model. We provide efficient constructions in both of these models, but our core results are for MPC. In MPC we give two main results: an algorithm that constructs stretch/space optimal distance sketches but takes a (small) polynomial number of rounds, and an algorithm that constructs distance sketches with worse stretch but that only takes polylogarithmic rounds. Along the way, we show that other useful combinatorial structures can also be computed in MPC. In particular, one key component we use to construct distance sketches are an MPC construction of the hopsets of Elkin and Neiman (2016). This result has additional applications such as the first polylogarithmic time algorithm for constant approximate single-source shortest paths for weighted graphs in the low memory MPC setting.
We introduce Density sketches (DS): a succinct online summary of the data distribution. DS can accurately estimate point wise probability density. Interestingly, DS also provides a capability to sample unseen novel data from the underlying data distr ibution. Thus, analogous to popular generative models, DS allows us to succinctly replace the real-data in almost all machine learning pipelines with synthetic examples drawn from the same distribution as the original data. However, unlike generative models, which do not have any statistical guarantees, DS leads to theoretically sound asymptotically converging consistent estimators of the underlying density function. Density sketches also have many appealing properties making them ideal for large-scale distributed applications. DS construction is an online algorithm. The sketches are additive, i.e., the sum of two sketches is the sketch of the combined data. These properties allow data to be collected from distributed sources, compressed into a density sketch, efficiently transmitted in the sketch form to a central server, merged, and re-sampled into a synthetic database for modeling applications. Thus, density sketches can potentially revolutionize how we store, communicate, and distribute data.
We present a randomized algorithm that takes as input an undirected $n$-vertex graph $G$ with maximum degree $Delta$ and an integer $k > 3Delta$, and returns a random proper $k$-coloring of $G$. The distribution of the coloring is emph{perfectly} uni form over the set of all proper $k$-colorings; the expected running time of the algorithm is $mathrm{poly}(k,n)=widetilde{O}(nDelta^2cdot log(k))$. This improves upon a result of Huber~(STOC 1998) who obtained a polynomial time perfect sampling algorithm for $k>Delta^2+2Delta$. Prior to our work, no algorithm with expected running time $mathrm{poly}(k,n)$ was known to guarantee perfectly sampling with sub-quadratic number of colors in general. Our algorithm (like several other perfect sampling algorithms including Hubers) is based on the Coupling from the Past method. Inspired by the emph{bounding chain} approach, pioneered independently by Huber~(STOC 1998) and Haggstrom & Nelander~(Scand.{} J.{} Statist., 1999), we employ a novel bounding chain to derive our result for the graph coloring problem.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا