Projection-Cost-Preserving Sketches: Proof Strategies and Constructions

Published by Cameron Musco

Publication date: 2020

Research field: Informatics Engineering

Paper language: English

Authors: Cameron Musco, Christopher Musco

Data structures and algorithms; Machine learning; Numerical analysis

In this note we illustrate how common matrix approximation methods, such as random projection and random sampling, yield projection-cost-preserving sketches, as introduced in [FSS13, CEM+15]. A projection-cost-preserving sketch is a matrix approximation which, for a given parameter $k$, approximately preserves the distance of the target matrix to all $k$-dimensional subspaces. Such sketches have applications to scalable algorithms for linear algebra, data science, and machine learning. Our goal is to simplify the presentation of proof techniques introduced in [CEM+15] and [CMM17] so that they can serve as a guide for future work. We also refer the reader to [CYD19], which gives a similar simplified exposition of the proof covered in Section 2.
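For orientation, the guarantee described above is usually stated along the following lines. This is a sketch of the standard definition as it appears in the cited literature; the symbols $A$, $\tilde{A}$, $P$, $c$, and $\epsilon$ are notation chosen here for illustration and do not appear on this page.

```latex
% A \in R^{n x d} is the target matrix and \tilde{A} \in R^{n x d'} is the sketch.
\[
(1-\epsilon)\,\|A - PA\|_F^2
\;\le\; \|\tilde{A} - P\tilde{A}\|_F^2 + c
\;\le\; (1+\epsilon)\,\|A - PA\|_F^2
\quad \text{for all rank-$k$ orthogonal projections } P \in \mathbb{R}^{n \times n},
\]
% where c >= 0 is a fixed constant that may depend on A and \tilde{A} but not on P.
```

Here $\|A - PA\|_F^2$ is the squared distance from $A$ to the $k$-dimensional subspace onto which $P$ projects, so the condition says that the sketch reproduces every such distance up to a factor of $1 \pm \epsilon$ and a shared additive offset.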

Aditya Desai, Benjamin Coleman, Anshumali Shrivastava (2021)

We introduce Density sketches (DS): a succinct online summary of the data distribution. DS can accurately estimate pointwise probability density. Interestingly, DS also provides a capability to sample unseen novel data from the underlying data distribution. Thus, analogous to popular generative models, DS allows us to succinctly replace the real data in almost all machine learning pipelines with synthetic examples drawn from the same distribution as the original data. However, unlike generative models, which do not have any statistical guarantees, DS leads to theoretically sound, asymptotically converging, consistent estimators of the underlying density function. Density sketches also have many appealing properties making them ideal for large-scale distributed applications. DS construction is an online algorithm. The sketches are additive, i.e., the sum of two sketches is the sketch of the combined data. These properties allow data to be collected from distributed sources, compressed into a density sketch, efficiently transmitted in the sketch form to a central server, merged, and re-sampled into a synthetic database for modeling applications. Thus, density sketches can potentially revolutionize how we store, communicate, and distribute data.
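The additivity property mentioned above can be illustrated with a toy example. The sketch below is not the authors' Density Sketch construction; it swaps in a plain fixed-width histogram, and the class name `HistogramSketch` and its methods are hypothetical. It only shows what it means for a summary to be built online, merged by addition, queried for a density estimate, and re-sampled.

```python
import math
import random
from collections import Counter


class HistogramSketch:
    """Toy additive density summary (hypothetical, not the paper's DS):
    counts of 1-D points per fixed-width bin."""

    def __init__(self, bin_width: float = 0.5):
        self.bin_width = bin_width
        self.counts = Counter()  # bin index -> number of points seen
        self.n = 0               # total number of points seen

    def add(self, x: float) -> None:
        """Online update: one point at a time, constant work per point."""
        self.counts[math.floor(x / self.bin_width)] += 1
        self.n += 1

    def merge(self, other: "HistogramSketch") -> "HistogramSketch":
        """Additivity: the merged sketch equals the sketch of the combined data."""
        if other.bin_width != self.bin_width:
            raise ValueError("sketches must use the same bin width")
        merged = HistogramSketch(self.bin_width)
        merged.counts = self.counts + other.counts
        merged.n = self.n + other.n
        return merged

    def density(self, x: float) -> float:
        """Histogram density estimate at x."""
        if self.n == 0:
            return 0.0
        b = math.floor(x / self.bin_width)
        return self.counts[b] / (self.n * self.bin_width)

    def sample(self, rng: random.Random) -> float:
        """Draw a synthetic point: pick a bin by count, then a uniform offset."""
        if self.n == 0:
            raise ValueError("cannot sample from an empty sketch")
        bins = list(self.counts)
        weights = [self.counts[b] for b in bins]
        b = rng.choices(bins, weights=weights, k=1)[0]
        return (b + rng.random()) * self.bin_width
```

Two sources could each build a `HistogramSketch` locally and send only the bin counts to a server, which calls `merge` and then `sample` to produce synthetic points.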

Data structures and algorithms; Machine learning

Massively Parallel Approximate Distance Sketches

Michael Dinitz, Yasamin Nazari (2018)

Data structures that allow efficient distance estimation (distance oracles, distance sketches, etc.) have been extensively studied, and are particularly well studied in centralized models and classical distributed models such as CONGEST. We initiate their study in newer (and arguably more realistic) models of distributed computation: the Congested Clique model and the Massively Parallel Computation (MPC) model. We provide efficient constructions in both of these models, but our core results are for MPC. In MPC we give two main results: an algorithm that constructs stretch/space optimal distance sketches but takes a (small) polynomial number of rounds, and an algorithm that constructs distance sketches with worse stretch but that only takes polylogarithmic rounds. Along the way, we show that other useful combinatorial structures can also be computed in MPC. In particular, one key component we use to construct distance sketches is an MPC construction of the hopsets of Elkin and Neiman (2016). This result has additional applications such as the first polylogarithmic time algorithm for constant approximate single-source shortest paths for weighted graphs in the low memory MPC setting.

Data structures and algorithms; Distributed, parallel, and cluster computing

Improved Circular $k$-Mismatch Sketches

Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz (2020)

The shift distance $\mathsf{sh}(S_1,S_2)$ between two strings $S_1$ and $S_2$ of the same length is defined as the minimum Hamming distance between $S_1$ and any rotation (cyclic shift) of $S_2$. We study the problem of sketching the shift distance, which is the following communication complexity problem: Strings $S_1$ and $S_2$ of length $n$ are given to two identical players (encoders), who independently compute sketches (summaries) $\mathtt{sk}(S_1)$ and $\mathtt{sk}(S_2)$, respectively, so that upon receiving the two sketches, a third player (decoder) is able to compute (or approximate) $\mathsf{sh}(S_1,S_2)$ with high probability. This paper primarily focuses on the more general $k$-mismatch version of the problem, where the decoder is allowed to declare a failure if $\mathsf{sh}(S_1,S_2)>k$, where $k$ is a parameter known to all parties. Andoni et al. (STOC'13) introduced exact circular $k$-mismatch sketches of size $\widetilde{O}(k+D(n))$, where $D(n)$ is the number of divisors of $n$. Andoni et al. also showed that their sketch size is optimal in the class of linear homomorphic sketches. We circumvent this lower bound by designing a (non-linear) exact circular $k$-mismatch sketch of size $\widetilde{O}(k)$; this size matches communication-complexity lower bounds. We also design a $(1\pm\varepsilon)$-approximate circular $k$-mismatch sketch of size $\widetilde{O}(\min(\varepsilon^{-2}\sqrt{k}, \varepsilon^{-1.5}\sqrt{n}))$, which improves upon an $\widetilde{O}(\varepsilon^{-2}\sqrt{n})$-size sketch of Crouch and McGregor (APPROX'11).
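The definition of the shift distance given at the start of this abstract can be made concrete with a brute-force computation. This is only a direct evaluation of the quantity being sketched, taking quadratic time; it is not a sketching scheme and not the authors' method. The function names `hamming` and `shift_distance` are chosen here for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def shift_distance(s1: str, s2: str) -> int:
    """Minimum Hamming distance between s1 and any cyclic shift of s2.

    Tries all n rotations of s2, so it runs in O(n^2) time.
    """
    if len(s1) != len(s2):
        raise ValueError("strings must have the same length")
    n = len(s1)
    if n == 0:
        return 0
    # s2[r:] + s2[:r] is s2 rotated left by r positions.
    return min(hamming(s1, s2[r:] + s2[:r]) for r in range(n))
```

In the $k$-mismatch version described above, a decoder would additionally be allowed to report failure whenever this value exceeds $k$.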

Data structures and algorithms

Hierarchical Clustering via Sketches and Hierarchical Correlation Clustering

Danny Vainstein, Vaggos Chatziafratis, Gui Citovsky (2021)

Recently, Hierarchical Clustering (HC) has been considered through the lens of optimization. In particular, two maximization objectives have been defined. Moseley and Wang defined the Revenue objective to handle similarity information given by a weighted graph on the data points (w.l.o.g., $[0,1]$ weights), while Cohen-Addad et al. defined the Dissimilarity objective to handle dissimilarity information. In this paper, we prove structural lemmas for both objectives allowing us to convert any HC tree to a tree with a constant number of internal nodes while incurring an arbitrarily small loss in each objective. Although the best-known approximations are 0.585 and 0.667 respectively, using our lemmas we obtain approximations arbitrarily close to 1, if not all weights are small (i.e., there exist constants $\epsilon, \delta$ such that the fraction of weights smaller than $\delta$ is at most $1 - \epsilon$); such instances encompass many metric-based similarity instances, thereby improving upon prior work. Finally, we introduce Hierarchical Correlation Clustering (HCC) to handle instances that contain similarity and dissimilarity information simultaneously. For HCC, we provide an approximation of 0.4767 and for complementary similarity/dissimilarity weights (analogous to $+/-$ correlation clustering), we again present nearly-optimal approximations.

Data structures and algorithms

Constructions from Dots and Lines

Marko A. Rodriguez, Peter Neubauer (2010)

A graph is a data structure composed of dots (i.e. vertices) and lines (i.e. edges). The dots and lines of a graph can be organized into intricate arrangements. The ability of a graph to denote objects and their relationships to one another allows for a surprisingly large number of things to be modeled as a graph. From the dependencies that link software packages to the wood beams that provide the framing to a house, most anything has a corresponding graph representation. However, just because it is possible to represent something as a graph does not necessarily mean that its graph representation will be useful. If a modeler can leverage the plethora of tools and algorithms that store and process graphs, then such a mapping is worthwhile. This article explores the world of graphs in computing and exposes situations in which graphical models are beneficial.

Data structures and algorithms