New Frameworks for Offline and Streaming Coreset Constructions

120 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Harry Lang

تاريخ النشر 2016

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Vladimir Braverman - Dan Feldman - Harry Lang

بنى وهياكل البيانات والخوارزميات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Let $P$ be a set (called points), $Q$ be a set (called queries) and a function $ f:Ptimes Qto [0,infty)$ (called cost). For an error parameter $epsilon>0$, a set $Ssubseteq P$ with a emph{weight function} $w:P rightarrow [0,infty)$ is an $epsilon$-coreset if $sum_{sin S}w(s) f(s,q)$ approximates $sum_{pin P} f(p,q)$ up to a multiplicative factor of $1pmepsilon$ for every given query $qin Q$. We construct coresets for the $k$-means clustering of $n$ input points, both in an arbitrary metric space and $d$-dimensional Euclidean space. For Euclidean space, we present the first coreset whose size is simultaneously independent of both $d$ and $n$. In particular, this is the first coreset of size $o(n)$ for a stream of $n$ sparse points in a $d ge n$ dimensional space (e.g. adjacency matrices of graphs). We also provide the first generalizations of such coresets for handling outliers. For arbitrary metric spaces, we improve the dependence on $k$ to $k log k$ and present a matching lower bound. For $M$-estimator clustering (special cases include the well-known $k$-median and $k$-means clustering), we introduce a new technique for converting an offline coreset construction to the streaming setting. Our method yields streaming coreset algorithms requiring the storage of $O(S + k log n)$ points, where $S$ is the size of the offline coreset. In comparison, the previous state-of-the-art was the merge-and-reduce technique that required $O(S log^{2a+1} n)$ points, where $a$ is the exponent in the offline constructions dependence on $epsilon^{-1}$. For example, combining our offline and streaming results, we produce a streaming metric $k$-means coreset algorithm using $O(epsilon^{-2} k log k log n)$ points of storage. The previous state-of-the-art required $O(epsilon^{-4} k log k log^{6} n)$ points.

قيم البحث

اقرأ أيضاً

A New Coreset Framework for Clustering

94 - Vincent Cohen-Addad , David Saulpic , Chris Schwiegelshohn 2021

Given a metric space, the $(k,z)$-clustering problem consists of finding $k$ centers such that the sum of the of distances raised to the power $z$ of every point to its closest center is minimized. This encapsulates the famous $k$-median ($z=1$) and $k$-means ($z=2$) clustering problems. Designing small-space sketches of the data that approximately preserves the cost of the solutions, also known as emph{coresets}, has been an important research direction over the last 15 years. In this paper, we present a new, simple coreset framework that simultaneously improves upon the best known bounds for a large variety of settings, ranging from Euclidean space, doubling metric, minor-free metric, and the general metric cases.

بنى وهياكل البيانات والخوارزميات

Practical Coreset Constructions for Machine Learning

170 - Olivier Bachem , Mario Lucic , Andreas Krause 2017

We investigate coresets - succinct, small summaries of large data sets - so that solutions found on the summary are provably competitive with solution found on the full data set. We provide an overview over the state-of-the-art in coreset constructio n for machine learning. In Section 2, we present both the intuition behind and a theoretically sound framework to construct coresets for general problems and apply it to $k$-means clustering. In Section 3 we summarize existing coreset construction algorithms for a variety of machine learning problems such as maximum likelihood estimation of mixture models, Bayesian non-parametric models, principal component analysis, regression and general empirical risk minimization.

التعلم الالي

Optimal Coreset for Gaussian Kernel Density Estimation

168 - Wai Ming Tai 2020

Given a point set $Psubset mathbb{R}^d$, a kernel density estimation for Gaussian kernel is defined as $overline{mathcal{G}}_P(x) = frac{1}{left|Pright|}sum_{pin P}e^{-leftlVert x-p rightrVert^2}$ for any $xinmathbb{R}^d$. We study how to construct a small subset $Q$ of $P$ such that the kernel density estimation of $P$ can be approximated by the kernel density estimation of $Q$. This subset $Q$ is called coreset. The primary technique in this work is to construct $pm 1$ coloring on the point set $P$ by the discrepancy theory and apply this coloring algorithm recursively. Our result leverages Banaszczyks Theorem. When $d>1$ is constant, our construction gives a coreset of size $Oleft(frac{1}{varepsilon}right)$ as opposed to the best-known result of $Oleft(frac{1}{varepsilon}sqrt{logfrac{1}{varepsilon}}right)$. It is the first to give a breakthrough on the barrier of $sqrt{log}$ factor even when $d=2$.

بنى وهياكل البيانات والخوارزميات الهندسة الحسابية التعلم الآلي

Improved Guarantees for Offline Stochastic Matching via new Ordered Contention Resolution Schemes

76 - Brian Brubach , Nathaniel Grammel , Will Ma 2021

Matching is one of the most fundamental and broadly applicable problems across many domains. In these diverse real-world applications, there is often a degree of uncertainty in the input which has led to the study of stochastic matching models. Here, each edge in the graph has a known, independent probability of existing derived from some prediction. Algorithms must probe edges to determine existence and match them irrevocably if they exist. Further, each vertex may have a patience constraint denoting how many of its neighboring edges can be probed. We present new ordered contention resolution schemes yielding improved approximation guarantees for some of the foundational problems studied in this area. For stochastic matching with patience constraints in general graphs, we provide a 0.382-approximate algorithm, significantly improving over the previous best 0.31-approximation of Baveja et al. (2018). When the vertices do not have patience constraints, we describe a 0.432-approximate random order probing algorithm with several corollaries such as an improved guarantee for the Prophet Secretary problem under Edge Arrivals. Finally, for the special case of bipartite graphs with unit patience constraints on one of the partitions, we show a 0.632-approximate algorithm that improves on the recent $1/3$-guarantee of Hikima et al. (2021).

بنى وهياكل البيانات والخوارزميات الرياضيات المتقطعة

Constructions from Dots and Lines

394 - Marko A. Rodriguez , Peter Neubauer 2010

A graph is a data structure composed of dots (i.e. vertices) and lines (i.e. edges). The dots and lines of a graph can be organized into intricate arrangements. The ability for a graph to denote objects and their relationships to one another allow fo r a surprisingly large number of things to be modeled as a graph. From the dependencies that link software packages to the wood beams that provide the framing to a house, most anything has a corresponding graph representation. However, just because it is possible to represent something as a graph does not necessarily mean that its graph representation will be useful. If a modeler can leverage the plethora of tools and algorithms that store and process graphs, then such a mapping is worthwhile. This article explores the world of graphs in computing and exposes situations in which graphical models are beneficial.

بنى وهياكل البيانات والخوارزميات

سجل دخول لتتمكن من نشر تعليقات