ترغب بنشر مسار تعليمي؟ اضغط هنا

A New Coreset Framework for Clustering

95   0   0.0 ( 0 )
 نشر من قبل David Saulpic
 تاريخ النشر 2021
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Given a metric space, the $(k,z)$-clustering problem consists of finding $k$ centers such that the sum of the of distances raised to the power $z$ of every point to its closest center is minimized. This encapsulates the famous $k$-median ($z=1$) and $k$-means ($z=2$) clustering problems. Designing small-space sketches of the data that approximately preserves the cost of the solutions, also known as emph{coresets}, has been an important research direction over the last 15 years. In this paper, we present a new, simple coreset framework that simultaneously improves upon the best known bounds for a large variety of settings, ranging from Euclidean space, doubling metric, minor-free metric, and the general metric cases.

قيم البحث

اقرأ أيضاً

Let $P$ be a set (called points), $Q$ be a set (called queries) and a function $ f:Ptimes Qto [0,infty)$ (called cost). For an error parameter $epsilon>0$, a set $Ssubseteq P$ with a emph{weight function} $w:P rightarrow [0,infty)$ is an $epsilon$-co reset if $sum_{sin S}w(s) f(s,q)$ approximates $sum_{pin P} f(p,q)$ up to a multiplicative factor of $1pmepsilon$ for every given query $qin Q$. We construct coresets for the $k$-means clustering of $n$ input points, both in an arbitrary metric space and $d$-dimensional Euclidean space. For Euclidean space, we present the first coreset whose size is simultaneously independent of both $d$ and $n$. In particular, this is the first coreset of size $o(n)$ for a stream of $n$ sparse points in a $d ge n$ dimensional space (e.g. adjacency matrices of graphs). We also provide the first generalizations of such coresets for handling outliers. For arbitrary metric spaces, we improve the dependence on $k$ to $k log k$ and present a matching lower bound. For $M$-estimator clustering (special cases include the well-known $k$-median and $k$-means clustering), we introduce a new technique for converting an offline coreset construction to the streaming setting. Our method yields streaming coreset algorithms requiring the storage of $O(S + k log n)$ points, where $S$ is the size of the offline coreset. In comparison, the previous state-of-the-art was the merge-and-reduce technique that required $O(S log^{2a+1} n)$ points, where $a$ is the exponent in the offline constructions dependence on $epsilon^{-1}$. For example, combining our offline and streaming results, we produce a streaming metric $k$-means coreset algorithm using $O(epsilon^{-2} k log k log n)$ points of storage. The previous state-of-the-art required $O(epsilon^{-4} k log k log^{6} n)$ points.
168 - Wai Ming Tai 2020
Given a point set $Psubset mathbb{R}^d$, a kernel density estimation for Gaussian kernel is defined as $overline{mathcal{G}}_P(x) = frac{1}{left|Pright|}sum_{pin P}e^{-leftlVert x-p rightrVert^2}$ for any $xinmathbb{R}^d$. We study how to construct a small subset $Q$ of $P$ such that the kernel density estimation of $P$ can be approximated by the kernel density estimation of $Q$. This subset $Q$ is called coreset. The primary technique in this work is to construct $pm 1$ coloring on the point set $P$ by the discrepancy theory and apply this coloring algorithm recursively. Our result leverages Banaszczyks Theorem. When $d>1$ is constant, our construction gives a coreset of size $Oleft(frac{1}{varepsilon}right)$ as opposed to the best-known result of $Oleft(frac{1}{varepsilon}sqrt{logfrac{1}{varepsilon}}right)$. It is the first to give a breakthrough on the barrier of $sqrt{log}$ factor even when $d=2$.
Many quantum algorithms for machine learning require access to classical data in superposition. However, for many natural data sets and algorithms, the overhead required to load the data set in superposition can erase any potential quantum speedup ov er classical algorithms. Recent work by Harrow introduces a new paradigm in hybrid quantum-classical computing to address this issue, relying on coresets to minimize the data loading overhead of quantum algorithms. We investigate using this paradigm to perform $k$-means clustering on near-term quantum computers, by casting it as a QAOA optimization instance over a small coreset. We compare the performance of this approach to classical $k$-means clustering both numerically and experimentally on IBM Q hardware. We are able to find data sets where coresets work well relative to random sampling and where QAOA could potentially outperform standard $k$-means on a coreset. However, finding data sets where both coresets and QAOA work well--which is necessary for a quantum advantage over $k$-means on the entire data set--appears to be challenging.
63 - Kui Zhao , Junhao Hua , Ling Yan 2019
While marketing budget allocation has been studied for decades in traditional business, nowadays online business brings much more challenges due to the dynamic environment and complex decision-making process. In this paper, we present a novel unified framework for marketing budget allocation. By leveraging abundant data, the proposed data-driven approach can help us to overcome the challenges and make more informed decisions. In our approach, a semi-black-box model is built to forecast the dynamic market response and an efficient optimization method is proposed to solve the complex allocation task. First, the response in each market-segment is forecasted by exploring historical data through a semi-black-box model, where the capability of logit demand curve is enhanced by neural networks. The response model reveals relationship between sales and marketing cost. Based on the learned model, budget allocation is then formulated as an optimization problem, and we design efficient algorithms to solve it in both continuous and discrete settings. Several kinds of business constraints are supported in one unified optimization paradigm, including cost upper bound, profit lower bound, or ROI lower bound. The proposed framework is easy to implement and readily to handle large-scale problems. It has been successfully applied to many scenarios in Alibaba Group. The results of both offline experiments and online A/B testing demonstrate its effectiveness.
In this paper we provide a general framework for estimating symmetric properties of distributions from i.i.d. samples. For a broad class of symmetric properties we identify the easy region where empirical estimation works and the difficult region whe re more complex estimators are required. We show that by approximately computing the profile maximum likelihood (PML) distribution cite{ADOS16} in this difficult region we obtain a symmetric property estimation framework that is sample complexity optimal for many properties in a broader parameter regime than previous universal estimation approaches based on PML. The resulting algorithms based on these pseudo PML distributions are also more practical.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا