Improved Coresets for Kernel Density Estimates

116 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Wai Ming Tai

تاريخ النشر 2017

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Jeff M. Phillips - Wai Ming Tai

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We study the construction of coresets for kernel density estimates. That is we show how to approximate the kernel density estimate described by a large point set with another kernel density estimate with a much smaller point set. For characteristic kernels (including Gaussian and Laplace kernels), our approximation preserves the $L_infty$ error between kernel density estimates within error $epsilon$, with coreset size $2/epsilon^2$, but no other aspects of the data, including the dimension, the diameter of the point set, or the bandwidth of the kernel common to other approximations. When the dimension is unrestricted, we show this bound is tight for these kernels as well as a much broader set. This work provides a careful analysis of the iterative Frank-Wolfe algorithm adapted to this context, an algorithm called emph{kernel herding}. This analysis unites a broad line of work that spans statistics, machine learning, and geometry. When the dimension $d$ is constant, we demonstrate much tighter bounds on the size of the coreset specifically for Gaussian kernels, showing that it is bounded by the size of the coreset for axis-aligned rectangles. Currently the best known constructive bound is $O(frac{1}{epsilon} log^d frac{1}{epsilon})$, and non-constructively, this can be improved by $sqrt{log frac{1}{epsilon}}$. This improves the best constant dimension bounds polynomially for $d geq 3$.

قيم البحث

238 - Jeff M. Phillips , Wai Ming Tai 2018

We construct near-optimal coresets for kernel density estimates for points in $mathbb{R}^d$ when the kernel is positive definite. Specifically we show a polynomial time construction for a coreset of size $O(sqrt{d}/varepsiloncdot sqrt{log 1/varepsilo n} )$, and we show a near-matching lower bound of size $Omega(min{sqrt{d}/varepsilon, 1/varepsilon^2})$. When $dgeq 1/varepsilon^2$, it is known that the size of coreset can be $O(1/varepsilon^2)$. The upper bound is a polynomial-in-$(1/varepsilon)$ improvement when $d in [3,1/varepsilon^2)$ and the lower bound is the first known lower bound to depend on $d$ for this problem. Moreover, the upper bound restriction that the kernel is positive definite is significant in that it applies to a wide-variety of kernels, specifically those most important for machine learning. This includes kernels for information distances and the sinc kernel which can be negative.

التعلم الآلي الهندسة الحسابية التعلم الالي

The GaussianSketch for Almost Relative Error Kernel Distance

92 - Jeff M. Phillips , Wai Ming Tai 2018

We introduce t

التعلم الآلي الهندسة الحسابية التعلم الالي

Optimal Coreset for Gaussian Kernel Density Estimation

168 - Wai Ming Tai 2020

Given a point set $Psubset mathbb{R}^d$, a kernel density estimation for Gaussian kernel is defined as $overline{mathcal{G}}_P(x) = frac{1}{left|Pright|}sum_{pin P}e^{-leftlVert x-p rightrVert^2}$ for any $xinmathbb{R}^d$. We study how to construct a small subset $Q$ of $P$ such that the kernel density estimation of $P$ can be approximated by the kernel density estimation of $Q$. This subset $Q$ is called coreset. The primary technique in this work is to construct $pm 1$ coloring on the point set $P$ by the discrepancy theory and apply this coloring algorithm recursively. Our result leverages Banaszczyks Theorem. When $d>1$ is constant, our construction gives a coreset of size $Oleft(frac{1}{varepsilon}right)$ as opposed to the best-known result of $Oleft(frac{1}{varepsilon}sqrt{logfrac{1}{varepsilon}}right)$. It is the first to give a breakthrough on the barrier of $sqrt{log}$ factor even when $d=2$.

بنى وهياكل البيانات والخوارزميات الهندسة الحسابية التعلم الآلي

Finding the Mode of a Kernel Density Estimate

126 - Jasper C.H. Lee , Jerry Li , Christopher Musco 2019

Given points $p_1, dots, p_n$ in $mathbb{R}^d$, how do we find a point $x$ which maximizes $frac{1}{n} sum_{i=1}^n e^{-|p_i - x|^2}$? In other words, how do we find the maximizing point, or mode of a Gaussian kernel density estimation (KDE) centered at $p_1, dots, p_n$? Given the power of KDEs in representing probability distributions and other continuous functions, the basic mode finding problem is widely applicable. However, it is poorly understood algorithmically. Few provable algorithms are known, so practitioners rely on heuristics like the mean-shift algorithm, which are not guaranteed to find a global optimum. We address this challenge by providing fast and provably accurate approximation algorithms for mode finding in both the low and high dimensional settings. For low dimension $d$, our main contribution is to reduce the mode finding problem to a solving a small number of systems of polynomial inequalities. For high dimension $d$, we prove the first dimensionality reduction result for KDE mode finding, which allows for reduction to the low dimensional case. Our result leverages Johnson-Lindenstrauss random projection, Kirszbrauns classic extension theorem, and perhaps surprisingly, the mean-shift heuristic for mode finding.

بنى وهياكل البيانات والخوارزميات الهندسة الحسابية

Coresets for Data-efficient Training of Machine Learning Models

99 - Baharan Mirzasoleiman , Jeff Bilmes , Jure Leskovec 2019

Incremental gradient (IG) methods, such as stochastic gradient descent and its variants are commonly used for large scale optimization in machine learning. Despite the sustained effort to make IG methods more data-efficient, it remains an open questi on how to select a training data subset that can theoretically and practically perform on par with the full dataset. Here we develop CRAIG, a method to select a weighted subset (or coreset) of training data that closely estimates the full gradient by maximizing a submodular function. We prove that applying IG to this subset is guaranteed to converge to the (near)optimal solution with the same convergence rate as that of IG for convex optimization. As a result, CRAIG achieves a speedup that is inversely proportional to the size of the subset. To our knowledge, this is the first rigorous method for data-efficient training of general machine learning models. Our extensive set of experiments show that CRAIG, while achieving practically the same solution, speeds up various IG methods by up to 6x for logistic regression and 3x for training deep neural networks.

التعلم الآلي الذكاء الاصطناعي التعلم الالي