Optimal Coreset for Gaussian Kernel Density Estimation

169 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Wai Ming Tai

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Wai Ming Tai

بنى وهياكل البيانات والخوارزميات الهندسة الحسابية التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Given a point set $Psubset mathbb{R}^d$, a kernel density estimation for Gaussian kernel is defined as $overline{mathcal{G}}_P(x) = frac{1}{left|Pright|}sum_{pin P}e^{-leftlVert x-p rightrVert^2}$ for any $xinmathbb{R}^d$. We study how to construct a small subset $Q$ of $P$ such that the kernel density estimation of $P$ can be approximated by the kernel density estimation of $Q$. This subset $Q$ is called coreset. The primary technique in this work is to construct $pm 1$ coloring on the point set $P$ by the discrepancy theory and apply this coloring algorithm recursively. Our result leverages Banaszczyks Theorem. When $d>1$ is constant, our construction gives a coreset of size $Oleft(frac{1}{varepsilon}right)$ as opposed to the best-known result of $Oleft(frac{1}{varepsilon}sqrt{logfrac{1}{varepsilon}}right)$. It is the first to give a breakthrough on the barrier of $sqrt{log}$ factor even when $d=2$.

قيم البحث

126 - Jasper C.H. Lee , Jerry Li , Christopher Musco 2019

Given points $p_1, dots, p_n$ in $mathbb{R}^d$, how do we find a point $x$ which maximizes $frac{1}{n} sum_{i=1}^n e^{-|p_i - x|^2}$? In other words, how do we find the maximizing point, or mode of a Gaussian kernel density estimation (KDE) centered at $p_1, dots, p_n$? Given the power of KDEs in representing probability distributions and other continuous functions, the basic mode finding problem is widely applicable. However, it is poorly understood algorithmically. Few provable algorithms are known, so practitioners rely on heuristics like the mean-shift algorithm, which are not guaranteed to find a global optimum. We address this challenge by providing fast and provably accurate approximation algorithms for mode finding in both the low and high dimensional settings. For low dimension $d$, our main contribution is to reduce the mode finding problem to a solving a small number of systems of polynomial inequalities. For high dimension $d$, we prove the first dimensionality reduction result for KDE mode finding, which allows for reduction to the low dimensional case. Our result leverages Johnson-Lindenstrauss random projection, Kirszbrauns classic extension theorem, and perhaps surprisingly, the mean-shift heuristic for mode finding.

بنى وهياكل البيانات والخوارزميات الهندسة الحسابية

Near-Optimal Coresets of Kernel Density Estimates

238 - Jeff M. Phillips , Wai Ming Tai 2018

We construct near-optimal coresets for kernel density estimates for points in $mathbb{R}^d$ when the kernel is positive definite. Specifically we show a polynomial time construction for a coreset of size $O(sqrt{d}/varepsiloncdot sqrt{log 1/varepsilo n} )$, and we show a near-matching lower bound of size $Omega(min{sqrt{d}/varepsilon, 1/varepsilon^2})$. When $dgeq 1/varepsilon^2$, it is known that the size of coreset can be $O(1/varepsilon^2)$. The upper bound is a polynomial-in-$(1/varepsilon)$ improvement when $d in [3,1/varepsilon^2)$ and the lower bound is the first known lower bound to depend on $d$ for this problem. Moreover, the upper bound restriction that the kernel is positive definite is significant in that it applies to a wide-variety of kernels, specifically those most important for machine learning. This includes kernels for information distances and the sinc kernel which can be negative.

التعلم الآلي الهندسة الحسابية التعلم الالي

Kernel Density Estimation through Density Constrained Near Neighbor Search

117 - Moses Charikar , Michael Kapralov , Navid Nouri 2020

In this paper we revisit the kernel density estimation problem: given a kernel $K(x, y)$ and a dataset of $n$ points in high dimensional Euclidean space, prepare a data structure that can quickly output, given a query $q$, a $(1+epsilon)$-approximati on to $mu:=frac1{|P|}sum_{pin P} K(p, q)$. First, we give a single data structure based on classical near neighbor search techniques that improves upon or essentially matches the query time and space complexity for all radial kernels considered in the literature so far. We then show how to improve both the query complexity and runtime by using recent advances in data-dependent near neighbor search. We achieve our results by giving a new implementation of the natural importance sampling scheme. Unlike previous approaches, our algorithm first samples the dataset uniformly (considering a geometric sequence of sampling rates), and then uses existing approximate near neighbor search techniques on the resulting smaller dataset to retrieve the sampled points that lie at an appropriate distance from the query. We show that the resulting sampled dataset has strong geometric structure, making approximate near neighbor search return the required samples much more efficiently than for worst case datasets of the same size. As an example application, we show that this approach yields a data structure that achieves query time $mu^{-(1+o(1))/4}$ and space complexity $mu^{-(1+o(1))}$ for the Gaussian kernel. Our data dependent approach achieves query time $mu^{-0.173-o(1)}$ and space $mu^{-(1+o(1))}$ for the Gaussian kernel. The data dependent analysis relies on new techniques for tracking the geometric structure of the input datasets in a recursive hashing process that we hope will be of interest in other applications in near neighbor search.

بنى وهياكل البيانات والخوارزميات

Improved Coresets for Kernel Density Estimates

115 - Jeff M. Phillips , Wai Ming Tai 2017

We study the construction of coresets for kernel density estimates. That is we show how to approximate the kernel density estimate described by a large point set with another kernel density estimate with a much smaller point set. For characteristic k ernels (including Gaussian and Laplace kernels), our approximation preserves the $L_infty$ error between kernel density estimates within error $epsilon$, with coreset size $2/epsilon^2$, but no other aspects of the data, including the dimension, the diameter of the point set, or the bandwidth of the kernel common to other approximations. When the dimension is unrestricted, we show this bound is tight for these kernels as well as a much broader set. This work provides a careful analysis of the iterative Frank-Wolfe algorithm adapted to this context, an algorithm called emph{kernel herding}. This analysis unites a broad line of work that spans statistics, machine learning, and geometry. When the dimension $d$ is constant, we demonstrate much tighter bounds on the size of the coreset specifically for Gaussian kernels, showing that it is bounded by the size of the coreset for axis-aligned rectangles. Currently the best known constructive bound is $O(frac{1}{epsilon} log^d frac{1}{epsilon})$, and non-constructively, this can be improved by $sqrt{log frac{1}{epsilon}}$. This improves the best constant dimension bounds polynomially for $d geq 3$.

التعلم الآلي الهندسة الحسابية التعلم الالي

Fast multivariate empirical cumulative distribution function with connection to kernel density estimation

62 - Nicolas Langrene , Xavier Warin 2020

This paper revisits the problem of computing empirical cumulative distribution functions (ECDF) efficiently on large, multivariate datasets. Computing an ECDF at one evaluation point requires $mathcal{O}(N)$ operations on a dataset composed of $N$ da ta points. Therefore, a direct evaluation of ECDFs at $N$ evaluation points requires a quadratic $mathcal{O}(N^2)$ operations, which is prohibitive for large-scale problems. Two fast and exact methods are proposed and compared. The first one is based on fast summation in lexicographical order, with a $mathcal{O}(N{log}N)$ complexity and requires the evaluation points to lie on a regular grid. The second one is based on the divide-and-conquer principle, with a $mathcal{O}(Nlog(N)^{(d-1){vee}1})$ complexity and requires the evaluation points to coincide with the input points. The two fast algorithms are described and detailed in the general $d$-dimensional case, and numerical experiments validate their speed and accuracy. Secondly, the paper establishes a direct connection between cumulative distribution functions and kernel density estimation (KDE) for a large class of kernels. This connection paves the way for fast exact algorithms for multivariate kernel density estimation and kernel regression. Numerical tests with the Laplacian kernel validate the speed and accuracy of the proposed algorithms. A broad range of large-scale multivariate density estimation, cumulative distribution estimation, survival function estimation and regression problems can benefit from the proposed numerical methods.

بنى وهياكل البيانات والخوارزميات حساب