Kernel Density Estimation through Density Constrained Near Neighbor Search

118 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Navid Nouri

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Moses Charikar - Michael Kapralov - Navid Nouri

بنى وهياكل البيانات والخوارزميات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

In this paper we revisit the kernel density estimation problem: given a kernel $K(x, y)$ and a dataset of $n$ points in high dimensional Euclidean space, prepare a data structure that can quickly output, given a query $q$, a $(1+epsilon)$-approximation to $mu:=frac1{|P|}sum_{pin P} K(p, q)$. First, we give a single data structure based on classical near neighbor search techniques that improves upon or essentially matches the query time and space complexity for all radial kernels considered in the literature so far. We then show how to improve both the query complexity and runtime by using recent advances in data-dependent near neighbor search. We achieve our results by giving a new implementation of the natural importance sampling scheme. Unlike previous approaches, our algorithm first samples the dataset uniformly (considering a geometric sequence of sampling rates), and then uses existing approximate near neighbor search techniques on the resulting smaller dataset to retrieve the sampled points that lie at an appropriate distance from the query. We show that the resulting sampled dataset has strong geometric structure, making approximate near neighbor search return the required samples much more efficiently than for worst case datasets of the same size. As an example application, we show that this approach yields a data structure that achieves query time $mu^{-(1+o(1))/4}$ and space complexity $mu^{-(1+o(1))}$ for the Gaussian kernel. Our data dependent approach achieves query time $mu^{-0.173-o(1)}$ and space $mu^{-(1+o(1))}$ for the Gaussian kernel. The data dependent analysis relies on new techniques for tracking the geometric structure of the input datasets in a recursive hashing process that we hope will be of interest in other applications in near neighbor search.

قيم البحث

168 - Wai Ming Tai 2020

Given a point set $Psubset mathbb{R}^d$, a kernel density estimation for Gaussian kernel is defined as $overline{mathcal{G}}_P(x) = frac{1}{left|Pright|}sum_{pin P}e^{-leftlVert x-p rightrVert^2}$ for any $xinmathbb{R}^d$. We study how to construct a small subset $Q$ of $P$ such that the kernel density estimation of $P$ can be approximated by the kernel density estimation of $Q$. This subset $Q$ is called coreset. The primary technique in this work is to construct $pm 1$ coloring on the point set $P$ by the discrepancy theory and apply this coloring algorithm recursively. Our result leverages Banaszczyks Theorem. When $d>1$ is constant, our construction gives a coreset of size $Oleft(frac{1}{varepsilon}right)$ as opposed to the best-known result of $Oleft(frac{1}{varepsilon}sqrt{logfrac{1}{varepsilon}}right)$. It is the first to give a breakthrough on the barrier of $sqrt{log}$ factor even when $d=2$.

بنى وهياكل البيانات والخوارزميات الهندسة الحسابية التعلم الآلي

Lower Bounds for Oblivious Near-Neighbor Search

101 - Kasper Green Larsen , Tal Malkin , Omri Weinstein 2019

We prove an $Omega(d lg n/ (lglg n)^2)$ lower bound on the dynamic cell-probe complexity of statistically $mathit{oblivious}$ approximate-near-neighbor search ($mathsf{ANN}$) over the $d$-dimensional Hamming cube. For the natural setting of $d = Thet a(log n)$, our result implies an $tilde{Omega}(lg^2 n)$ lower bound, which is a quadratic improvement over the highest (non-oblivious) cell-probe lower bound for $mathsf{ANN}$. This is the first super-logarithmic $mathit{unconditional}$ lower bound for $mathsf{ANN}$ against general (non black-box) data structures. We also show that any oblivious $mathit{static}$ data structure for decomposable search problems (like $mathsf{ANN}$) can be obliviously dynamized with $O(log n)$ overhead in update and query time, strengthening a classic result of Bentley and Saxe (Algorithmica, 1980).

بنى وهياكل البيانات والخوارزميات التشفير والأمن

Practical Near Neighbor Search via Group Testing

165 - Joshua Engels , Benjamin Coleman , Anshumali Shrivastava 2021

We present a new algorithm for the approximate near neighbor problem that combines classical ideas from group testing with locality-sensitive hashing (LSH). We reduce the near neighbor search problem to a group testing problem by designating neighbor s as positives, non-neighbors as negatives, and approximate membership queries as group tests. We instantiate this framework using distance-sensitive Bloom Filters to Identify Near-Neighbor Groups (FLINNG). We prove that FLINNG has sub-linear query time and show that our algorithm comes with a variety of practical advantages. For example, FLINNG can be constructed in a single pass through the data, consists entirely of efficient integer operations, and does not require any distance computations. We conduct large-scale experiments on high-dimensional search tasks such as genome search, URL similarity search, and embedding search over the massive YFCC100M dataset. In our comparison with leading algorithms such as HNSW and FAISS, we find that FLINNG can provide up to a 10x query speedup with substantially smaller indexing time and memory.

بنى وهياكل البيانات والخوارزميات

Fast multivariate empirical cumulative distribution function with connection to kernel density estimation

62 - Nicolas Langrene , Xavier Warin 2020

This paper revisits the problem of computing empirical cumulative distribution functions (ECDF) efficiently on large, multivariate datasets. Computing an ECDF at one evaluation point requires $mathcal{O}(N)$ operations on a dataset composed of $N$ da ta points. Therefore, a direct evaluation of ECDFs at $N$ evaluation points requires a quadratic $mathcal{O}(N^2)$ operations, which is prohibitive for large-scale problems. Two fast and exact methods are proposed and compared. The first one is based on fast summation in lexicographical order, with a $mathcal{O}(N{log}N)$ complexity and requires the evaluation points to lie on a regular grid. The second one is based on the divide-and-conquer principle, with a $mathcal{O}(Nlog(N)^{(d-1){vee}1})$ complexity and requires the evaluation points to coincide with the input points. The two fast algorithms are described and detailed in the general $d$-dimensional case, and numerical experiments validate their speed and accuracy. Secondly, the paper establishes a direct connection between cumulative distribution functions and kernel density estimation (KDE) for a large class of kernels. This connection paves the way for fast exact algorithms for multivariate kernel density estimation and kernel regression. Numerical tests with the Laplacian kernel validate the speed and accuracy of the proposed algorithms. A broad range of large-scale multivariate density estimation, cumulative distribution estimation, survival function estimation and regression problems can benefit from the proposed numerical methods.

بنى وهياكل البيانات والخوارزميات حساب

Kernel density estimation via diffusion

176 - Z. I. Botev , J. F. Grotowski , D. P. Kroese 2010

We present a new adaptive kernel density estimator based on linear diffusion processes. The proposed estimator builds on existing ideas for adaptive smoothing by incorporating information from a pilot density estimate. In addition, we propose a new p lug-in bandwidth selection method that is free from the arbitrary normal reference rules used by existing methods. We present simulation examples in which the proposed approach outperforms existing methods in terms of accuracy and reliability.

نظرية الإحصاء نظرية الإحصاء

سجل دخول لتتمكن من نشر تعليقات