Fast Exact Max-Kernel Search

439 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Parikshit Ram

تاريخ النشر 2012

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Ryan R. Curtin - Parikshit Ram - Alexander G. Gray

بنى وهياكل البيانات والخوارزميات استرجاع المعلومات التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The wide applicability of kernels makes the problem of max-kernel search ubiquitous and more general than the usual similarity search in metric spaces. We focus on solving this problem efficiently. We begin by characterizing the inherent hardness of the max-kernel search problem with a novel notion of directional concentration. Following that, we present a method to use an $O(n log n)$ algorithm to index any set of objects (points in $Real^dims$ or abstract objects) directly in the Hilbert space without any explicit feature representations of the objects in this space. We present the first provably $O(log n)$ algorithm for exact max-kernel search using this index. Empirical results for a variety of data sets as well as abstract objects demonstrate up to 4 orders of magnitude speedup in some cases. Extensions for approximate max-kernel search are also presented.

قيم البحث

اقرأ أيضاً

A Fast Max Flow Algorithm

371 - James B. Orlin , Xiao-Yue Gong 2019

In 2013, Orlin proved that the max flow problem could be solved in $O(nm)$ time. His algorithm ran in $O(nm + m^{1.94})$ time, which was the fastest for graphs with fewer than $n^{1.06}$ arcs. If the graph was not sufficiently sparse, the fastest run ning time was an algorithm due to King, Rao, and Tarjan. We describe a new variant of the excess scaling algorithm for the max flow problem whose running time strictly dominates the running time of the algorithm by King et al. Moreover, for graphs in which $m = O(n log n)$, the running time of our algorithm dominates that of King et al. by a factor of $O(loglog n)$.

بنى وهياكل البيانات والخوارزميات التعقيد الحسابي

Fast Exact Algorithms Using Hadamard Product of Polynomials

61 - V.Arvind , Abhranil Chatterjee , Rajit Datta 2018

Let $C$ be an arithmetic circuit of $poly(n)$ size given as input that computes a polynomial $finmathbb{F}[X]$, where $X={x_1,x_2,ldots,x_n}$ and $mathbb{F}$ is any field where the field arithmetic can be performed efficiently. We obtain new algorith ms for the following two problems first studied by Koutis and Williams cite{Kou08, Wi09, KW16}. k-MLC: Compute the sum of the coefficients of all degree-$k$ multilinear monomials in the polynomial $f$. k-MMD: Test if there is a nonzero degree-$k$ multilinear monomial in the polynomial $f$. Our algorithms are based on the fact that the Hadamard product $fcirc S_{n,k}$, is the degree-$k$ multilinear part of $f$, where $S_{n,k}$ is the $k^{th}$ elementary symmetric polynomial. 1. For k-MLC problem, we give a deterministic algorithm of run time $O^*(n^{k/2+clog k})$ (where $c$ is a constant), answering an open question of Koutis and Williams cite[ICALP09]{KW16}. As corollaries, we show $O^*(binom{n}{downarrow k/2})$-time exact counting algorithms for several combinatorial problems: k-Tree, t-Dominating Set, m-Dimensional k-Matching. 2. For k-MMD problem, we give a randomized algorithm of run time $4.32^kcdot poly(n,k)$. Our algorithm uses only $poly(n,k)$ space. This matches the run time of a recent algorithm cite{BDH18} for $k-MMD$ which requires exponential (in $k$) space. Other results include fast deterministic algorithms for k-MLC and k-MMD problems for depth three circuits.

بنى وهياكل البيانات والخوارزميات

Kernel Density Estimation through Density Constrained Near Neighbor Search

117 - Moses Charikar , Michael Kapralov , Navid Nouri 2020

In this paper we revisit the kernel density estimation problem: given a kernel $K(x, y)$ and a dataset of $n$ points in high dimensional Euclidean space, prepare a data structure that can quickly output, given a query $q$, a $(1+epsilon)$-approximati on to $mu:=frac1{|P|}sum_{pin P} K(p, q)$. First, we give a single data structure based on classical near neighbor search techniques that improves upon or essentially matches the query time and space complexity for all radial kernels considered in the literature so far. We then show how to improve both the query complexity and runtime by using recent advances in data-dependent near neighbor search. We achieve our results by giving a new implementation of the natural importance sampling scheme. Unlike previous approaches, our algorithm first samples the dataset uniformly (considering a geometric sequence of sampling rates), and then uses existing approximate near neighbor search techniques on the resulting smaller dataset to retrieve the sampled points that lie at an appropriate distance from the query. We show that the resulting sampled dataset has strong geometric structure, making approximate near neighbor search return the required samples much more efficiently than for worst case datasets of the same size. As an example application, we show that this approach yields a data structure that achieves query time $mu^{-(1+o(1))/4}$ and space complexity $mu^{-(1+o(1))}$ for the Gaussian kernel. Our data dependent approach achieves query time $mu^{-0.173-o(1)}$ and space $mu^{-(1+o(1))}$ for the Gaussian kernel. The data dependent analysis relies on new techniques for tracking the geometric structure of the input datasets in a recursive hashing process that we hope will be of interest in other applications in near neighbor search.

بنى وهياكل البيانات والخوارزميات

Fast multivariate empirical cumulative distribution function with connection to kernel density estimation

62 - Nicolas Langrene , Xavier Warin 2020

This paper revisits the problem of computing empirical cumulative distribution functions (ECDF) efficiently on large, multivariate datasets. Computing an ECDF at one evaluation point requires $mathcal{O}(N)$ operations on a dataset composed of $N$ da ta points. Therefore, a direct evaluation of ECDFs at $N$ evaluation points requires a quadratic $mathcal{O}(N^2)$ operations, which is prohibitive for large-scale problems. Two fast and exact methods are proposed and compared. The first one is based on fast summation in lexicographical order, with a $mathcal{O}(N{log}N)$ complexity and requires the evaluation points to lie on a regular grid. The second one is based on the divide-and-conquer principle, with a $mathcal{O}(Nlog(N)^{(d-1){vee}1})$ complexity and requires the evaluation points to coincide with the input points. The two fast algorithms are described and detailed in the general $d$-dimensional case, and numerical experiments validate their speed and accuracy. Secondly, the paper establishes a direct connection between cumulative distribution functions and kernel density estimation (KDE) for a large class of kernels. This connection paves the way for fast exact algorithms for multivariate kernel density estimation and kernel regression. Numerical tests with the Laplacian kernel validate the speed and accuracy of the proposed algorithms. A broad range of large-scale multivariate density estimation, cumulative distribution estimation, survival function estimation and regression problems can benefit from the proposed numerical methods.

بنى وهياكل البيانات والخوارزميات حساب

Optimal Coreset for Gaussian Kernel Density Estimation

168 - Wai Ming Tai 2020

Given a point set $Psubset mathbb{R}^d$, a kernel density estimation for Gaussian kernel is defined as $overline{mathcal{G}}_P(x) = frac{1}{left|Pright|}sum_{pin P}e^{-leftlVert x-p rightrVert^2}$ for any $xinmathbb{R}^d$. We study how to construct a small subset $Q$ of $P$ such that the kernel density estimation of $P$ can be approximated by the kernel density estimation of $Q$. This subset $Q$ is called coreset. The primary technique in this work is to construct $pm 1$ coloring on the point set $P$ by the discrepancy theory and apply this coloring algorithm recursively. Our result leverages Banaszczyks Theorem. When $d>1$ is constant, our construction gives a coreset of size $Oleft(frac{1}{varepsilon}right)$ as opposed to the best-known result of $Oleft(frac{1}{varepsilon}sqrt{logfrac{1}{varepsilon}}right)$. It is the first to give a breakthrough on the barrier of $sqrt{log}$ factor even when $d=2$.

بنى وهياكل البيانات والخوارزميات الهندسة الحسابية التعلم الآلي