Graph Reordering for Cache-Efficient Near Neighbor Search

Published by: Benjamin Coleman

Publication date: 2021

Research field: Informatics Engineering

Language: English

Authors: Benjamin Coleman, Santiago Segarra, Anshumali Shrivastava

Data Structures and Algorithms

Abstract

Graph search is one of the most successful algorithmic trends in near neighbor search. Several of the most popular and empirically successful algorithms are, at their core, a simple walk along a pruned near neighbor graph. Such algorithms consistently perform at the top of industrial speed benchmarks for applications such as embedding search. However, graph traversal applications often suffer from poor memory access patterns, and near neighbor search is no exception to this rule. Our measurements show that popular search indices such as the hierarchical navigable small-world graph (HNSW) can have poor cache miss performance. To address this problem, we apply graph reordering algorithms to near neighbor graphs. Graph reordering is a memory layout optimization that groups commonly-accessed nodes together in memory. We present exhaustive experiments applying several reordering algorithms to a leading graph-based near neighbor method based on the HNSW index. We find that reordering improves the query time by up to 40%, and we demonstrate that the time needed to reorder the graph is negligible compared to the time required to construct the index.
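To make the idea of reordering concrete, here is a minimal sketch that relabels the nodes of a small neighbor graph so that nodes visited together sit at nearby indices. It uses a plain breadth-first ordering as a stand-in; the abstract does not name the reordering algorithms the authors evaluate, so the choice of BFS, the function names `bfs_order` and `relabel`, and the list-of-lists graph layout are all assumptions made for illustration.

```python
from collections import deque


def bfs_order(adj, start=0):
    """Return `order`, where order[k] is the old id of the node placed at new position k.

    Illustrative only: BFS is one simple ordering heuristic, not necessarily
    one of the algorithms used in the paper.
    """
    n = len(adj)
    if n == 0:
        return []
    seen = [False] * n
    order = []
    # Visit `start` first, then any nodes that BFS from it could not reach.
    for root in [start] + [v for v in range(n) if v != start]:
        if seen[root]:
            continue
        seen[root] = True
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    queue.append(v)
    return order


def relabel(adj, vectors, order):
    """Permute adjacency lists and per-node data so both follow `order`."""
    new_id = [0] * len(order)
    for new, old in enumerate(order):
        new_id[old] = new
    new_adj = [[new_id[v] for v in adj[old]] for old in order]
    new_vectors = [vectors[old] for old in order]
    return new_adj, new_vectors, new_id


# Toy graph: the path 0 - 3 - 2 - 1, stored with scattered ids.
adj = [[3], [2], [1, 3], [0, 2]]
vectors = ["a", "b", "c", "d"]
order = bfs_order(adj, start=0)
new_adj, new_vectors, new_id = relabel(adj, vectors, order)
```

After relabeling, each hop of a walk along the toy path moves to an adjacent index, which is the property a cache-friendly layout aims for. The graph itself is unchanged; only the numbering and the storage order of the node data differ.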

Deepanshu Kush, Aleksandar Nikolov, Haohua Tang (2021)

A recent series of papers by Andoni, Naor, Nikolov, Razenshteyn, and Waingarten (STOC 2018, FOCS 2018) has given approximate near neighbour search (NNS) data structures for a wide class of distance metrics, including all norms. In particular, these data structures achieve approximation on the order of $p$ for $\ell_p^d$ norms with space complexity nearly linear in the dataset size $n$ and polynomial in the dimension $d$, and query time sub-linear in $n$ and polynomial in $d$. The main shortcoming is the exponential in $d$ pre-processing time required for their construction. In this paper, we describe a more direct framework for constructing NNS data structures for general norms. More specifically, we show via an algorithmic reduction that an efficient NNS data structure for a given metric is implied by an efficient average distortion embedding of it into $\ell_1$ or into Euclidean space. In particular, the resulting data structures require only polynomial pre-processing time, as long as the embedding can be computed in polynomial time. As a concrete instantiation of this framework, we give an NNS data structure for $\ell_p$ with efficient pre-processing that matches the approximation factor, space and query complexity of the aforementioned data structure of Andoni et al. On the way, we resolve a question of Naor (Analysis and Geometry in Metric Spaces, 2014) and provide an explicit, efficiently computable embedding of $\ell_p$, for $p \ge 2$, into $\ell_2$ with (quadratic) average distortion on the order of $p$. We expect our approach to pave the way for constructing efficient NNS data structures for all norms.

Data Structures and Algorithms, Computational Geometry, Metric Geometry

Lower Bounds for Oblivious Near-Neighbor Search

Kasper Green Larsen, Tal Malkin, Omri Weinstein (2019)

We prove an $\Omega(d \lg n/ (\lg\lg n)^2)$ lower bound on the dynamic cell-probe complexity of statistically $\mathit{oblivious}$ approximate-near-neighbor search ($\mathsf{ANN}$) over the $d$-dimensional Hamming cube. For the natural setting of $d = \Theta(\log n)$, our result implies an $\tilde{\Omega}(\lg^2 n)$ lower bound, which is a quadratic improvement over the highest (non-oblivious) cell-probe lower bound for $\mathsf{ANN}$. This is the first super-logarithmic $\mathit{unconditional}$ lower bound for $\mathsf{ANN}$ against general (non black-box) data structures. We also show that any oblivious $\mathit{static}$ data structure for decomposable search problems (like $\mathsf{ANN}$) can be obliviously dynamized with $O(\log n)$ overhead in update and query time, strengthening a classic result of Bentley and Saxe (Algorithmica, 1980).

Data Structures and Algorithms, Cryptography and Security

Practical Near Neighbor Search via Group Testing

Joshua Engels, Benjamin Coleman, Anshumali Shrivastava (2021)

We present a new algorithm for the approximate near neighbor problem that combines classical ideas from group testing with locality-sensitive hashing (LSH). We reduce the near neighbor search problem to a group testing problem by designating neighbors as positives, non-neighbors as negatives, and approximate membership queries as group tests. We instantiate this framework using distance-sensitive Bloom Filters to Identify Near-Neighbor Groups (FLINNG). We prove that FLINNG has sub-linear query time and show that our algorithm comes with a variety of practical advantages. For example, FLINNG can be constructed in a single pass through the data, consists entirely of efficient integer operations, and does not require any distance computations. We conduct large-scale experiments on high-dimensional search tasks such as genome search, URL similarity search, and embedding search over the massive YFCC100M dataset. In our comparison with leading algorithms such as HNSW and FAISS, we find that FLINNG can provide up to a 10x query speedup with substantially smaller indexing time and memory.
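The reduction to group testing described in this abstract can be illustrated with a heavily simplified sketch. Items are assigned at random to groups in several repetitions, each group is tested for whether it contains a positive, and an item survives only if every group it belongs to tests positive. This is not FLINNG itself: the exact oracle `exact_group_test` below is a hypothetical stand-in for the distance-sensitive Bloom filters, and the names `build_groups` and `decode` are assumptions for illustration.

```python
import random


def build_groups(n_items, n_groups, n_reps, seed=0):
    """assignment[r][i] is the group of item i in repetition r (random pooling)."""
    rng = random.Random(seed)
    return [[rng.randrange(n_groups) for _ in range(n_items)]
            for _ in range(n_reps)]


def exact_group_test(assignment, neighbors):
    """Hypothetical oracle: group (r, g) is positive iff it holds a true neighbor.

    A real system would answer this approximately; here it is exact.
    """
    positive = [{groups[i] for i in neighbors} for groups in assignment]
    return lambda r, g: g in positive[r]


def decode(assignment, group_test):
    """Keep only the items whose group tests positive in every repetition."""
    candidates = set(range(len(assignment[0])))
    for r, groups in enumerate(assignment):
        candidates = {i for i in candidates if group_test(r, groups[i])}
    return candidates


assignment = build_groups(n_items=20, n_groups=5, n_reps=4, seed=1)
neighbors = {3, 7}
candidates = decode(assignment, exact_group_test(assignment, neighbors))
```

With an exact oracle, every true neighbor always survives decoding, while non-neighbors survive only if they happen to share a group with a neighbor in every repetition. Adding repetitions shrinks that set of false candidates.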

Data Structures and Algorithms

Efficient Autotuning of Hyperparameters in Approximate Nearest Neighbor Search

Elias Jaasaari, Ville Hyvonen, Teemu Roos (2018)

Approximate nearest neighbor algorithms are used to speed up nearest neighbor search in a wide array of applications. However, current indexing methods feature several hyperparameters that need to be tuned to reach an acceptable accuracy-speed trade-off. A grid search in the parameter space is often impractically slow due to a time-consuming index-building procedure. Therefore, we propose an algorithm for automatically tuning the hyperparameters of indexing methods based on randomized space-partitioning trees. In particular, we present results using randomized k-d trees, random projection trees and randomized PCA trees. The tuning algorithm adds minimal overhead to the index-building process but is able to find the optimal hyperparameters accurately. We demonstrate that the algorithm is significantly faster than existing approaches, and that the indexing methods used are competitive with the state-of-the-art methods in query time while being faster to build.

Data Structures and Algorithms, Machine Learning

Kernel Density Estimation through Density Constrained Near Neighbor Search

Moses Charikar, Michael Kapralov, Navid Nouri (2020)

In this paper we revisit the kernel density estimation problem: given a kernel $K(x, y)$ and a dataset of $n$ points in high dimensional Euclidean space, prepare a data structure that can quickly output, given a query $q$, a $(1+\epsilon)$-approximation to $\mu:=\frac{1}{|P|}\sum_{p\in P} K(p, q)$. First, we give a single data structure based on classical near neighbor search techniques that improves upon or essentially matches the query time and space complexity for all radial kernels considered in the literature so far. We then show how to improve both the query complexity and runtime by using recent advances in data-dependent near neighbor search. We achieve our results by giving a new implementation of the natural importance sampling scheme. Unlike previous approaches, our algorithm first samples the dataset uniformly (considering a geometric sequence of sampling rates), and then uses existing approximate near neighbor search techniques on the resulting smaller dataset to retrieve the sampled points that lie at an appropriate distance from the query. We show that the resulting sampled dataset has strong geometric structure, making approximate near neighbor search return the required samples much more efficiently than for worst case datasets of the same size. As an example application, we show that this approach yields a data structure that achieves query time $\mu^{-(1+o(1))/4}$ and space complexity $\mu^{-(1+o(1))}$ for the Gaussian kernel. Our data dependent approach achieves query time $\mu^{-0.173-o(1)}$ and space $\mu^{-(1+o(1))}$ for the Gaussian kernel. The data dependent analysis relies on new techniques for tracking the geometric structure of the input datasets in a recursive hashing process that we hope will be of interest in other applications in near neighbor search.
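For reference, the quantity being approximated, the average of $K(p, q)$ over the dataset $P$, can be computed exactly by brute force in time linear in $|P|$. The sketch below shows only that baseline, not the data structure from the paper. The unit-bandwidth convention $K(p, q) = \exp(-\|p - q\|^2)$ and the function names are assumptions for illustration.

```python
import math


def gaussian_kernel(p, q):
    """Gaussian kernel with an assumed unit bandwidth: exp(-||p - q||^2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)))


def exact_kde(points, q, kernel=gaussian_kernel):
    """Brute-force kernel density at q: the mean of kernel(p, q) over all points."""
    if not points:
        raise ValueError("points must be non-empty")
    return sum(kernel(p, q) for p in points) / len(points)


points = [(0.0, 0.0), (1.0, 0.0)]
mu = exact_kde(points, (0.0, 0.0))  # equals (1 + exp(-1)) / 2
```

Every query to this baseline touches all points, which is the cost that sub-linear KDE data structures are designed to avoid.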

Data Structures and Algorithms
