Practical Near Neighbor Search via Group Testing

Published by: Benjamin Coleman

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Authors: Joshua Engels, Benjamin Coleman, Anshumali Shrivastava

Data Structures and Algorithms

Abstract

We present a new algorithm for the approximate near neighbor problem that combines classical ideas from group testing with locality-sensitive hashing (LSH). We reduce the near neighbor search problem to a group testing problem by designating neighbors as positives, non-neighbors as negatives, and approximate membership queries as group tests. We instantiate this framework using distance-sensitive Bloom Filters to Identify Near-Neighbor Groups (FLINNG). We prove that FLINNG has sub-linear query time and show that our algorithm comes with a variety of practical advantages. For example, FLINNG can be constructed in a single pass through the data, consists entirely of efficient integer operations, and does not require any distance computations. We conduct large-scale experiments on high-dimensional search tasks such as genome search, URL similarity search, and embedding search over the massive YFCC100M dataset. In our comparison with leading algorithms such as HNSW and FAISS, we find that FLINNG can provide up to a 10x query speedup with substantially smaller indexing time and memory.
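The reduction to group testing can be illustrated with a minimal sketch. It assumes 1-D points and an exact brute-force group test as a stand-in for the approximate membership query; FLINNG itself uses distance-sensitive Bloom filters and performs no distance computations, so this is not the authors' implementation. The names `build_groups`, `group_test`, and `query_neighbors` are hypothetical.

```python
import random

def build_groups(n_points, n_groups, n_reps, seed=0):
    """For each repetition, randomly partition point indices into groups."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_reps):
        groups = [[] for _ in range(n_groups)]
        for idx in range(n_points):
            groups[rng.randrange(n_groups)].append(idx)
        reps.append(groups)
    return reps

def group_test(points, members, query, radius):
    """Stand-in group test (assumption): True if any member is within radius.
    A real system would answer this with an approximate membership query."""
    return any(abs(points[i] - query) <= radius for i in members)

def query_neighbors(points, reps, query, radius):
    """Keep only points that fall in a positive group in every repetition."""
    candidates = set(range(len(points)))
    for groups in reps:
        positive = set()
        for members in groups:
            if members and group_test(points, members, query, radius):
                positive.update(members)
        candidates &= positive
    return candidates

points = [0.0, 0.1, 5.0, 9.0, 9.05, 20.0]
reps = build_groups(len(points), n_groups=3, n_reps=4, seed=1)
found = query_neighbors(points, reps, query=0.05, radius=0.2)
```

Because the stand-in test is exact, every true neighbor survives the intersection; non-neighbors survive only if they share a group with a neighbor in every repetition, which becomes less likely as repetitions are added.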

Deepanshu Kush, Aleksandar Nikolov, Haohua Tang, 2021

A recent series of papers by Andoni, Naor, Nikolov, Razenshteyn, and Waingarten (STOC 2018, FOCS 2018) has given approximate near neighbour search (NNS) data structures for a wide class of distance metrics, including all norms. In particular, these data structures achieve approximation on the order of $p$ for $\ell_p^d$ norms with space complexity nearly linear in the dataset size $n$ and polynomial in the dimension $d$, and query time sub-linear in $n$ and polynomial in $d$. The main shortcoming is the exponential in $d$ pre-processing time required for their construction. In this paper, we describe a more direct framework for constructing NNS data structures for general norms. More specifically, we show via an algorithmic reduction that an efficient NNS data structure for a given metric is implied by an efficient average distortion embedding of it into $\ell_1$ or into Euclidean space. In particular, the resulting data structures require only polynomial pre-processing time, as long as the embedding can be computed in polynomial time. As a concrete instantiation of this framework, we give an NNS data structure for $\ell_p$ with efficient pre-processing that matches the approximation factor, space and query complexity of the aforementioned data structure of Andoni et al. On the way, we resolve a question of Naor (Analysis and Geometry in Metric Spaces, 2014) and provide an explicit, efficiently computable embedding of $\ell_p$, for $p \ge 2$, into $\ell_2$ with (quadratic) average distortion on the order of $p$. We expect our approach to pave the way for constructing efficient NNS data structures for all norms.

Data Structures and Algorithms, Computational Geometry, Metric Geometry

Lower Bounds for Oblivious Near-Neighbor Search

Kasper Green Larsen, Tal Malkin, Omri Weinstein, 2019

We prove an $\Omega(d \lg n / (\lg\lg n)^2)$ lower bound on the dynamic cell-probe complexity of statistically $\mathit{oblivious}$ approximate-near-neighbor search ($\mathsf{ANN}$) over the $d$-dimensional Hamming cube. For the natural setting of $d = \Theta(\log n)$, our result implies an $\tilde{\Omega}(\lg^2 n)$ lower bound, which is a quadratic improvement over the highest (non-oblivious) cell-probe lower bound for $\mathsf{ANN}$. This is the first super-logarithmic $\mathit{unconditional}$ lower bound for $\mathsf{ANN}$ against general (non black-box) data structures. We also show that any oblivious $\mathit{static}$ data structure for decomposable search problems (like $\mathsf{ANN}$) can be obliviously dynamized with $O(\log n)$ overhead in update and query time, strengthening a classic result of Bentley and Saxe (Algorithmica, 1980).
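For readers unfamiliar with dynamization of decomposable search problems, the following is a minimal sketch of the classic logarithmic method attributed to Bentley and Saxe, in its ordinary non-oblivious form and restricted to insertions. It does not implement the oblivious construction of this paper. The static structure here is assumed to be a sorted array answering exact 1-D nearest-neighbor queries, and the class names `StaticNN` and `LogarithmicMethod` are hypothetical.

```python
import bisect

class StaticNN:
    """Static 1-D nearest-neighbor structure: a sorted array (assumption)."""
    def __init__(self, items):
        self.items = sorted(items)

    def nearest(self, q):
        i = bisect.bisect_left(self.items, q)
        # Only the elements just below and just above q can be nearest.
        candidates = self.items[max(i - 1, 0):i + 1]
        return min(candidates, key=lambda x: abs(x - q))

class LogarithmicMethod:
    """Keeps static structures of sizes 2**k, merged like a binary counter."""
    def __init__(self):
        self.levels = []  # levels[k] is None or a StaticNN over 2**k items

    def insert(self, x):
        carry = [x]
        k = 0
        while True:
            if k == len(self.levels):
                self.levels.append(None)
            if self.levels[k] is None:
                self.levels[k] = StaticNN(carry)  # rebuild from scratch
                return
            carry += self.levels[k].items  # absorb the full level and move up
            self.levels[k] = None
            k += 1

    def nearest(self, q):
        # Decomposability: the global answer is the best per-level answer.
        answers = [lvl.nearest(q) for lvl in self.levels if lvl is not None]
        return min(answers, key=lambda x: abs(x - q))

index = LogarithmicMethod()
for value in [5, 1, 9, 3, 7]:
    index.insert(value)
level_sizes = [len(lvl.items) if lvl else 0 for lvl in index.levels]
```

A query touches at most one structure per level, so with $n$ inserted items it consults $O(\log n)$ static structures; calling `nearest` on an empty index raises `ValueError` in this sketch.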

Data Structures and Algorithms, Cryptography and Security

Graph Reordering for Cache-Efficient Near Neighbor Search

Benjamin Coleman, Santiago Segarra, Anshumali Shrivastava, 2021

Graph search is one of the most successful algorithmic trends in near neighbor search. Several of the most popular and empirically successful algorithms are, at their core, a simple walk along a pruned near neighbor graph. Such algorithms consistently perform at the top of industrial speed benchmarks for applications such as embedding search. However, graph traversal applications often suffer from poor memory access patterns, and near neighbor search is no exception to this rule. Our measurements show that popular search indices such as the hierarchical navigable small-world graph (HNSW) can have poor cache miss performance. To address this problem, we apply graph reordering algorithms to near neighbor graphs. Graph reordering is a memory layout optimization that groups commonly-accessed nodes together in memory. We present exhaustive experiments applying several reordering algorithms to a leading graph-based near neighbor method based on the HNSW index. We find that reordering improves the query time by up to 40%, and we demonstrate that the time needed to reorder the graph is negligible compared to the time required to construct the index.
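As a rough illustration of what a reordering does, the sketch below relabels the nodes of an adjacency-list graph in breadth-first visit order, so that nodes reached together receive nearby ids and therefore nearby memory positions. Breadth-first order is chosen here only as a simple stand-in; the abstract does not say which reordering algorithms were evaluated, and the function names `bfs_order` and `relabel` are hypothetical.

```python
from collections import deque

def bfs_order(adj, start=0):
    """Return node ids in BFS visit order, covering every component."""
    if not adj:
        return []
    seen = [False] * len(adj)
    order = []
    for root in [start] + list(range(len(adj))):
        if seen[root]:
            continue
        seen[root] = True
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    queue.append(v)
    return order

def relabel(adj, order):
    """Rewrite the graph so that the node visited i-th gets id i."""
    new_id = {old: new for new, old in enumerate(order)}
    return [[new_id[v] for v in adj[old]] for old in order]

# Two components: {0, 2, 4} and {1, 3}.
adj = [[2, 4], [3], [0, 4], [1], [0, 2]]
order = bfs_order(adj)
reordered = relabel(adj, order)
```

After relabeling, each component occupies a contiguous range of ids, which is the layout property a cache-aware reordering aims for.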

Data Structures and Algorithms

Kernel Density Estimation through Density Constrained Near Neighbor Search

Moses Charikar, Michael Kapralov, Navid Nouri, 2020

In this paper we revisit the kernel density estimation problem: given a kernel $K(x, y)$ and a dataset of $n$ points in high dimensional Euclidean space, prepare a data structure that can quickly output, given a query $q$, a $(1+\epsilon)$-approximation to $\mu := \frac{1}{|P|}\sum_{p \in P} K(p, q)$. First, we give a single data structure based on classical near neighbor search techniques that improves upon or essentially matches the query time and space complexity for all radial kernels considered in the literature so far. We then show how to improve both the query complexity and runtime by using recent advances in data-dependent near neighbor search. We achieve our results by giving a new implementation of the natural importance sampling scheme. Unlike previous approaches, our algorithm first samples the dataset uniformly (considering a geometric sequence of sampling rates), and then uses existing approximate near neighbor search techniques on the resulting smaller dataset to retrieve the sampled points that lie at an appropriate distance from the query. We show that the resulting sampled dataset has strong geometric structure, making approximate near neighbor search return the required samples much more efficiently than for worst case datasets of the same size. As an example application, we show that this approach yields a data structure that achieves query time $\mu^{-(1+o(1))/4}$ and space complexity $\mu^{-(1+o(1))}$ for the Gaussian kernel. Our data dependent approach achieves query time $\mu^{-0.173-o(1)}$ and space $\mu^{-(1+o(1))}$ for the Gaussian kernel. The data dependent analysis relies on new techniques for tracking the geometric structure of the input datasets in a recursive hashing process that we hope will be of interest in other applications in near neighbor search.
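To make the target quantity concrete, the sketch below computes the kernel density $\mu$ exactly and estimates it from a plain uniform sample. It assumes a Gaussian kernel of the form $K(p, q) = e^{-\|p - q\|^2}$ with unit bandwidth, and it is only a baseline: it does not implement the paper's importance sampling scheme or its use of near neighbor search. The function names are hypothetical.

```python
import math
import random

def gaussian_kernel(p, q):
    """Assumed kernel: exp(-||p - q||^2), unit bandwidth."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)))

def exact_kde(points, q):
    """mu = (1/|P|) * sum over p in P of K(p, q)."""
    return sum(gaussian_kernel(p, q) for p in points) / len(points)

def sampled_kde(points, q, sample_size, seed=0):
    """Baseline estimate of mu from a uniform sample without replacement."""
    rng = random.Random(seed)
    sample = rng.sample(points, sample_size)
    return sum(gaussian_kernel(p, q) for p in sample) / sample_size

points = [(0.0,), (1.0,)]
query = (0.0,)
mu = exact_kde(points, query)
mu_full_sample = sampled_kde(points, query, sample_size=len(points))
```

Plain uniform sampling needs on the order of $1/\mu$ samples to see enough points with non-negligible kernel value, which is why schemes that retrieve points by distance from the query are attractive when $\mu$ is small.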

Data Structures and Algorithms

Nearest Neighbor Search for Hyperbolic Embeddings

Xian Wu, Moses Charikar, 2020

Embedding into hyperbolic space is emerging as an effective representation technique for datasets that exhibit hierarchical structure. This development motivates the need for algorithms that are able to effectively extract knowledge and insights from datapoints embedded in negatively curved spaces. We focus on the problem of nearest neighbor search, a fundamental problem in data analysis. We present efficient algorithmic solutions that build upon established methods for nearest neighbor search in Euclidean space, allowing for easy adoption and integration with existing systems. We prove theoretical guarantees for our techniques and our experiments demonstrate the effectiveness of our approach on real datasets over competing algorithms.

Data Structures and Algorithms
