Approximate Nearest Neighbor Search as a Multi-Label Classification Problem

249 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Ville Hyv\\\"onen

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية الاحصاء الرياضي

والبحث باللغة English

تأليف Ville Hyvonen - Elias Jaasaari - Teemu Roos

التعلم الآلي التعلم الالي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We formulate approximate nearest neighbor (ANN) search as a multi-label classification task. The implications are twofold. First, tree-based indexes can be searched more efficiently by interpreting them as models to solve this task. Second, in addition to index structures designed specifically for ANN search, any type of classifier can be used as an index.

قيم البحث

93 - Jiaqi Lv , Tianran Wu , Chenglun Peng 2020

Multi-label classification (MLC) studies the problem where each instance is associated with multiple relevant labels, which leads to the exponential growth of output space. MLC encourages a popular framework named label compression (LC) for capturing label dependency with dimension reduction. Nevertheless, most existing LC methods failed to consider the influence of the feature space or misguided by original problematic features, so that may result in performance degeneration. In this paper, we present a compact learning (CL) framework to embed the features and labels simultaneously and with mutual guidance. The proposal is a versatile concept, hence the embedding way is arbitrary and independent of the subsequent learning process. Following its spirit, a simple yet effective implementation called compact multi-label learning (CMLL) is proposed to learn a compact low-dimensional representation for both spaces. CMLL maximizes the dependence between the embedded spaces of the labels and features, and minimizes the loss of label space recovery concurrently. Theoretically, we provide a general analysis for different embedding methods. Practically, we conduct extensive experiments to validate the effectiveness of the proposed method.

التعلم الآلي التعلم الالي

Efficient Autotuning of Hyperparameters in Approximate Nearest Neighbor Search

76 - Elias Jaasaari , Ville Hyvonen , Teemu Roos 2018

Approximate nearest neighbor algorithms are used to speed up nearest neighbor search in a wide array of applications. However, current indexing methods feature several hyperparameters that need to be tuned to reach an acceptable accuracy--speed trade -off. A grid search in the parameter space is often impractically slow due to a time-consuming index-building procedure. Therefore, we propose an algorithm for automatically tuning the hyperparameters of indexing methods based on randomized space-partitioning trees. In particular, we present results using randomized k-d trees, random projection trees and randomized PCA trees. The tuning algorithm adds minimal overhead to the index-building process but is able to find the optimal hyperparameters accurately. We demonstrate that the algorithm is significantly faster than existing approaches, and that the indexing methods used are competitive with the state-of-the-art methods in query time while being faster to build.

بنى وهياكل البيانات والخوارزميات التعلم الآلي التعلم الالي

Compression Network with Transformer for Approximate Nearest Neighbor Search

426 - Haokui Zhang , Wenze Hu , Buzhou Tang 2021

We propose a generic feature compression method for Approximate Nearest Neighbor Search (ANNS) problems, which speeds up existing ANNS methods in a plug-and-play manner. Specifically, we propose a new network structure called Compression Network with Transformer (CNT) to compress the feature into a low dimensional space, and an inhomogeneous neighborhood relationship preserving (INRP) loss that aims to maintain high search accuracy. In CNT, we use multiple compression projections to cast the feature into many low dimensional spaces, and then use transformer to globally optimize these projections such that the features are well compressed following the guidance from our loss function. The loss function is designed to assign high weights on point pairs that are close in original feature space, and keep their distances in projected space. Keeping these distances helps maintain the eventual top-k retrieval accuracy, and down weighting others creates room for feature compression. In experiments, we run our compression method on public datasets, and use the compressed features in graph based, product quantization and scalar quantization based ANNS solutions. Experimental results show that our compression method can significantly improve the efficiency of these methods while preserves or even improves search accuracy, suggesting its broad potential impact on real world applications.

استرجاع المعلومات

A Note on Graph-Based Nearest Neighbor Search

66 - Hongya Wang , Zhizheng Wang , Wei Wang 2020

Nearest neighbor search has found numerous applications in machine learning, data mining and massive data processing systems. The past few years have witnessed the popularity of the graph-based nearest neighbor search paradigm because of its superior ity over the space-partitioning algorithms. While a lot of empirical studies demonstrate the efficiency of graph-based algorithms, not much attention has been paid to a more fundamental question: why graph-based algorithms work so well in practice? And which data property affects the efficiency and how? In this paper, we try to answer these questions. Our insight is that the probability that the neighbors of a point o tends to be neighbors in the KNN graph is a crucial data property for query efficiency. For a given dataset, such a property can be qualitatively measured by clustering coefficient of the KNN graph. To show how clustering coefficient affects the performance, we identify that, instead of the global connectivity, the local connectivity around some given query q has more direct impact on recall. Specifically, we observed that high clustering coefficient makes most of the k nearest neighbors of q sit in a maximum strongly connected component (SCC) in the graph. From the algorithmic point of view, we show that the search procedure is actually composed of two phases - the one outside the maximum SCC and the other one in it, which is different from the widely accepted single or multiple paths search models. We proved that the commonly used graph-based search algorithm is guaranteed to traverse the maximum SCC once visiting any point in it. Our analysis reveals that high clustering coefficient leads to large size of the maximum SCC, and thus provides good answer quality with the help of the two-phase search procedure. Extensive empirical results over a comprehensive collection of datasets validate our findings.

التعلم الآلي

Extreme Multi-label Classification from Aggregated Labels

161 - Yanyao Shen , Hsiang-fu Yu , Sujay Sanghavi 2020

Extreme multi-label classification (XMC) is the problem of finding the relevant labels for an input, from a very large universe of possible labels. We consider XMC in the setting where labels are available only for groups of samples - but not for ind ividual ones. Current XMC approaches are not built for such multi-instance multi-label (MIML) training data, and MIML approaches do not scale to XMC sizes. We develop a new and scalable algorithm to impute individual-sample labels from the group labels; this can be paired with any existing XMC method to solve the aggregated label problem. We characterize the statistical properties of our algorithm under mild assumptions, and provide a new end-to-end framework for MIML as an extension. Experiments on both aggregated label XMC and MIML tasks show the advantages over existing approaches.

التعلم الآلي التعلم الالي