ترغب بنشر مسار تعليمي؟ اضغط هنا

C-MinHash: Rigorously Reducing $K$ Permutations to Two

78   0   0.0 ( 0 )
 نشر من قبل Ping Li
 تاريخ النشر 2021
والبحث باللغة English




اسأل ChatGPT حول البحث

Minwise hashing (MinHash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data. The basic theory of MinHash requires applying hundreds or even thousands of independent random permutations to each data vector in the dataset, in order to obtain reliable results for (e.g.,) building large-scale learning models or approximate near neighbor search in massive data. In this paper, we propose {bf Circulant MinHash (C-MinHash)} and provide the surprising theoretical results that we just need textbf{two} independent random permutations. For C-MinHash, we first conduct an initial permutation on the data vector, then we use a second permutation to generate hash values. Basically, the second permutation is re-used $K$ times via circulant shifting to produce $K$ hashes. Unlike classical MinHash, these $K$ hashes are obviously correlated, but we are able to provide rigorous proofs that we still obtain an unbiased estimate of the Jaccard similarity and the theoretical variance is uniformly smaller than that of the classical MinHash with $K$ independent permutations. The theoretical proofs of C-MinHash require some non-trivial efforts. Numerical experiments are conducted to justify the theory and demonstrate the effectiveness of C-MinHash.

قيم البحث

اقرأ أيضاً

172 - Xiaoyun Li , Ping Li 2021
Traditional minwise hashing (MinHash) requires applying $K$ independent permutations to estimate the Jaccard similarity in massive binary (0/1) data, where $K$ can be (e.g.,) 1024 or even larger, depending on applications. The recent work on C-MinHas h (Li and Li, 2021) has shown, with rigorous proofs, that only two permutations are needed. An initial permutation is applied to break whatever structures which might exist in the data, and a second permutation is re-used $K$ times to produce $K$ hashes, via a circulant shifting fashion. (Li and Li, 2021) has proved that, perhaps surprisingly, even though the $K$ hashes are correlated, the estimation variance is strictly smaller than the variance of the traditional MinHash. It has been demonstrated in (Li and Li, 2021) that the initial permutation in C-MinHash is indeed necessary. For the ease of theoretical analysis, they have used two independent permutations. In this paper, we show that one can actually simply use one permutation. That is, one single permutation is used for both the initial pre-processing step to break the structures in the data and the circulant hashing step to generate $K$ hashes. Although the theoretical analysis becomes very complicated, we are able to explicitly write down the expression for the expectation of the estimator. The new estimator is no longer unbiased but the bias is extremely small and has essentially no impact on the estimation accuracy (mean square errors). An extensive set of experiments are provided to verify our claim for using just one permutation.
Game-theoretic attribution techniques based on Shapley values are used extensively to interpret black-box machine learning models, but their exact calculation is generally NP-hard, requiring approximation methods for non-trivial models. As the comput ation of Shapley values can be expressed as a summation over a set of permutations, a common approach is to sample a subset of these permutations for approximation. Unfortunately, standard Monte Carlo sampling methods can exhibit slow convergence, and more sophisticated quasi Monte Carlo methods are not well defined on the space of permutations. To address this, we investigate new approaches based on two classes of approximation methods and compare them empirically. First, we demonstrate quadrature techniques in a RKHS containing functions of permutations, using the Mallows kernel to obtain explicit convergence rates of $O(1/n)$, improving on $O(1/sqrt{n})$ for plain Monte Carlo. The RKHS perspective also leads to quasi Monte Carlo type error bounds, with a tractable discrepancy measure defined on permutations. Second, we exploit connections between the hypersphere $mathbb{S}^{d-2}$ and permutations to create practical algorithms for generating permutation samples with good properties. Experiments show the above techniques provide significant improvements for Shapley value estimates over existing methods, converging to a smaller RMSE in the same number of model evaluations.
Bagging, a powerful ensemble method from machine learning, improves the performance of unstable predictors. Although the power of Bagging has been shown mostly in classification problems, we demonstrate the success of employing Bagging in sparse regr ession over the baseline method (L1 minimization). The framework employs the generalized version of the original Bagging with various bootstrap ratios. The performance limits associated with different choices of bootstrap sampling ratio L/m and number of estimates K is analyzed theoretically. Simulation shows that the proposed method yields state-of-the-art recovery performance, outperforming L1 minimization and Bolasso in the challenging case of low levels of measurements. A lower L/m ratio (60% - 90%) leads to better performance, especially with a small number of measurements. With the reduced sampling rate, SNR improves over the original Bagging by up to 24%. With a properly chosen sampling ratio, a reasonably small number of estimates K = 30 gives satisfying result, even though increasing K is discovered to always improve or at least maintain the performance.
The C-bound, introduced in Lacasse et al., gives a tight upper bound on the risk of a binary majority vote classifier. In this work, we present a first step towards extending this work to more complex outputs, by providing generalizations of the C-bound to the multiclass and multi-label settings.
We present an asymptotic criterion to determine the optimal number of clusters in k-means. We consider k-means as data compression, and propose to adopt the number of clusters that minimizes the estimated description length after compression. Here we report two types of compression ratio based on two ways to quantify the description length of data after compression. This approach further offers a way to evaluate whether clusters obtained with k-means have a hierarchical structure by examining whether multi-stage compression can further reduce the description length. We applied our criteria to determine the number of clusters to synthetic data and empirical neuroimaging data to observe the behavior of the criteria across different types of data set and suitability of the two types of criteria for different datasets. We found that our method can offer reasonable clustering results that are useful for dimension reduction. While our numerical results revealed dependency of our criteria on the various aspects of dataset such as the dimensionality, the description length approach proposed here provides a useful guidance to determine the number of clusters in a principled manner when underlying properties of the data are unknown and only inferred from observation of data.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا