
C-MinHash: Practically Reducing Two Permutations to Just One

Published by: Ping Li
Publication date: 2021
Research field: Informatics Engineering
Paper language: English





Traditional minwise hashing (MinHash) requires applying $K$ independent permutations to estimate the Jaccard similarity in massive binary (0/1) data, where $K$ can be, e.g., 1024 or even larger, depending on the application. The recent work on C-MinHash (Li and Li, 2021) has shown, with rigorous proofs, that only two permutations are needed: an initial permutation is applied to break whatever structures might exist in the data, and a second permutation is re-used $K$ times, via circulant shifting, to produce $K$ hashes. Li and Li (2021) proved that, perhaps surprisingly, even though the $K$ hashes are correlated, the estimation variance is strictly smaller than the variance of traditional MinHash. They also demonstrated that the initial permutation in C-MinHash is indeed necessary. For ease of theoretical analysis, they used two independent permutations. In this paper, we show that one can actually use just one permutation: a single permutation serves both for the initial pre-processing step that breaks the structures in the data and for the circulant hashing step that generates the $K$ hashes. Although the theoretical analysis becomes considerably more complicated, we are able to explicitly write down the expression for the expectation of the estimator. The new estimator is no longer unbiased, but the bias is extremely small and has essentially no impact on the estimation accuracy (mean square error). An extensive set of experiments is provided to verify our claim that one permutation suffices.
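To make the circulant construction concrete, here is a minimal Python sketch of the one-permutation scheme described above. It is an illustration of the idea rather than the authors' reference implementation; the function names and the exact way the shift is applied are our own choices.

```python
import numpy as np

def c_minhash_single_perm(binary_vec, perm, K):
    """Illustrative one-permutation circulant MinHash.

    binary_vec : 1-D numpy array of 0/1 values, length D
    perm       : one permutation of np.arange(D); it is used both to
                 pre-process (shuffle) the data and, circulantly shifted,
                 to generate the K hash values
    K          : number of hashes to produce
    """
    D = len(binary_vec)
    # Step 1: apply the permutation once to break structure in the data.
    shuffled = binary_vec[perm]
    nonzero = np.nonzero(shuffled)[0]
    hashes = np.empty(K, dtype=np.int64)
    for k in range(K):
        # Step 2: re-use the same permutation, evaluated at circulantly
        # shifted coordinates, and keep the smallest permuted location.
        hashes[k] = perm[(nonzero + k) % D].min()
    return hashes

def jaccard_estimate(hashes_u, hashes_v):
    # The collision rate of the K hashes estimates the Jaccard similarity
    # (slightly biased for the one-permutation variant, per the analysis above).
    return float(np.mean(hashes_u == hashes_v))
```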


Read also

Xiaoyun Li, Ping Li (2021)
Minwise hashing (MinHash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data. The basic theory of MinHash requires applying hundreds or even thousands of independent random permutations to each data vector in the dataset, in order to obtain reliable results for, e.g., building large-scale learning models or approximate near neighbor search in massive data. In this paper, we propose \textbf{Circulant MinHash (C-MinHash)} and provide the surprising theoretical result that we need just \textbf{two} independent random permutations. For C-MinHash, we first conduct an initial permutation on the data vector, then we use a second permutation to generate hash values. Basically, the second permutation is re-used $K$ times via circulant shifting to produce $K$ hashes. Unlike classical MinHash, these $K$ hashes are obviously correlated, but we are able to provide rigorous proofs that we still obtain an unbiased estimate of the Jaccard similarity and the theoretical variance is uniformly smaller than that of the classical MinHash with $K$ independent permutations. The theoretical proofs of C-MinHash require some non-trivial effort. Numerical experiments are conducted to justify the theory and demonstrate the effectiveness of C-MinHash.
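For contrast with the circulant scheme, here is a short sketch of the classical MinHash baseline that this abstract compares against, using $K$ independent permutations (one per hash); the toy sizes and random seed below are illustrative choices, not parameters from the paper.

```python
import numpy as np

def classical_minhash(binary_vec, perms):
    """Classical MinHash: one independent permutation per hash value."""
    nonzero = np.nonzero(binary_vec)[0]
    return np.array([perm[nonzero].min() for perm in perms])

# Toy usage: K independent permutations over D coordinates.
rng = np.random.default_rng(0)
D, K = 1024, 128
perms = [rng.permutation(D) for _ in range(K)]
u = (rng.random(D) < 0.05).astype(np.int8)   # sparse binary vector
v = u.copy()
flip = rng.choice(D, size=20, replace=False)
v[flip] = 1 - v[flip]                         # a perturbed copy of u
est = np.mean(classical_minhash(u, perms) == classical_minhash(v, perms))
```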
In this extended abstract, we describe and analyze a lossy compression of MinHash from buckets of size $O(\log n)$ to buckets of size $O(\log\log n)$ by encoding using floating-point notation. This new compressed sketch, which we call HyperMinHash, as we build off a HyperLogLog scaffold, can be used as a drop-in replacement for MinHash. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHash's features of streaming updates, unions, and cardinality estimation. For a multiplicative approximation error $1+\epsilon$ on a Jaccard index $t$, given a random oracle, HyperMinHash needs $O\left(\epsilon^{-2}\left(\log\log n + \log\frac{1}{t\epsilon}\right)\right)$ space. HyperMinHash allows estimating Jaccard indices of 0.01 for set cardinalities on the order of $10^{19}$ with relative error of around 10% using 64 KiB of memory; MinHash can only estimate Jaccard indices for cardinalities of $10^{10}$ with the same memory consumption.
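As a rough illustration of the floating-point compression idea (not the paper's exact HyperMinHash construction, whose parameters and collision handling are more involved), the sketch below keeps only the leading-zero count of a minimum hash value plus a few trailing bits; `hash_bits` and `mantissa_bits` are arbitrary choices.

```python
def fp_compress(min_hash_value, hash_bits=64, mantissa_bits=8):
    """Compress a hash_bits-bit minimum hash value into an (exponent, mantissa)
    pair: the exponent is the leading-zero count (HyperLogLog-style, so it fits
    in O(log log n) bits) and the mantissa keeps the next few bits for extra
    resolution. Illustrative sketch only."""
    if min_hash_value == 0:
        return hash_bits, 0
    leading_zeros = hash_bits - min_hash_value.bit_length()
    # Drop the leading 1 bit, then keep the next mantissa_bits bits.
    tail = min_hash_value & ((1 << (min_hash_value.bit_length() - 1)) - 1)
    shift = max(min_hash_value.bit_length() - 1 - mantissa_bits, 0)
    return leading_zeros, tail >> shift
```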
Given a Counting Monadic Second Order (CMSO) sentence $\psi$, the CMSO$[\psi]$ problem is defined as follows. The input to CMSO$[\psi]$ is a graph $G$, and the objective is to determine whether $G \models \psi$. Our main theorem states that for every CMSO sentence $\psi$, if CMSO$[\psi]$ is solvable in polynomial time on globally highly connected graphs, then CMSO$[\psi]$ is solvable in polynomial time (on general graphs). We demonstrate the utility of our theorem in the design of parameterized algorithms. Specifically, we show that the technical problem-specific ingredients of a powerful method for designing parameterized algorithms, recursive understanding, can be replaced by a black-box invocation of our main theorem. We also show that our theorem can be easily deployed to show fixed-parameter tractability of a wide range of problems, where the input is a graph $G$ and the task is to find a connected induced subgraph of $G$ such that few vertices in this subgraph have neighbors outside the subgraph, and additionally the subgraph has a CMSO-definable property.
We initiate the study of property testing problems concerning relations between permutations. In such problems, the input is a tuple $(\sigma_1,\dotsc,\sigma_d)$ of permutations on $\{1,\dotsc,n\}$, and one wishes to determine whether this tuple satisfies a certain system of relations $E$, or is far from every tuple that satisfies $E$. If this computational problem can be solved by querying only a small number of entries of the given permutations, we say that $E$ is testable. For example, when $d=2$ and $E$ consists of the single relation $\mathsf{XY=YX}$, this corresponds to testing whether $\sigma_1\sigma_2=\sigma_2\sigma_1$, where $\sigma_1\sigma_2$ and $\sigma_2\sigma_1$ denote composition of permutations. We define a collection of graphs, naturally associated with the system $E$, that encodes all the information relevant to the testability of $E$. We then prove two theorems that provide criteria for testability and non-testability in terms of expansion properties of these graphs. By virtue of a deep connection with group theory, both theorems are applicable to wide classes of systems of relations. In addition, we formulate the well-studied group-theoretic notion of stability in permutations as a special case of the testability notion above, interpret all previous works on stability as testability results, survey previous results on stability from a computational perspective, and describe many directions for future research on stability and testability.
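For the concrete example mentioned in this abstract, here is a tiny query-model sketch: sample a handful of points and check whether the two compositions agree. The sample size `q` is an arbitrary illustrative choice, not a query bound from the paper.

```python
import random

def test_commutation(sigma1, sigma2, q=50, seed=0):
    """Query-based check of the relation XY = YX: accept if
    sigma1(sigma2(i)) == sigma2(sigma1(i)) on q sampled points i."""
    rng = random.Random(seed)
    n = len(sigma1)
    for _ in range(q):
        i = rng.randrange(n)
        if sigma1[sigma2[i]] != sigma2[sigma1[i]]:
            return False   # witness that the pair violates XY = YX
    return True
```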
The complexity of a computational problem is traditionally quantified based on the hardness of its worst case. This approach has many advantages and has led to a deep and beautiful theory. However, from the practical perspective, this leaves much to be desired. In application areas, practically interesting instances very often occupy just a tiny part of an algorithm's space of instances, and the vast majority of instances are simply irrelevant. Addressing these issues is a major challenge for theoretical computer science which may make theory more relevant to the practice of computer science. Following Bilu and Linial, we apply this perspective to MAXCUT, viewed as a clustering problem. Using a variety of techniques, we investigate practically interesting instances of this problem. Specifically, we show how to solve in polynomial time distinguished, metric, expanding and dense instances of MAXCUT under mild stability assumptions. In particular, $(1+\epsilon)$-stability (which is optimal) suffices for metric and dense MAXCUT. We also show how to solve in polynomial time $\Omega(\sqrt{n})$-stable instances of MAXCUT, substantially improving the best previously known result.
