ﻻ يوجد ملخص باللغة العربية
In this extended abstract, we describe and analyze a lossy compression of MinHash from buckets of size $O(log n)$ to buckets of size $O(loglog n)$ by encoding using floating-point notation. This new compressed sketch, which we call HyperMinHash, as we build off a HyperLogLog scaffold, can be used as a drop-in replacement of MinHash. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHashs features of streaming updates, unions, and cardinality estimation. For a multiplicative approximation error $1+ epsilon$ on a Jaccard index $ t $, given a random oracle, HyperMinHash needs $Oleft(epsilon^{-2} left( loglog n + log frac{1}{ t epsilon} right)right)$ space. HyperMinHash allows estimating Jaccard indices of 0.01 for set cardinalities on the order of $10^{19}$ with relative error of around 10% using 64KiB of memory; MinHash can only estimate Jaccard indices for cardinalities of $10^{10}$ with the same memory consumption.
Traditional minwise hashing (MinHash) requires applying $K$ independent permutations to estimate the Jaccard similarity in massive binary (0/1) data, where $K$ can be (e.g.,) 1024 or even larger, depending on applications. The recent work on C-MinHas
Like [1], we present an algorithm to compute the simulation of a query pattern in a graph of labeled nodes and unlabeled edges. However, our algorithm works on a compressed graph grammar, instead of on the original graph. The speed-up of our algorith
Irreducible frequent patters (IFPs) are introduced for transactional databases. An IFP is such a frequent pattern (FP),(x1,x2,...xn), the probability of which, P(x1,x2,...xn), cannot be represented as a product of the probabilities of two (or more) o
Motivated by crowdsourced computation, peer-grading, and recommendation systems, Braverman, Mao and Weinberg [STOC16] studied the emph{query} and emph{round} complexity of fundamental problems such as finding the maximum (textsc{max}), finding all el
Network reliability is an important metric to evaluate the connectivity among given vertices in uncertain graphs. Since the network reliability problem is known as #P-complete, existing studies have used approximation techniques. In this paper, we pr