\emph{Compressed Counting (CC)} was recently proposed for approximating the $\alpha$th frequency moments of data streams, for $0<\alpha \leq 2$. Under the relaxed strict-Turnstile model, CC dramatically improves the standard algorithm based on \emph{symmetric stable random projections}, especially as $\alpha\to 1$. A direct application of CC is to estimate the entropy, which is an important summary statistic in Web/network measurement and often serves as a crucial feature for data mining. The R\'enyi entropy and the Tsallis entropy are functions of the $\alpha$th frequency moments; both approach the Shannon entropy as $\alpha\to 1$. A recent theoretical work suggested using the $\alpha$th frequency moment to approximate the Shannon entropy with $\alpha=1+\delta$ and very small $|\delta|$ (e.g., $|\delta|<10^{-4}$). In this study, we experiment with using CC to estimate frequency moments, R\'enyi entropy, Tsallis entropy, and Shannon entropy on real Web crawl data. We demonstrate the variance-bias trade-off in estimating Shannon entropy and provide practical recommendations. In particular, our experiments enable us to draw some important conclusions: (1) As $\alpha\to 1$, CC dramatically improves \emph{symmetric stable random projections} in estimating frequency moments, R\'enyi entropy, Tsallis entropy, and Shannon entropy. The improvements appear to approach infinity. (2) Using \emph{symmetric stable random projections} with $\alpha = 1+\delta$ and very small $|\delta|$ does not provide a practical algorithm because the required sample size is enormous.
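For concreteness, the sketch below (in Python, not part of the original work) shows the exact relationship the abstract relies on: writing $F_\alpha=\sum_i x_i^\alpha$ for the counts $x_i$ of a stream and $p_i=x_i/F_1$, one has $\sum_i p_i^\alpha = F_\alpha/F_1^\alpha$, so the R\'enyi entropy $\frac{1}{1-\alpha}\log\sum_i p_i^\alpha$ and the Tsallis entropy $\frac{1}{\alpha-1}\bigl(1-\sum_i p_i^\alpha\bigr)$ are functions of the frequency moments and both converge to the Shannon entropy as $\alpha\to 1$. This is only an illustration of the limit using exact moments on a hypothetical count vector; it is not the CC estimator itself, which approximates $F_\alpha$ from a stream via random projections.

```python
import numpy as np

def entropies_from_moments(counts, alpha):
    """Exact alpha-th frequency moment and the Renyi/Tsallis entropies
    it induces, for a vector of (positive) item counts."""
    x = np.asarray(counts, dtype=float)
    F1 = x.sum()                        # first frequency moment (total count)
    F_alpha = np.sum(x ** alpha)        # alpha-th frequency moment
    ratio = F_alpha / F1 ** alpha       # equals sum_i p_i^alpha
    renyi = np.log(ratio) / (1.0 - alpha)
    tsallis = (1.0 - ratio) / (alpha - 1.0)
    return F_alpha, renyi, tsallis

def shannon(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return -np.sum(p * np.log(p))

# Hypothetical example: a small, heavily skewed count vector.
counts = [1000, 500, 200, 50, 10, 5, 1]
for alpha in (1.5, 1.1, 1.01, 1.001):   # alpha -> 1
    _, h_renyi, h_tsallis = entropies_from_moments(counts, alpha)
    print(f"alpha={alpha}: Renyi={h_renyi:.4f}, Tsallis={h_tsallis:.4f}")
print(f"Shannon: {shannon(counts):.4f}")
```

In a streaming setting one would replace the exact $F_\alpha$ above with an estimate (e.g., from CC); the $1/(\alpha-1)$ factor then amplifies the estimation error as $\alpha\to 1$, which is the variance-bias trade-off the experiments examine.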
The Jaccard index is an important similarity measure for item sets and Boolean data. On large datasets, an exact similarity computation is often infeasible for all item pairs both due to time and space constraints, giving rise to faster approximate m
Compressed Counting (CC) was recently proposed for very efficiently computing the (approximate) $\alpha$th frequency moments of data streams, where $0<\alpha \leq 2$. Several estimators were reported including the geometric mean estimator, the harmonic m
A sketch is a probabilistic data structure used to record frequencies of items in a multi-set. Sketches are widely used in various fields, especially those that involve processing and storing data streams. In streaming applications with high data rat
Clustering is a fundamental tool for analyzing large data sets. A rich body of work has been devoted to designing data-stream algorithms for the relevant optimization problems such as $k$-center, $k$-median, and $k$-means. Such algorithms need to be
Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating Jaccard similarity of sets and has been