
A Very Efficient Scheme for Estimating Entropy of Data Streams Using Compressed Counting

Posted by: Ping Li
Publication date: 2008
Research field: Informatics Engineering
Language: English
Author: Ping Li

Compressed Counting (CC) was recently proposed for approximating the $\alpha$th frequency moments of data streams, for $0<\alpha \leq 2$. Under the relaxed strict-Turnstile model, CC dramatically improves the standard algorithm based on {\em symmetric stable random projections}, especially as $\alpha\to 1$. A direct application of CC is to estimate the entropy, which is an important summary statistic in Web/network measurement and often serves as a crucial feature for data mining. The Rényi entropy and the Tsallis entropy are functions of the $\alpha$th frequency moments, and both approach the Shannon entropy as $\alpha\to 1$. A recent theoretical work suggested using the $\alpha$th frequency moment to approximate the Shannon entropy with $\alpha=1+\delta$ and very small $|\delta|$ (e.g., $|\delta|<10^{-4}$). In this study, we experiment with using CC to estimate frequency moments, Rényi entropy, Tsallis entropy, and Shannon entropy on real Web crawl data. We demonstrate the variance-bias trade-off in estimating Shannon entropy and provide practical recommendations. In particular, our experiments enable us to draw two important conclusions: (1) As $\alpha\to 1$, CC dramatically improves {\em symmetric stable random projections} in estimating frequency moments, Rényi entropy, Tsallis entropy, and Shannon entropy; the improvements appear to approach infinity. (2) Using {\em symmetric stable random projections} with $\alpha = 1+\delta$ and very small $|\delta|$ does not yield a practical algorithm because the required sample size is enormous.
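As a sanity check on the relations this abstract relies on, the following sketch (plain Python/NumPy, not the CC estimator itself) computes the Rényi and Tsallis entropies directly from exact frequency moments of a synthetic heavy-tailed count vector and shows both converging to the Shannon entropy as $\alpha = 1+\delta$ with shrinking $|\delta|$. The Zipf counts, the parameter 1.5, and the sample size are illustrative assumptions.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H = -sum p_i log p_i (natural log)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def renyi_entropy(p, alpha):
    """Renyi entropy H_alpha = log(sum p_i^alpha) / (1 - alpha)."""
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def tsallis_entropy(p, alpha):
    """Tsallis entropy T_alpha = (1 - sum p_i^alpha) / (alpha - 1)."""
    return (1.0 - np.sum(p ** alpha)) / (alpha - 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts = rng.zipf(1.5, size=10_000)   # synthetic heavy-tailed "stream" counts
    p = counts / counts.sum()             # empirical distribution
    print("Shannon:", shannon_entropy(p))
    for delta in (1e-1, 1e-2, 1e-3, 1e-4):
        a = 1.0 + delta                   # alpha = 1 + delta, delta small
        print(f"alpha={a}: Renyi={renyi_entropy(p, a):.6f}, "
              f"Tsallis={tsallis_entropy(p, a):.6f}")
```

Both entropies approach the Shannon value as delta shrinks; the point of the paper is that estimating the moment $F_\alpha$ from a stream (rather than exactly) introduces a variance-bias trade-off in this limit.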

Read also

The Jaccard index is an important similarity measure for item sets and Boolean data. On large datasets, an exact similarity computation is often infeasible for all item pairs, both due to time and space constraints, giving rise to faster approximate methods. The algorithm of choice used to quickly compute the Jaccard index $\frac{\vert A \cap B \vert}{\vert A \cup B \vert}$ of two item sets $A$ and $B$ is usually a form of min-hashing. Most min-hashing schemes are maintainable in data streams processing only additions, but none are known to work when facing item-wise deletions. In this paper, we investigate scalable approximation algorithms for rational set similarities, a broad class of similarity measures including Jaccard. Motivated by a result of Chierichetti and Kumar [J. ACM 2015], who showed that any rational set similarity $S$ admits a locality sensitive hashing (LSH) scheme if and only if the corresponding distance $1-S$ is a metric, we show that there exists a space-efficient summary maintaining a $(1\pm \varepsilon)$ multiplicative approximation to $1-S$ in dynamic data streams. This in turn also yields an $\varepsilon$ additive approximation of the similarity. The existence of these approximations hints at, but does not directly imply, an LSH scheme in dynamic data streams. Our second and main contribution lies in the design of such an LSH scheme maintainable in dynamic data streams. The scheme is space efficient, easy to implement, and, to the best of our knowledge, the first of its kind able to process deletions.
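The baseline this abstract refers to is ordinary min-hashing for insertion-only data. Below is a minimal illustrative sketch of that baseline (not the paper's deletion-capable LSH scheme); the hash family, signature length, and toy sets are assumptions chosen for illustration.

```python
import random

def minhash_signature(items, num_hashes=128, seed=0):
    """Classic (insertion-only) MinHash: one minimum over a random hash per row."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # large prime modulus for the a*x + b hash family
    params = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % p for x in items) for a, b in params]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates |A ∩ B| / |A ∪ B|."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

A = set(range(0, 1000))
B = set(range(500, 1500))
print("exact:  ", len(A & B) / len(A | B))   # 1/3
print("minhash:", estimate_jaccard(minhash_signature(A), minhash_signature(B)))
```

The scheme handles additions naturally (a new item can only lower each minimum) but not deletions, which is exactly the gap the paper addresses.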
Ping Li (2008)
Compressed Counting (CC) was recently proposed for very efficiently computing the (approximate) $\alpha$th frequency moments of data streams, where $0<\alpha \leq 2$. Several estimators were reported, including the geometric mean estimator, the harmonic mean estimator, the optimal power estimator, etc. The geometric mean estimator is particularly interesting for theoretical purposes. For example, when $\alpha \to 1$, the complexity of CC (using the geometric mean estimator) is $O(1/\epsilon)$, breaking the well-known large-deviation bound $O(1/\epsilon^2)$. The case $\alpha\approx 1$ has important applications, for example, computing the entropy of data streams. For practical purposes, this study proposes the optimal quantile estimator. Compared with previous estimators, this estimator is computationally more efficient and is also more accurate when $\alpha> 1$.
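Both CC and the baseline it improves upon project the count vector onto random $\alpha$-stable directions. The sketch below is only an illustration of that primitive: it draws symmetric $\alpha$-stable variables via the Chambers-Mallows-Stuck method and forms the projections. CC itself uses maximally skewed stable projections and carefully bias-corrected estimators (geometric mean, optimal quantile), which are omitted here; the Zipf counts, k = 200, and the median-based scale readout are illustrative assumptions.

```python
import numpy as np

def sample_symmetric_stable(alpha, size, rng):
    """Chambers-Mallows-Stuck sampler for symmetric alpha-stable variables (beta = 0)."""
    U = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    W = rng.exponential(1.0, size)                 # unit exponential
    return (np.sin(alpha * U) / np.cos(U) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * U) / W) ** ((1.0 - alpha) / alpha))

def project_stream(counts, alpha, k, seed=0):
    """Project a (dense) count vector onto k symmetric alpha-stable directions."""
    rng = np.random.default_rng(seed)
    R = sample_symmetric_stable(alpha, (len(counts), k), rng)
    return counts @ R   # each entry is alpha-stable with scale tied to F_alpha

counts = np.random.default_rng(1).zipf(1.5, size=5000).astype(float)
x = project_stream(counts, alpha=1.05, k=200)
# Crude (not bias-corrected) scale readout via the sample median of |x|;
# the paper's geometric-mean / optimal-quantile estimators add the proper correction.
print("rough scale estimate:", np.median(np.abs(x)))
```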
A sketch is a probabilistic data structure used to record frequencies of items in a multi-set. Sketches are widely used in various fields, especially those that involve processing and storing data streams. In streaming applications with high data rates, a sketch fills up very quickly. Thus, its contents are periodically transferred to the remote collector, which is responsible for answering queries. In this paper, we propose a new sketch, called Slim-Fat (SF) sketch, which has a significantly higher accuracy compared to prior art, a much smaller memory footprint, and at the same time achieves the same speed as the best prior sketch. The key idea behind our proposed SF-sketch is to maintain two separate sketches: a small sketch called Slim-subsketch and a large sketch called Fat-subsketch. The Slim-subsketch is periodically transferred to the remote collector for answering queries quickly and accurately. The Fat-subsketch, however, is not transferred to the remote collector because it is used only to assist the Slim-subsketch during the insertions and deletions and is not used to answer queries. We implemented and extensively evaluated SF-sketch along with several prior sketches and compared them side by side. Our experimental results show that SF-sketch outperforms the most widely used CM-sketch by up to 33.1 times in terms of accuracy. We have released the source codes of our proposed sketch as well as existing sketches at Github. The short version of this paper will appear in ICDE 2017.
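For context, the CM-sketch that SF-sketch is compared against works as follows. This is a minimal, generic implementation of the standard Count-Min sketch, not the SF-sketch with its Slim/Fat-subsketch design; the width, depth, and hash choice are illustrative assumptions.

```python
import random

class CountMinSketch:
    """Standard Count-Min sketch: d rows of w counters; a point query returns the row minimum."""
    def __init__(self, width=1024, depth=4, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.seeds = [rng.randrange(1 << 30) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        return hash((self.seeds[row], item)) % self.width

    def update(self, item, delta=1):
        for r in range(self.depth):
            self.table[r][self._index(r, item)] += delta

    def query(self, item):
        return min(self.table[r][self._index(r, item)] for r in range(self.depth))

cms = CountMinSketch()
for _ in range(1000):
    cms.update("flow-A")
cms.update("flow-B", 7)
print(cms.query("flow-A"), cms.query("flow-B"))   # >= true counts; errors are one-sided
```

Because every counter is shared by many items, hash collisions inflate the estimates; SF-sketch's contribution is to keep a small, accurate sketch for queries while a larger one absorbs the collisions during updates.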
Clustering is a fundamental tool for analyzing large data sets. A rich body of work has been devoted to designing data-stream algorithms for the relevant optimization problems such as $k$-center, $k$-median, and $k$-means. Such algorithms need to be both time and space efficient. In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on $n$ nodes, and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, $O(n\cdot \mathrm{polylog}\ n)$-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the quality of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problems. Unfortunately, the standard LP and SDP formulations are not obviously solvable in $O(n\cdot \mathrm{polylog}\ n)$ space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling.
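The quantity being approximated, the total weight of disagreements, is easy to state directly. The short sketch below evaluates it for a given partition on a toy graph; it is only the quality measure described in the abstract, not the paper's streaming algorithm, and the example graph and partition are made up.

```python
def disagreement_cost(edges, cluster_of):
    """Correlation-clustering disagreements of a node-partition:
    a positive-weight edge costs its weight if its endpoints are split,
    a negative-weight edge costs |weight| if its endpoints are together."""
    cost = 0.0
    for u, v, w in edges:
        same = cluster_of[u] == cluster_of[v]
        if w > 0 and not same:
            cost += w
        elif w < 0 and same:
            cost += -w
    return cost

# Toy weighted graph and a candidate partition (illustrative values).
edges = [("a", "b", 1.0), ("b", "c", -2.0), ("a", "c", 0.5), ("c", "d", 3.0)]
partition = {"a": 0, "b": 0, "c": 1, "d": 1}
print(disagreement_cost(edges, partition))   # 0.5: only edge (a, c) is violated
```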
Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating Jaccard similarity of sets and has been successfully used for many applications such as similarity search and large scale learning. Its two compress