
A Very Efficient Scheme for Estimating Entropy of Data Streams Using Compressed Counting

Posted by: Ping Li
Publication date: 2008
Research field: Informatics Engineering
Language: English
Author: Ping Li

Compressed Counting (CC) was recently proposed for approximating the $\alpha$th frequency moments of data streams, for $0<\alpha \leq 2$. Under the relaxed strict-Turnstile model, CC dramatically improves the standard algorithm based on {\em symmetric stable random projections}, especially as $\alpha\to 1$. A direct application of CC is to estimate the entropy, which is an important summary statistic in Web/network measurement and often serves as a crucial feature for data mining. The Rényi entropy and the Tsallis entropy are functions of the $\alpha$th frequency moments, and both approach the Shannon entropy as $\alpha\to 1$. A recent theoretical work suggested using the $\alpha$th frequency moment to approximate the Shannon entropy with $\alpha=1+\delta$ and very small $|\delta|$ (e.g., $|\delta|<10^{-4}$). In this study, we experiment with using CC to estimate frequency moments, Rényi entropy, Tsallis entropy, and Shannon entropy on real Web crawl data. We demonstrate the variance-bias trade-off in estimating Shannon entropy and provide practical recommendations. In particular, our experiments enable us to draw two important conclusions: (1) As $\alpha\to 1$, CC dramatically improves {\em symmetric stable random projections} in estimating frequency moments, Rényi entropy, Tsallis entropy, and Shannon entropy; the improvements appear to approach infinity. (2) Using {\em symmetric stable random projections} with $\alpha = 1+\delta$ and very small $|\delta|$ does not yield a practical algorithm because the required sample size is enormous.
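As a sanity check on the relations this abstract relies on, the following sketch (plain Python/NumPy, not the CC estimator itself) computes the Rényi and Tsallis entropies directly from exact frequency moments of a synthetic heavy-tailed count vector and shows both converging to the Shannon entropy as $\alpha = 1+\delta$ with shrinking $|\delta|$. The Zipf counts, the parameter 1.5, and the sample size are illustrative assumptions.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H = -sum p_i log p_i (natural log)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def renyi_entropy(p, alpha):
    """Renyi entropy H_alpha = log(sum p_i^alpha) / (1 - alpha)."""
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def tsallis_entropy(p, alpha):
    """Tsallis entropy T_alpha = (1 - sum p_i^alpha) / (alpha - 1)."""
    return (1.0 - np.sum(p ** alpha)) / (alpha - 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts = rng.zipf(1.5, size=10_000)   # synthetic heavy-tailed "stream" counts
    p = counts / counts.sum()             # empirical distribution
    print("Shannon:", shannon_entropy(p))
    for delta in (1e-1, 1e-2, 1e-3, 1e-4):
        a = 1.0 + delta                   # alpha = 1 + delta, delta small
        print(f"alpha={a}: Renyi={renyi_entropy(p, a):.6f}, "
              f"Tsallis={tsallis_entropy(p, a):.6f}")
```

Both entropies approach the Shannon value as delta shrinks; the point of the paper is that estimating the moment $F_\alpha$ from a stream (rather than exactly) introduces a variance-bias trade-off in this limit.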

Read also

The Jaccard index is an important similarity measure for item sets and Boolean data. On large datasets, an exact similarity computation is often infeasible for all item pairs, both due to time and space constraints, giving rise to faster approximate methods. The algorithm of choice used to quickly compute the Jaccard index $\frac{\vert A \cap B \vert}{\vert A \cup B \vert}$ of two item sets $A$ and $B$ is usually a form of min-hashing. Most min-hashing schemes are maintainable in data streams processing only additions, but none are known to work when facing item-wise deletions. In this paper, we investigate scalable approximation algorithms for rational set similarities, a broad class of similarity measures including Jaccard. Motivated by a result of Chierichetti and Kumar [J. ACM 2015], who showed that any rational set similarity $S$ admits a locality sensitive hashing (LSH) scheme if and only if the corresponding distance $1-S$ is a metric, we show that there exists a space-efficient summary maintaining a $(1\pm \varepsilon)$ multiplicative approximation to $1-S$ in dynamic data streams. This in turn also yields an $\varepsilon$ additive approximation of the similarity. The existence of these approximations hints at, but does not directly imply, an LSH scheme in dynamic data streams. Our second and main contribution lies in the design of such an LSH scheme maintainable in dynamic data streams. The scheme is space efficient, easy to implement, and, to the best of our knowledge, the first of its kind able to process deletions.
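The baseline this abstract refers to is ordinary min-hashing for insertion-only data. Below is a minimal illustrative sketch of that baseline (not the paper's deletion-capable LSH scheme); the hash family, signature length, and toy sets are assumptions chosen for illustration.

```python
import random

def minhash_signature(items, num_hashes=128, seed=0):
    """Classic (insertion-only) MinHash: one minimum over a random hash per row."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # large prime modulus for the a*x + b hash family
    params = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % p for x in items) for a, b in params]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates |A ∩ B| / |A ∪ B|."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

A = set(range(0, 1000))
B = set(range(500, 1500))
print("exact:  ", len(A & B) / len(A | B))   # 1/3
print("minhash:", estimate_jaccard(minhash_signature(A), minhash_signature(B)))
```

The scheme handles additions naturally (a new item can only lower each minimum) but not deletions, which is exactly the gap the paper addresses.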
Ping Li (2008)
Compressed Counting (CC) was recently proposed for very efficiently computing the (approximate) $\alpha$th frequency moments of data streams, where $0<\alpha \leq 2$. Several estimators were reported, including the geometric mean estimator, the harmonic mean estimator, the optimal power estimator, etc. The geometric mean estimator is particularly interesting for theoretical purposes. For example, when $\alpha \to 1$, the complexity of CC (using the geometric mean estimator) is $O(1/\epsilon)$, breaking the well-known large-deviation bound $O(1/\epsilon^2)$. The case $\alpha\approx 1$ has important applications, for example, computing the entropy of data streams. For practical purposes, this study proposes the optimal quantile estimator. Compared with previous estimators, this estimator is computationally more efficient and is also more accurate when $\alpha> 1$.
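Both CC and the baseline it improves upon project the count vector onto random $\alpha$-stable directions. The sketch below is only an illustration of that primitive: it draws symmetric $\alpha$-stable variables via the Chambers-Mallows-Stuck method and forms the projections. CC itself uses maximally skewed stable projections and carefully bias-corrected estimators (geometric mean, optimal quantile), which are omitted here; the Zipf counts, k = 200, and the median-based scale readout are illustrative assumptions.

```python
import numpy as np

def sample_symmetric_stable(alpha, size, rng):
    """Chambers-Mallows-Stuck sampler for symmetric alpha-stable variables (beta = 0)."""
    U = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    W = rng.exponential(1.0, size)                 # unit exponential
    return (np.sin(alpha * U) / np.cos(U) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * U) / W) ** ((1.0 - alpha) / alpha))

def project_stream(counts, alpha, k, seed=0):
    """Project a (dense) count vector onto k symmetric alpha-stable directions."""
    rng = np.random.default_rng(seed)
    R = sample_symmetric_stable(alpha, (len(counts), k), rng)
    return counts @ R   # each entry is alpha-stable with scale tied to F_alpha

counts = np.random.default_rng(1).zipf(1.5, size=5000).astype(float)
x = project_stream(counts, alpha=1.05, k=200)
# Crude (not bias-corrected) scale readout via the sample median of |x|;
# the paper's geometric-mean / optimal-quantile estimators add the proper correction.
print("rough scale estimate:", np.median(np.abs(x)))
```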
A sketch is a probabilistic data structure used to record frequencies of items in a multi-set. Sketches are widely used in various fields, especially those that involve processing and storing data streams. In streaming applications with high data rates, a sketch fills up very quickly. Thus, its contents are periodically transferred to the remote collector, which is responsible for answering queries. In this paper, we propose a new sketch, called Slim-Fat (SF) sketch, which has a significantly higher accuracy compared to prior art, a much smaller memory footprint, and at the same time achieves the same speed as the best prior sketch. The key idea behind our proposed SF-sketch is to maintain two separate sketches: a small sketch called Slim-subsketch and a large sketch called Fat-subsketch. The Slim-subsketch is periodically transferred to the remote collector for answering queries quickly and accurately. The Fat-subsketch, however, is not transferred to the remote collector because it is used only to assist the Slim-subsketch during the insertions and deletions and is not used to answer queries. We implemented and extensively evaluated SF-sketch along with several prior sketches and compared them side by side. Our experimental results show that SF-sketch outperforms the most widely used CM-sketch by up to 33.1 times in terms of accuracy. We have released the source codes of our proposed sketch as well as existing sketches at Github. The short version of this paper will appear in ICDE 2017.
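For context, the CM-sketch that SF-sketch is compared against works as follows. This is a minimal, generic implementation of the standard Count-Min sketch, not the SF-sketch with its Slim/Fat-subsketch design; the width, depth, and hash choice are illustrative assumptions.

```python
import random

class CountMinSketch:
    """Standard Count-Min sketch: d rows of w counters; a point query returns the row minimum."""
    def __init__(self, width=1024, depth=4, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.seeds = [rng.randrange(1 << 30) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        return hash((self.seeds[row], item)) % self.width

    def update(self, item, delta=1):
        for r in range(self.depth):
            self.table[r][self._index(r, item)] += delta

    def query(self, item):
        return min(self.table[r][self._index(r, item)] for r in range(self.depth))

cms = CountMinSketch()
for _ in range(1000):
    cms.update("flow-A")
cms.update("flow-B", 7)
print(cms.query("flow-A"), cms.query("flow-B"))   # >= true counts; errors are one-sided
```

Because every counter is shared by many items, hash collisions inflate the estimates; SF-sketch's contribution is to keep a small, accurate sketch for queries while a larger one absorbs the collisions during updates.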
Clustering is a fundamental tool for analyzing large data sets. A rich body of work has been devoted to designing data-stream algorithms for the relevant optimization problems such as $k$-center, $k$-median, and $k$-means. Such algorithms need to be both time and space efficient. In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on $n$ nodes, and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, $O(n\cdot \mathrm{polylog}\ n)$-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the quality of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problems. Unfortunately, the standard LP and SDP formulations are not obviously solvable in $O(n\cdot \mathrm{polylog}\ n)$ space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling.
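The quantity being approximated, the total weight of disagreements, is easy to state directly. The short sketch below evaluates it for a given partition on a toy graph; it is only the quality measure described in the abstract, not the paper's streaming algorithm, and the example graph and partition are made up.

```python
def disagreement_cost(edges, cluster_of):
    """Correlation-clustering disagreements of a node-partition:
    a positive-weight edge costs its weight if its endpoints are split,
    a negative-weight edge costs |weight| if its endpoints are together."""
    cost = 0.0
    for u, v, w in edges:
        same = cluster_of[u] == cluster_of[v]
        if w > 0 and not same:
            cost += w
        elif w < 0 and same:
            cost += -w
    return cost

# Toy weighted graph and a candidate partition (illustrative values).
edges = [("a", "b", 1.0), ("b", "c", -2.0), ("a", "c", 0.5), ("c", "d", 3.0)]
partition = {"a": 0, "b": 0, "c": 1, "d": 1}
print(disagreement_cost(edges, partition))   # 0.5: only edge (a, c) is violated
```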
Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating Jaccard similarity of sets and has been successfully used for many applications such as similarity search and large scale learning. Its two compress