The Optimal Quantile Estimator for Compressed Counting

73 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Ping Li

تاريخ النشر 2008

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Ping Li

بنى وهياكل البيانات والخوارزميات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Compressed Counting (CC) was recently proposed for very efficiently computing the (approximate) $alpha$th frequency moments of data streams, where $0<alpha <= 2$. Several estimators were reported including the geometric mean estimator, the harmonic mean estimator, the optimal power estimator, etc. The geometric mean estimator is particularly interesting for theoretical purposes. For example, when $alpha -> 1$, the complexity of CC (using the geometric mean estimator) is $O(1/epsilon)$, breaking the well-known large-deviation bound $O(1/epsilon^2)$. The case $alphaapprox 1$ has important applications, for example, computing entropy of data streams. For practical purposes, this study proposes the optimal quantile estimator. Compared with previous estimators, this estimator is computationally more efficient and is also more accurate when $alpha> 1$.

قيم البحث

173 - Ping Li 2008

Compressed Counting (CC)} was recently proposed for approximating the $alpha$th frequency moments of data streams, for $0<alpha leq 2$. Under the relaxed strict-Turnstile model, CC dramatically improves the standard algorithm based on symmetric stabl e random projections}, especially as $alphato 1$. A direct application of CC is to estimate the entropy, which is an important summary statistic in Web/network measurement and often serves a crucial feature for data mining. The Renyi entropy and the Tsallis entropy are functions of the $alpha$th frequency moments; and both approach the Shannon entropy as $alphato 1$. A recent theoretical work suggested using the $alpha$th frequency moment to approximate the Shannon entropy with $alpha=1+delta$ and very small $|delta|$ (e.g., $<10^{-4}$). In this study, we experiment using CC to estimate frequency moments, Renyi entropy, Tsallis entropy, and Shannon entropy, on real Web crawl data. We demonstrate the variance-bias trade-off in estimating Shannon entropy and provide practical recommendations. In particular, our experiments enable us to draw some important conclusions: (1) As $alphato 1$, CC dramatically improves {em symmetric stable random projections} in estimating frequency moments, Renyi entropy, Tsallis entropy, and Shannon entropy. The improvements appear to approach infinity. (2) Using {em symmetric stable random projections} and $alpha = 1+delta$ with very small $|delta|$ does not provide a practical algorithm because the required sample size is enormous.

بنى وهياكل البيانات والخوارزميات

A quantile regression estimator for censored data

136 - Chenlei Leng , Xingwei Tong 2013

We propose a censored quantile regression estimator motivated by unbiased estimating equations. Under the usual conditional independence assumption of the survival time and the censoring time given the covariates, we show that the proposed estimator is consistent and asymptotically normal. We develop an efficient computational algorithm which uses existing quantile regression code. As a result, bootstrap-type inference can be efficiently implemented. We illustrate the finite-sample performance of the proposed method by simulation studies and analysis of a survival data set.

نظرية الإحصاء نظرية الإحصاء

Towards a Decomposition-Optimal Algorithm for Counting and Sampling Arbitrary Motifs in Sublinear Time

115 - Amartya Shankha Biswas , Talya Eden , Ronitt Rubinfeld 2021

We consider the problem of sampling and approximately counting an arbitrary given motif $H$ in a graph $G$, where access to $G$ is given via queries: degree, neighbor, and pair, as well as uniform edge sample queries. Previous algorithms for these ta sks were based on a decomposition of $H$ into a collection of odd cycles and stars, denoted $mathcal{D}^*(H)={O_{k_1}, ldots, O_{k_q}, S_{p_1}, ldots, S_{p_ell}}$. These algorithms were shown to be optimal for the case where $H$ is a clique or an odd-length cycle, but no other lower bounds were known. We present a new algorithm for sampling and approximately counting arbitrary motifs which, up to $textrm{poly}(log n)$ factors, is always at least as good as previous results, and for most graphs $G$ is strictly better. The main ingredient leading to this improvement is an improved uniform algorithm for sampling stars, which might be of independent interest, as it allows to sample vertices according to the $p$-th moment of the degree distribution. Finally, we prove that this algorithm is emph{decomposition-optimal} for decompositions that contain at least one odd cycle. These are the first lower bounds for motifs $H$ with a nontrivial decomposition, i.e., motifs that have more than a single component in their decomposition.

بنى وهياكل البيانات والخوارزميات

Almost Optimal Bounds for Sublinear-Time Sampling of $k$-Cliques: Sampling Cliques is Harder Than Counting

100 - Talya Eden , Dana Ron , Will Rosenbaum 2020

In this work, we consider the problem of sampling a $k$-clique in a graph from an almost uniform distribution in sublinear time in the general graph query model. Specifically the algorithm should output each $k$-clique with probability $(1pm epsilon) /n_k$, where $n_k$ denotes the number of $k$-cliques in the graph and $epsilon$ is a given approximation parameter. We prove that the query complexity of this problem is [ Theta^*left(maxleft{ left(frac{(nalpha)^{k/2}}{ n_k}right)^{frac{1}{k-1}} ,; minleft{nalpha,frac{nalpha^{k-1}}{n_k} right}right}right). ] where $n$ is the number of vertices in the graph, $alpha$ is its arboricity, and $Theta^*$ suppresses the dependence on $(log n/epsilon)^{O(k)}$. Interestingly, this establishes a separation between approximate counting and approximate uniform sampling in the sublinear regime. For example, if $k=3$, $alpha = O(1)$, and $n_3$ (the number of triangles) is $Theta(n)$, then we get a lower bound of $Omega(n^{1/4})$ (for constant $epsilon$), while under these conditions, a $(1pm epsilon)$-approximation of $n_3$ can be obtained by performing $textrm{poly}(log(n/epsilon))$ queries (Eden, Ron and Seshadhri, SODA20). Our lower bound follows from a construction of a family of graphs with arboricity $alpha$ such that in each graph there are $n_k$ cliques (of size $k$), where one of these cliques is hidden and hence hard to sample. Our upper bound is based on defining a special auxiliary graph $H_k$, such that sampling edges almost uniformly in $H_k$ translates to sampling $k$-cliques almost uniformly in the original graph $G$. We then build on a known edge-sampling algorithm (Eden, Ron and Rosenbaum, ICALP19) to sample edges in $H_k$, where the challenge is simulate queries to $H_k$ while being given access only to $G$.

بنى وهياكل البيانات والخوارزميات

Improved Algorithms for Adaptive Compressed Sensing

228 - Vasileios Nakos , Xiaofei Shi , David P. Woodruff 2018

In the problem of adaptive compressed sensing, one wants to estimate an approximately $k$-sparse vector $xinmathbb{R}^n$ from $m$ linear measurements $A_1 x, A_2 x,ldots, A_m x$, where $A_i$ can be chosen based on the outcomes $A_1 x,ldots, A_{i-1} x $ of previous measurements. The goal is to output a vector $hat{x}$ for which $$|x-hat{x}|_p le C cdot min_{ktext{-sparse } x} |x-x|_q,$$ with probability at least $2/3$, where $C > 0$ is an approximation factor. Indyk, Price and Woodruff (FOCS11) gave an algorithm for $p=q=2$ for $C = 1+epsilon$ with $Oh((k/epsilon) loglog (n/k))$ measurements and $Oh(log^*(k) loglog (n))$ rounds of adaptivity. We first improve their bounds, obtaining a scheme with $Oh(k cdot loglog (n/k) +(k/epsilon) cdot loglog(1/epsilon))$ measurements and $Oh(log^*(k) loglog (n))$ rounds, as well as a scheme with $Oh((k/epsilon) cdot loglog (nlog (n/k)))$ measurements and an optimal $Oh(loglog (n))$ rounds. We then provide novel adaptive compressed sensing schemes with improved bounds for $(p,p)$ for every $0 < p < 2$. We show that the improvement from $O(k log(n/k))$ measurements to $O(k log log (n/k))$ measurements in the adaptive setting can persist with a better $epsilon$-dependence for other values of $p$ and $q$. For example, when $(p,q) = (1,1)$, we obtain $O(frac{k}{sqrt{epsilon}} cdot log log n log^3 (frac{1}{epsilon}))$ measurements.

بنى وهياكل البيانات والخوارزميات نظرية المعلومات نظرية المعلومات

سجل دخول لتتمكن من نشر تعليقات