ﻻ يوجد ملخص باللغة العربية
From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size $k$ that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, $varoptk$, that dominates all previous schemes in terms of estimation quality. $varoptk$ provides {em variance optimal unbiased estimation of subset sums}. More precisely, if we have seen $n$ items of the stream, then for {em any} subset size $m$, our scheme based on $k$ samples minimizes the average variance over all subsets of size $m$. In fact, the optimality is against any off-line scheme with $k$ samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of {em particular} subsets than previously possible. It is efficient, handling each new item of the stream in $O(log k)$ time. Finally, it is particularly well suited for combination of samples from different streams in a distributed setting.
This is paper introduces a new single-pass reservoir weighted-sampling stream aggregation algorithm, Priority-Based Aggregation (PBA). While order sampling is a powerful and e cient method for weighted sampling from a stream of uniquely keyed items,
We introduce Density sketches (DS): a succinct online summary of the data distribution. DS can accurately estimate point wise probability density. Interestingly, DS also provides a capability to sample unseen novel data from the underlying data distr
We develop novel techniques which allow us to prove a diverse range of results relating to subset sums and complete sequences of positive integers, including solutions to several longstanding open problems. These include: solutions to the three probl
In this work, we consider the problem of sampling a $k$-clique in a graph from an almost uniform distribution in sublinear time in the general graph query model. Specifically the algorithm should output each $k$-clique with probability $(1pm epsilon)
Given a point set $Psubset mathbb{R}^d$, a kernel density estimation for Gaussian kernel is defined as $overline{mathcal{G}}_P(x) = frac{1}{left|Pright|}sum_{pin P}e^{-leftlVert x-p rightrVert^2}$ for any $xinmathbb{R}^d$. We study how to construct a