Density Sketches for Sampling and Estimation

Added by Aditya Desai

Publication date 2021

Fields: Informatics Engineering

Research language: English

Authors Aditya Desai - Benjamin Coleman - Anshumali Shrivastava

Abstract

We introduce density sketches (DS): a succinct online summary of the data distribution. DS can accurately estimate pointwise probability density. DS also makes it possible to sample unseen, novel data from the underlying data distribution. Thus, analogous to popular generative models, DS allows us to succinctly replace the real data in almost all machine learning pipelines with synthetic examples drawn from the same distribution as the original data. However, unlike generative models, which do not have any statistical guarantees, DS leads to theoretically sound, asymptotically converging, consistent estimators of the underlying density function. Density sketches also have many appealing properties that make them ideal for large-scale distributed applications. DS construction is an online algorithm. The sketches are additive, i.e., the sum of two sketches is the sketch of the combined data. These properties allow data to be collected from distributed sources, compressed into a density sketch, efficiently transmitted in sketch form to a central server, merged, and resampled into a synthetic database for modeling applications. Thus, density sketches can potentially revolutionize how we store, communicate, and distribute data.
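The additivity and resampling properties described in the abstract can be illustrated with a toy one-dimensional histogram. This is a minimal sketch under stated assumptions, not the authors' construction: the class `HistogramSketch`, its fixed `bin_width`, and the uniform within-bin sampling are hypothetical choices made for illustration, and nothing here reproduces the paper's compression or its guarantees.

```python
import random
from collections import Counter


class HistogramSketch:
    """Toy 1-D density summary (hypothetical, for illustration only).

    Keeps one counter per fixed-width bin. It can be updated online,
    merged by adding counters, queried for a density estimate, and
    resampled to produce synthetic points.
    """

    def __init__(self, bin_width=0.5):
        self.bin_width = bin_width
        self.counts = Counter()
        self.n = 0

    def _bin(self, x):
        # Floor division maps x to the index of the bin that contains it.
        return int(x // self.bin_width)

    def update(self, x):
        """Online update: process one data point and discard it."""
        self.counts[self._bin(x)] += 1
        self.n += 1

    def merge(self, other):
        """Additivity: the merged sketch is the sum of the bin counters."""
        if other.bin_width != self.bin_width:
            raise ValueError("sketches must share the same bin width")
        merged = HistogramSketch(self.bin_width)
        merged.counts = self.counts + other.counts
        merged.n = self.n + other.n
        return merged

    def density(self, x):
        """Histogram density estimate at x: count / (n * bin_width)."""
        if self.n == 0:
            return 0.0
        return self.counts[self._bin(x)] / (self.n * self.bin_width)

    def sample(self, rng):
        """Draw a synthetic point: pick a bin by mass, then a uniform offset."""
        bins = list(self.counts)
        weights = [self.counts[b] for b in bins]
        b = rng.choices(bins, weights=weights, k=1)[0]
        return (b + rng.random()) * self.bin_width


if __name__ == "__main__":
    rng = random.Random(0)
    site_a, site_b = HistogramSketch(), HistogramSketch()
    for _ in range(500):
        site_a.update(rng.gauss(0.0, 1.0))   # data seen at one source
        site_b.update(rng.gauss(0.0, 1.0))   # data seen at another source
    central = site_a.merge(site_b)           # merged at a central server
    print(central.density(0.0), central.sample(rng))
```

In this toy version, merging two sketches built on disjoint halves of a dataset gives exactly the same counters as one sketch built on the whole dataset, which is the additivity property the abstract refers to.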

Projection-Cost-Preserving Sketches: Proof Strategies and Constructions

Cameron Musco, Christopher Musco (2020)

In this note we illustrate how common matrix approximation methods, such as random projection and random sampling, yield projection-cost-preserving sketches, as introduced in [FSS13, CEM+15]. A projection-cost-preserving sketch is a matrix approximation which, for a given parameter $k$, approximately preserves the distance of the target matrix to all $k$-dimensional subspaces. Such sketches have applications to scalable algorithms for linear algebra, data science, and machine learning. Our goal is to simplify the presentation of proof techniques introduced in [CEM+15] and [CMM17] so that they can serve as a guide for future work. We also refer the reader to [CYD19], which gives a similar simplified exposition of the proof covered in Section 2.

Data Structures and Algorithms, Machine Learning, Numerical Analysis

Histogram Transform Ensembles for Density Estimation

Hanyuan Hang (2019)

We investigate an algorithm named the histogram transform ensembles (HTE) density estimator, whose effectiveness is supported by both solid theoretical analysis and significant experimental performance. On the theoretical side, by decomposing the error term into approximation error and estimation error, we are able to conduct the following analysis. First, we establish universal consistency under the $L_1(\mu)$-norm. Second, under the assumption that the underlying density function resides in the Hölder space $C^{0,\alpha}$, we prove almost optimal convergence rates for both single and ensemble density estimators under the $L_1(\mu)$-norm and the $L_{\infty}(\mu)$-norm for different tail distributions. In contrast, for its subspace $C^{1,\alpha}$ consisting of smoother functions, almost optimal convergence rates can only be established for the ensembles, and the lower bound for the single estimators illustrates the benefits of ensembles over single density estimators. In the experiments, we first carry out simulations to illustrate that histogram transform ensembles surpass single histogram transforms, which offers strong evidence to support the theoretical results in the space $C^{1,\alpha}$. Moreover, to further improve experimental performance, we propose an adaptive version of HTE and study the parameters by generating several synthetic datasets with diverse dimensions and distributions. Finally, real-data experiments comparing against other state-of-the-art density estimators demonstrate the accuracy of the adaptive HTE algorithm.

Statistics Theory, Machine Learning

Stream sampling for variance-optimal estimation of subset sums

Edith Cohen, Nick Duffield, Haim Kaplan (2010)

From a high-volume stream of weighted items, we want to maintain a generic sample of a certain limited size $k$ that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, $\mathrm{VarOpt}_k$, that dominates all previous schemes in terms of estimation quality. $\mathrm{VarOpt}_k$ provides variance optimal unbiased estimation of subset sums. More precisely, if we have seen $n$ items of the stream, then for any subset size $m$, our scheme based on $k$ samples minimizes the average variance over all subsets of size $m$. In fact, the optimality is against any off-line scheme with $k$ samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in $O(\log k)$ time. Finally, it is particularly well suited for combination of samples from different streams in a distributed setting.
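For readers unfamiliar with the reservoir-sampling setting this abstract builds on, the following is a minimal sketch of the classic uniform reservoir sampler together with a scaled subset-sum estimate. It is not $\mathrm{VarOpt}_k$ and makes no variance-optimality claim: it ignores weights when choosing the sample, and the function names `reservoir_sample` and `estimate_subset_sum` as well as the `(key, weight)` item layout are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of at most k items from a stream.

    Each item is assumed to be a (key, weight) pair. Returns the sample
    and the number of items seen, using O(k) memory and one pass.
    """
    reservoir = []
    n = 0
    for item in stream:
        n += 1
        if len(reservoir) < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(n)            # uniform index in [0, n)
            if j < k:
                reservoir[j] = item         # replace with probability k/n
    return reservoir, n


def estimate_subset_sum(reservoir, n, in_subset):
    """Estimate the total weight of items whose key satisfies in_subset.

    Every item is in the sample with probability len(reservoir)/n, so
    sampled weights are scaled up by n/len(reservoir).
    """
    if not reservoir:
        return 0.0
    scale = n / len(reservoir)
    return scale * sum(w for key, w in reservoir if in_subset(key))


if __name__ == "__main__":
    rng = random.Random(1)
    stream = [(i, 1.0 + (i % 5)) for i in range(10_000)]
    sample, seen = reservoir_sample(stream, 200, rng)
    approx = estimate_subset_sum(sample, seen, lambda key: key % 2 == 0)
    exact = sum(w for key, w in stream if key % 2 == 0)
    print(approx, exact)
```

Uniform sampling treats heavy and light items alike, which is the kind of inefficiency that weight-aware schemes such as the one in the abstract are designed to avoid.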

Data Structures and Algorithms

Uncertainty estimation for molecular dynamics and sampling

Giulio Imbalzano, Yongbin Zhuang, Venkat Kapil (2020)

Machine learning models have emerged as a very effective strategy to sidestep time-consuming electronic-structure calculations, enabling accurate simulations of greater size, time scale and complexity. Given the interpolative nature of these models, the reliability of predictions depends on the position in phase space, and it is crucial to obtain an estimate of the error that derives from the finite number of reference structures included during the training of the model. When using a machine-learning potential to sample a finite-temperature ensemble, the uncertainty on individual configurations translates into an error on thermodynamic averages, and provides an indication for the loss of accuracy when the simulation enters a previously unexplored region. Here we discuss how uncertainty quantification can be used, together with a baseline energy model, or a more robust although less accurate interatomic potential, to obtain more resilient simulations and to support active-learning strategies. Furthermore, we introduce an on-the-fly reweighing scheme that makes it possible to estimate the uncertainty in the thermodynamic averages extracted from long trajectories. We present examples covering different types of structural and thermodynamic properties, and systems as diverse as water and liquid gallium.

Chemical Physics, Machine Learning, Computational Physics

Optimal Coreset for Gaussian Kernel Density Estimation

Wai Ming Tai (2020)

Given a point set $P \subset \mathbb{R}^d$, a kernel density estimation for the Gaussian kernel is defined as $\overline{\mathcal{G}}_P(x) = \frac{1}{\left|P\right|}\sum_{p \in P} e^{-\left\lVert x-p \right\rVert^2}$ for any $x \in \mathbb{R}^d$. We study how to construct a small subset $Q$ of $P$ such that the kernel density estimation of $P$ can be approximated by the kernel density estimation of $Q$. This subset $Q$ is called a coreset. The primary technique in this work is to construct a $\pm 1$ coloring on the point set $P$ by discrepancy theory and apply this coloring algorithm recursively. Our result leverages Banaszczyk's Theorem. When $d>1$ is constant, our construction gives a coreset of size $O\left(\frac{1}{\varepsilon}\right)$ as opposed to the best-known result of $O\left(\frac{1}{\varepsilon}\sqrt{\log\frac{1}{\varepsilon}}\right)$. It is the first to break the barrier of the $\sqrt{\log}$ factor, even when $d=2$.
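The kernel density definition in this abstract can be evaluated directly. The following minimal sketch computes $\overline{\mathcal{G}}_P(x)$ as defined above and measures how far a subset $Q$ is from $P$ on a set of query points. The subset here is a plain uniform random sample used as a naive baseline; it is not the discrepancy-based coreset construction of the paper, and the helper names `gaussian_kde`, `uniform_subset`, and `max_error` are hypothetical.

```python
import math
import random


def gaussian_kde(points, x):
    """Evaluate (1/|P|) * sum over p in P of exp(-||x - p||^2)."""
    if not points:
        raise ValueError("point set must be non-empty")
    total = 0.0
    for p in points:
        sq_dist = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
        total += math.exp(-sq_dist)
    return total / len(points)


def uniform_subset(points, size, rng):
    """Naive baseline: a uniform random subset (NOT the paper's coreset)."""
    return rng.sample(points, size)


def max_error(points, subset, queries):
    """Largest absolute gap between the two estimates over the queries."""
    return max(
        abs(gaussian_kde(points, q) - gaussian_kde(subset, q))
        for q in queries
    )


if __name__ == "__main__":
    rng = random.Random(2)
    P = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(2000)]
    Q = uniform_subset(P, 100, rng)
    queries = [(rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0))
               for _ in range(50)]
    print(max_error(P, Q, queries))
```

The error is only checked on a finite set of query points here, whereas a coreset guarantee in the sense of the abstract bounds the gap for every $x \in \mathbb{R}^d$.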

Data Structures and Algorithms, Computational Geometry, Machine Learning
