We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream under the relaxed strict-turnstile model, when space limitations make exact computation infeasible. A related measure of entropy is the Rényi entropy, which depends on a parameter $\alpha$. This quantity can be estimated efficiently and without bias from a low-dimensional synopsis called an $\alpha$-stable data sketch via the method of compressed counting. An approximation to the Shannon entropy can be obtained from the Rényi entropy by taking $\alpha$ sufficiently close to 1. However, practical guidelines for parameter calibration with respect to $\alpha$ are lacking. We avoid this problem by showing that the random variables used in estimating the Rényi entropy can be transformed to have a proper distributional limit as $\alpha$ approaches 1: the maximally skewed, strictly stable distribution with $\alpha = 1$ defined on the entire real line. We propose a family of asymptotically unbiased log-mean estimators of the Shannon entropy, indexed by a constant $\zeta > 0$, that can be computed in a single-pass algorithm to provide an additive approximation. We recommend the log-mean estimator with $\zeta = 1$, which has exponentially decreasing tail bounds on the error probability, an asymptotic relative efficiency of 0.932, and near-optimal computational complexity.
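The streaming $\alpha$-stable sketch and the log-mean estimator themselves are not reproduced here; the short NumPy sketch below only illustrates the limiting relationship the abstract relies on, namely that the Rényi entropy of a frequency vector converges to the Shannon entropy as $\alpha \to 1$, computed directly on an explicit (non-streamed) frequency vector. Function names and the synthetic frequencies are illustrative only.

```python
import numpy as np

def shannon_entropy(freqs):
    """Empirical Shannon entropy of a frequency vector."""
    p = freqs / freqs.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def renyi_entropy(freqs, alpha):
    """Empirical Renyi entropy of order alpha (alpha != 1)."""
    p = freqs / freqs.sum()
    p = p[p > 0]
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

rng = np.random.default_rng(0)
freqs = rng.integers(1, 1000, size=10_000).astype(float)  # synthetic counts

H = shannon_entropy(freqs)
for alpha in (1.1, 1.01, 1.001):
    # Renyi entropy approaches the Shannon entropy as alpha -> 1
    print(f"alpha={alpha}: Renyi={renyi_entropy(freqs, alpha):.4f}, Shannon={H:.4f}")
```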
In this note we provide explicit expressions and expansions for a special function that appears in nonparametric estimation of log-densities. This function returns the integral of a log-linear function over a simplex of arbitrary dimension. In particular it is used in the R package LogConcDEAD by Cule et al. (2007).
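The note's explicit expansions are not reproduced here. As a hedged illustration of the quantity involved, the sketch below evaluates the integral of $\exp(\sum_i \lambda_i y_i)$ over the standard simplex using the classical Hermite-Genocchi identity, which for pairwise distinct $y_i$ gives the divided difference of $\exp$ at the nodes, and checks it by Monte Carlo; this is a standard identity, not necessarily the exact expansion derived in the note.

```python
import numpy as np
from math import factorial

def simplex_exp_integral(y):
    """Integral of exp(sum_i lambda_i * y_i) over the standard d-simplex
    parameterized by (lambda_1, ..., lambda_d) with lambda_{d+1} = 1 - sum.
    Assumes pairwise distinct y_i; by the Hermite-Genocchi formula this is
    the divided difference of exp at the nodes y_1, ..., y_{d+1}."""
    y = np.asarray(y, dtype=float)
    total = 0.0
    for i, yi in enumerate(y):
        total += np.exp(yi) / np.prod(yi - np.delete(y, i))
    return total

# Monte Carlo check: the uniform law on the simplex is Dirichlet(1, ..., 1),
# and the parameterized region has volume 1/d!, hence the division below.
rng = np.random.default_rng(1)
y = np.array([0.3, -1.2, 2.0, 0.7])
lam = rng.dirichlet(np.ones(len(y)), size=200_000)
mc = np.exp(lam @ y).mean() / factorial(len(y) - 1)
print(simplex_exp_integral(y), mc)
```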
Recently, a new algorithm for sampling posteriors of unnormalised probability densities, called ABC Shadow, was proposed in [8]. This talk introduces a global optimisation procedure based on the ABC Shadow simulation dynamics. First the general method is explained, and then results on simulated and real data are presented. The method is rather general, in the sense that it applies to probability densities that are continuously differentiable with respect to their parameters.
Mixture models are regularly used in density estimation applications, but the problem of estimating the mixing distribution remains a challenge. Nonparametric maximum likelihood produces estimates of the mixing distribution that are discrete, and these may be hard to interpret when the true mixing distribution is believed to have a smooth density. In this paper, we investigate an algorithm that produces a sequence of smooth estimates and that has been conjectured to converge to the nonparametric maximum likelihood estimator. Here we give a rigorous proof of this conjecture and propose a new data-driven stopping rule that produces smooth near-maximum likelihood estimates of the mixing density; simulations demonstrate the strong empirical performance of this estimator.
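The abstract does not spell out the algorithm. Purely as an illustration of iterations of this flavour, the sketch below runs a classical EM-type fixed-point update for a mixing density represented on a fine grid, assuming a $N(u,1)$ kernel; the function name, kernel, and discretization are assumptions, and this may well differ from the algorithm actually analysed in the paper.

```python
import numpy as np
from scipy.stats import norm

def em_mixing_density(x, grid, f0, n_iter=200):
    """EM-type fixed-point updates for a mixing density on a grid, assuming a
    N(u, 1) kernel:  f_{t+1}(u) = f_t(u) * (1/n) sum_i k(x_i | u) / (K f_t)(x_i).
    Illustrative sketch only."""
    du = grid[1] - grid[0]
    K = norm.pdf(x[:, None], loc=grid[None, :], scale=1.0)  # n x m kernel matrix
    f = f0.copy()
    for _ in range(n_iter):
        mix = K @ (f * du)                    # estimated marginal density at each x_i
        f = f * (K / mix[:, None]).mean(axis=0)
        f /= f.sum() * du                     # guard against numerical drift in total mass
    return f

rng = np.random.default_rng(2)
theta = rng.normal(0.0, 2.0, size=500)        # latent draws from a smooth mixing density
x = theta + rng.normal(size=500)              # observed data
grid = np.linspace(-8.0, 8.0, 401)
f_hat = em_mixing_density(x, grid, f0=np.full(grid.size, 1 / 16.0))
```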
Principal component analysis (PCA) is fundamental to statistical machine learning. It extracts latent principal factors that account for the most variation in the data. When data are stored across multiple machines, however, communication costs can prohibit computing the PCA in a central location, and distributed algorithms for PCA are thus needed. This paper proposes and studies a distributed PCA algorithm: each node machine computes the top $K$ eigenvectors and transmits them to the central server; the central server then aggregates the information from all the node machines and conducts a PCA based on the aggregated information. We investigate the bias and variance of the resulting distributed estimator of the top $K$ eigenvectors. In particular, we show that for distributions with symmetric innovations, the empirical top eigenspaces are unbiased and hence the distributed PCA is unbiased. We derive the rate of convergence for distributed PCA estimators, which depends explicitly on the effective rank of the covariance matrix, the eigen-gap, and the number of machines. We show that when the number of machines is not unreasonably large, the distributed PCA performs as well as whole-sample PCA, even without full access to the whole data. The theoretical results are verified by an extensive simulation study. We also extend our analysis to the heterogeneous case, where the population covariance matrices differ across local machines but share similar top eigen-structures.
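As a minimal sketch of this node-machine/central-server architecture, the NumPy code below aggregates by averaging the local rank-$K$ projection matrices $V V^\top$ and taking the top-$K$ eigenvectors of the average; this is one common aggregation rule, and the paper's exact aggregation scheme may differ. The data-generating example is synthetic and illustrative.

```python
import numpy as np

def local_top_eigvecs(X, K):
    """Top-K eigenvectors of the sample covariance on one node machine."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    vals, vecs = np.linalg.eigh(cov)
    return vecs[:, np.argsort(vals)[::-1][:K]]          # d x K

def distributed_pca(node_data, K):
    """Central server: average the local projection matrices V V^T and
    return the top-K eigenvectors of the average."""
    d = node_data[0].shape[1]
    P = np.zeros((d, d))
    for X in node_data:
        V = local_top_eigvecs(X, K)                     # transmitted by each node
        P += V @ V.T
    P /= len(node_data)
    vals, vecs = np.linalg.eigh(P)
    return vecs[:, np.argsort(vals)[::-1][:K]]

rng = np.random.default_rng(3)
nodes = [rng.multivariate_normal(np.zeros(5), np.diag([5, 4, 1, 1, 1]), size=400)
         for _ in range(10)]
V_hat = distributed_pca(nodes, K=2)
```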
We consider a Bayesian hierarchical version of the normal theory general linear model which is practically relevant in the sense that it is general enough to have many applications and it is not straightforward to sample directly from the corresponding posterior distribution. Thus we study a block Gibbs sampler that has the posterior as its invariant distribution. In particular, we establish that the Gibbs sampler converges at a geometric rate. This allows us to establish conditions for a central limit theorem for the ergodic averages used to estimate features of the posterior. Geometric ergodicity is also a key component for using batch means methods to consistently estimate the variance of the asymptotic normal distribution. Together, our results give practitioners the tools to be as confident in inferences based on the observations from the Gibbs sampler as they would be with inferences based on random samples from the posterior. Our theoretical results are illustrated with an application to data on the cost of health plans issued by health maintenance organizations.
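The Gibbs sampler and the hierarchical model are not reproduced here. The sketch below only illustrates the batch means estimator referred to in the abstract, i.e. a consistent estimate of the asymptotic variance $\sigma^2$ in the Markov chain CLT $\sqrt{n}(\bar g_n - E g) \to N(0, \sigma^2)$, applied to a generic scalar chain; the AR(1) chain used for the check is a stand-in whose asymptotic variance is known in closed form, not output from the paper's sampler.

```python
import numpy as np

def batch_means_variance(g_chain, n_batches=30):
    """Batch means estimate of the asymptotic variance sigma^2 in the MCMC CLT:
    split the chain into non-overlapping batches of size b and return
    b times the sample variance of the batch means."""
    g = np.asarray(g_chain, dtype=float)
    b = len(g) // n_batches                   # batch size
    means = g[:b * n_batches].reshape(n_batches, b).mean(axis=1)
    return b * means.var(ddof=1)

# Check on an AR(1) chain with rho = 0.8, whose true asymptotic variance is
# (1 / (1 - rho^2)) * (1 + rho) / (1 - rho) = 25; the estimate should be roughly close.
rng = np.random.default_rng(4)
rho, n = 0.8, 200_000
eps = rng.normal(size=n)
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = rho * x[t - 1] + eps[t]
print(batch_means_variance(x), (1 / (1 - rho**2)) * (1 + rho) / (1 - rho))
```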