No Arabic abstract
Principal component analysis (PCA) is fundamental to statistical machine learning. It extracts latent principal factors that contribute to the most variation of the data. When data are stored across multiple machines, however, communication cost can prohibit the computation of PCA in a central location and distributed algorithms for PCA are thus needed. This paper proposes and studies a distributed PCA algorithm: each node machine computes the top $K$ eigenvectors and transmits them to the central server; the central server then aggregates the information from all the node machines and conducts a PCA based on the aggregated information. We investigate the bias and variance for the resulting distributed estimator of the top $K$ eigenvectors. In particular, we show that for distributions with symmetric innovation, the empirical top eigenspaces are unbiased and hence the distributed PCA is unbiased. We derive the rate of convergence for distributed PCA estimators, which depends explicitly on the effective rank of covariance, eigen-gap, and the number of machines. We show that when the number of machines is not unreasonably large, the distributed PCA performs as well as the whole sample PCA, even without full access of whole data. The theoretical results are verified by an extensive simulation study. We also extend our analysis to the heterogeneous case where the population covariance matrices are different across local machines but share similar top eigen-structures.
Mixture models are regularly used in density estimation applications, but the problem of estimating the mixing distribution remains a challenge. Nonparametric maximum likelihood produce estimates of the mixing distribution that are discrete, and these may be hard to interpret when the true mixing distribution is believed to have a smooth density. In this paper, we investigate an algorithm that produces a sequence of smooth estimates that has been conjectured to converge to the nonparametric maximum likelihood estimator. Here we give a rigorous proof of this conjecture, and propose a new data-driven stopping rule that produces smooth near-maximum likelihood estimates of the mixing density, and simulations demonstrate the quality empirical performance of this estimator.
We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream under the relaxed strict-turnstile model, when space limitations make exact computation infeasible. An equivalent measure of entropy is the Renyi entropy that depends on a constant alpha. This quantity can be estimated efficiently and unbiasedly from a low-dimensional synopsis called an alpha-stable data sketch via the method of compressed counting. An approximation to the Shannon entropy can be obtained from the Renyi entropy by taking alpha sufficiently close to 1. However, practical guidelines for parameter calibration with respect to alpha are lacking. We avoid this problem by showing that the random variables used in estimating the Renyi entropy can be transformed to have a proper distributional limit as alpha approaches 1: the maximally skewed, strictly stable distribution with alpha = 1 defined on the entire real line. We propose a family of asymptotically unbiased log-mean estimators of the Shannon entropy, indexed by a constant zeta > 0, that can be computed in a single-pass algorithm to provide an additive approximation. We recommend the log-mean estimator with zeta = 1 that has exponentially decreasing tail bounds on the error probability, asymptotic relative efficiency of 0.932, and near-optimal computational complexity.
In this note we provide explicit expressions and expansions for a special function which appears in nonparametric estimation of log-densities. This function returns the integral of a log-linear function on a simplex of arbitrary dimension. In particular it is used in the R-package LogCondDEAD by Cule et al. (2007).
We consider the problem of computing the joint distribution of order statistics of stochastically independent random variables in one- and two-group models. While recursive formulas for evaluating the joint cumulative distribution function of such order statistics exist in the literature for a longer time, their numerical implementation remains a challenging task. We tackle this task by presenting novel generalizations of known recursions which we utilize to obtain exact results (calculated in rational arithmetic) as well as faithfully rounded results. Finally, some applications in stepwise multiple hypothesis testing are discussed.
Nested sampling is a simulation method for approximating marginal likelihoods proposed by Skilling (2006). We establish that nested sampling has an approximation error that vanishes at the standard Monte Carlo rate and that this error is asymptotically Gaussian. We show that the asymptotic variance of the nested sampling approximation typically grows linearly with the dimension of the parameter. We discuss the applicability and efficiency of nested sampling in realistic problems, and we compare it with two current methods for computing marginal likelihood. We propose an extension that avoids resorting to Markov chain Monte Carlo to obtain the simulated points.