Taking the Fourier integral theorem as our starting point, in this paper we focus on natural Monte Carlo and fully nonparametric estimators of multivariate distributions and conditional distribution functions. We do this without the need for any estimated covariance matrix or dependence structure between variables. These aspects arise immediately from the integral theorem. Being able to model multivariate data sets using conditional distribution functions, we can study a number of problems, such as prediction for Markov processes, estimation of mixing distribution functions that depend on covariates, and general multivariate data. The estimators are explicit and Monte Carlo based, and they require no recursive or iterative algorithms.
Starting with the Fourier integral theorem, we present natural Monte Carlo estimators of multivariate functions, including densities, mixing densities, transition densities, and regression functions, and we address the search for modes of multivariate density functions (modal regression). Rates of convergence are established and, in many cases, are superior to those of current standard estimators, such as kernel density estimators and kernel regression functions. Numerical illustrations are presented.
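As an illustration of how such an estimator can look, the following is a minimal sketch that assumes the sin-kernel form of the Fourier integral theorem, in which a density at a point y is approximated by averaging the product over coordinates of sin(R(y_j - x_ij)) / (y_j - x_ij) across the sample and dividing by pi^d. The function name `fourier_density_estimate` and the choice of the truncation parameter `R` are hypothetical and are not taken from the abstract.

```python
import numpy as np

def fourier_density_estimate(points, sample, R):
    """Monte Carlo density estimate at each row of `points`.

    points: array of shape (m, d), evaluation locations.
    sample: array of shape (n, d), observed data.
    R: positive truncation parameter (an assumption; it plays a
       role similar to an inverse bandwidth).
    """
    points = np.atleast_2d(points)
    sample = np.atleast_2d(sample)
    d = sample.shape[1]
    # Pairwise coordinate differences, shape (m, n, d).
    diff = points[:, None, :] - sample[None, :, :]
    # sin(R u) / u written as R * sinc(R u / pi), since
    # np.sinc(x) = sin(pi x) / (pi x); this stays finite at u = 0.
    kernel = R * np.sinc(R * diff / np.pi)
    # Product over coordinates, average over the sample, scale by pi^d.
    return kernel.prod(axis=2).mean(axis=1) / np.pi ** d

# Usage: estimate a standard normal density at the origin.
rng = np.random.default_rng(0)
sample = rng.standard_normal((20000, 1))
estimate = fourier_density_estimate(np.array([[0.0]]), sample, R=4.0)
```

Note that the product form needs no covariance matrix: the same scalar `R` is applied to every coordinate, and the estimate is a plain sample average.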
Sparse principal component analysis (PCA) is an important technique for dimensionality reduction of high-dimensional data. However, most existing sparse PCA algorithms are based on non-convex optimization, which provides little guarantee of global convergence. Sparse PCA algorithms based on a convex formulation, for example the Fantope projection and selection (FPS), overcome this difficulty, but are computationally expensive. In this work we study sparse PCA based on the convex FPS formulation, and propose a new algorithm that is computationally efficient and applicable to large and high-dimensional data sets. Nonasymptotic and explicit bounds are derived for both the optimization error and the statistical accuracy, which can be used for testing and inference problems. We also extend our algorithm to online learning problems, where data are obtained in a streaming fashion. The proposed algorithm is applied to high-dimensional gene expression data for the detection of functional gene groups.
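For readers unfamiliar with the convex constraint set behind FPS, the sketch below shows one standard building block: the Frobenius-norm projection of a symmetric matrix onto the Fantope {X : 0 ⪯ X ⪯ I, tr(X) = d}. It is not the new algorithm proposed in the abstract. The function name `fantope_projection` and the use of bisection to find the eigenvalue shift are illustrative assumptions.

```python
import numpy as np

def fantope_projection(A, d):
    """Project symmetric A onto {X : 0 <= X <= I, trace(X) = d}.

    The projection keeps the eigenvectors of A and replaces each
    eigenvalue g by clip(g - theta, 0, 1), where the shift theta is
    chosen so that the clipped eigenvalues sum to d.
    """
    A = (A + A.T) / 2.0  # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(A)
    # The clipped sum is nonincreasing in theta: it equals the
    # dimension p at `lo` and 0 at `hi`, so bisection brackets d.
    lo, hi = eigvals.min() - 1.0, eigvals.max()
    for _ in range(200):
        theta = (lo + hi) / 2.0
        total = np.clip(eigvals - theta, 0.0, 1.0).sum()
        if total > d:
            lo = theta
        else:
            hi = theta
    shifted = np.clip(eigvals - theta, 0.0, 1.0)
    return (eigvecs * shifted) @ eigvecs.T
```

The eigendecomposition in this step costs on the order of p^3 operations for a p-by-p matrix, which is one reason projection-based solvers become expensive in high dimensions.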
When we use simulation to evaluate the performance of a stochastic system, the simulation often contains input distributions estimated from real-world data; therefore, there is both simulation and input uncertainty in the performance estimates. Ignoring either source of uncertainty underestimates the overall statistical error. Simulation uncertainty can be reduced by additional computation (e.g., more replications). Input uncertainty can be reduced by collecting more real-world data, when feasible. This paper proposes an approach to quantify overall statistical uncertainty when the simulation is driven by independent parametric input distributions; specifically, we produce a confidence interval that accounts for both simulation and input uncertainty by using a metamodel-assisted bootstrapping approach. The input uncertainty is measured via bootstrapping, an equation-based stochastic kriging metamodel propagates the input uncertainty to the output mean, and both simulation and metamodel uncertainty are derived using properties of the metamodel. A variance decomposition is proposed to estimate the relative contribution of input to overall uncertainty; this information indicates whether the overall uncertainty can be significantly reduced through additional simulation alone. Asymptotic analysis provides theoretical support for our approach, while an empirical study demonstrates that it has good finite-sample performance.
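The split between input and simulation uncertainty can be illustrated with a much simpler stand-in than the metamodel-assisted approach described above. The sketch below uses a direct bootstrap with nested replications and no stochastic kriging metamodel; the toy simulation, the exponential input model, and all function names are hypothetical.

```python
import numpy as np

def simulate(rate, n_customers, rng):
    """Toy stochastic system (an assumption): mean of exponential service times."""
    return rng.exponential(1.0 / rate, size=n_customers).mean()

def input_vs_simulation_variance(data, n_boot=200, n_reps=20, seed=0):
    """Estimate the input and simulation contributions to variance.

    Each bootstrap resample of the real-world data yields a fitted
    rate; n_reps replications are then run at that rate.
    """
    rng = np.random.default_rng(seed)
    means = np.empty(n_boot)
    within = np.empty(n_boot)
    for b in range(n_boot):
        resample = rng.choice(data, size=data.size, replace=True)
        rate_hat = 1.0 / resample.mean()  # exponential rate estimate
        outputs = np.array([simulate(rate_hat, 50, rng) for _ in range(n_reps)])
        means[b] = outputs.mean()
        # Simulation variance of the replication average at this input.
        within[b] = outputs.var(ddof=1) / n_reps
    sim_var = within.mean()
    # Variance across bootstrap means mixes both sources; subtract the
    # simulation part and floor at zero to isolate the input part.
    input_var = max(means.var(ddof=1) - sim_var, 0.0)
    return input_var, sim_var
```

If `input_var` dominates `sim_var`, running more replications alone cannot shrink the overall uncertainty much, which is the kind of diagnostic the variance decomposition is meant to provide.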
A principal component analysis based on the generalized Gini correlation index is proposed (Gini PCA). The Gini PCA generalizes the standard PCA based on the variance. It is shown, in the Gaussian case, that the standard PCA is equivalent to the Gini PCA. It is also proven that the dimensionality reduction based on the generalized Gini correlation matrix, which relies on city-block distances, is robust to outliers. Monte Carlo simulations and an application to cars data (with outliers) show the robustness of the Gini PCA and provide different interpretations of the results compared with the variance PCA.
We propose a penalized likelihood method to jointly estimate multiple precision matrices for use in quadratic discriminant analysis and model-based clustering. A ridge penalty and a ridge fusion penalty are used to introduce shrinkage and promote similarity between precision matrix estimates. Block-wise coordinate descent is used for optimization, and validation likelihood is used for tuning parameter selection. Our method is applied in quadratic discriminant analysis and semi-supervised model-based clustering.
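To make the shrinkage effect of the ridge penalty concrete, here is a minimal sketch for a single precision matrix with no fusion term. It assumes the objective tr(SΩ) - log det(Ω) + (λ/2)‖Ω‖_F², whose scaling may differ from the one used in the paper; the function name `ridge_precision` is hypothetical. Under this assumption the minimizer shares eigenvectors with the sample covariance S, and each eigenvalue solves a scalar quadratic.

```python
import numpy as np

def ridge_precision(S, lam):
    """Ridge-penalized precision estimate for one covariance matrix S.

    Stationarity gives S - inv(Omega) + lam * Omega = 0, so Omega has
    the eigenvectors of S, and each eigenvalue w solves
    lam * w**2 + s * w - 1 = 0 for the matching eigenvalue s of S.
    """
    eigvals, eigvecs = np.linalg.eigh((S + S.T) / 2.0)
    # Positive root of the quadratic; always > 0 when lam > 0.
    omega = (-eigvals + np.sqrt(eigvals ** 2 + 4.0 * lam)) / (2.0 * lam)
    return (eigvecs * omega) @ eigvecs.T
```

Because the positive root exists even when an eigenvalue of S is zero, the estimate stays well defined when S is singular. The joint, fused problem couples the classes and has no such one-step solution, which is consistent with the abstract's use of block-wise coordinate descent.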