Handling big data has been a major bottleneck for traditional statistical models. Consequently, when accurate point prediction is the primary target, machine learning models are often preferred over their statistical counterparts for larger problems. However, fully probabilistic statistical models often outperform other models in quantifying the uncertainties associated with model predictions. We develop a data-driven statistical modeling framework that combines the uncertainties from an ensemble of statistical models learned on smaller subsets of data, which are carefully chosen to account for imbalances in the input space. We demonstrate this method on a photometric redshift estimation problem in cosmology, which seeks to infer a distribution of the redshift -- the stretching effect in observing the light of far-away galaxies -- given multivariate color information observed for an object in the sky. Our proposed method performs balanced partitioning, graph-based data subsampling across the partitions, and training of an ensemble of Gaussian process models.
To quickly approximate maximum likelihood estimators with massive data, this paper studies the Optimal Subsampling Method under the A-optimality Criterion (OSMAC) for generalized linear models. The consistency and asymptotic normality of the estimator from a general subsampling algorithm are established, and optimal subsampling probabilities under the A- and L-optimality criteria are derived. Furthermore, using Frobenius norm matrix concentration inequalities, finite sample properties of the subsample estimator based on optimal subsampling probabilities are also derived. Since the optimal subsampling probabilities depend on the full data estimate, an adaptive two-step algorithm is developed. Asymptotic normality and optimality of the estimator from this adaptive algorithm are established. The proposed methods are illustrated and evaluated through numerical experiments on simulated and real datasets.
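The adaptive two-step idea can be illustrated with a minimal sketch, assuming a logistic regression model (one member of the generalized linear family) and sampling probabilities proportional to |y_i - p_i| * ||x_i||, a form commonly associated with L-optimality for logistic regression. The function names, subsample sizes, and simulated data below are hypothetical choices for illustration, not the paper's implementation.

```python
import numpy as np


def weighted_logistic_mle(X, y, w, n_iter=50, tol=1e-10):
    """Maximize the weighted logistic log-likelihood by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def two_step_subsample_fit(X, y, r0, r, rng):
    """Pilot fit on a uniform subsample, then a fit on a reweighted subsample."""
    n = X.shape[0]
    # Step 1: uniform pilot subsample gives a rough estimate of beta.
    idx0 = rng.choice(n, size=r0, replace=True)
    beta_pilot = weighted_logistic_mle(X[idx0], y[idx0], np.ones(r0))
    # Step 2: sampling probabilities computed from the pilot estimate
    # (assumed form: proportional to |y_i - p_i| * ||x_i||).
    p = 1.0 / (1.0 + np.exp(-(X @ beta_pilot)))
    score = np.abs(y - p) * np.linalg.norm(X, axis=1)
    pi = score / score.sum()
    idx1 = rng.choice(n, size=r, replace=True, p=pi)
    # Combine both subsamples, weighting each draw by 1 / (its sampling probability).
    idx = np.concatenate([idx0, idx1])
    w = np.concatenate([np.full(r0, float(n)), 1.0 / pi[idx1]])
    return weighted_logistic_mle(X[idx], y[idx], w / w.mean())


# Simulated data (hypothetical) to compare the subsample and full-data fits.
rng = np.random.default_rng(0)
n, beta_true = 50_000, np.array([0.5, -0.5, 0.5])
X = rng.normal(size=(n, 3))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)
beta_full = weighted_logistic_mle(X, y, np.ones(n))
beta_sub = two_step_subsample_fit(X, y, r0=500, r=2000, rng=rng)
```

The inverse-probability weights keep the subsample estimator targeting the full-data maximum likelihood estimate, while the second-stage probabilities concentrate draws on observations that the pilot fit predicts poorly.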
Massive data pose major memory and computation challenges for analysis. These challenges can be tackled by taking subsamples from the full data as a surrogate. For functional data, it is common to collect multiple measurements over their domains, which requires even more memory and computation time when the sample size is large. The computation becomes even more intensive when statistical inference is required through bootstrap samples. To the best of our knowledge, this article is the first attempt to study the subsampling method for the functional linear model. We propose an optimal subsampling method based on the functional L-optimality criterion. When the response is a discrete or categorical variable, we further extend our proposed functional L-optimality subsampling (FLoS) method to the functional generalized linear model. We establish the asymptotic properties of the estimators obtained by the FLoS method. The finite sample performance of our proposed FLoS method is investigated through extensive simulation studies. The FLoS method is further demonstrated by analyzing two large-scale datasets: the global climate data and the kidney transplant data. The analysis results on these data show that the FLoS method performs much better than the uniform subsampling approach and closely approximates the results based on the full data while dramatically reducing computation time and memory.
The distribution function is essential in statistical inference, and is connected with samples to form a directed closed loop by the correspondence theorem in measure theory and the Glivenko-Cantelli and Donsker properties. This connection creates a paradigm for statistical inference. However, existing distribution functions are defined in Euclidean spaces and are no longer convenient to use for rapidly evolving data objects of a complex nature. It is imperative to develop the concept of a distribution function in a more general space to meet emerging needs. Note that linearity allows us to use hypercubes to define the distribution function in a Euclidean space, but without linearity in a metric space, we must work with the metric to investigate the probability measure. We introduce a class of metric distribution functions through the metric between random objects and a fixed location in metric spaces. We overcome this challenging step by proving the correspondence theorem and the Glivenko-Cantelli theorem for metric distribution functions in metric spaces, which lay the foundation for conducting rational statistical inference for metric space-valued data. Then, we develop a homogeneity test and a mutual independence test for non-Euclidean random objects, and present comprehensive empirical evidence to support the performance of our proposed methods.
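As a rough illustration of a distribution function built only from a metric, the sketch below computes an empirical version under one assumed reading of the construction: for a fixed location u and a point v, it returns the fraction of sample points lying in the closed ball centered at u with radius d(u, v). The function names and the unit-sphere example with geodesic distance are hypothetical choices, not taken from the abstract.

```python
import numpy as np


def empirical_mdf(u, v, sample, metric):
    """Fraction of sample points x with metric(u, x) <= metric(u, v)."""
    radius = metric(u, v)
    dists = np.array([metric(u, x) for x in sample])
    return float(np.mean(dists <= radius))


def geodesic(a, b):
    """Great-circle distance between two unit vectors."""
    return float(np.arccos(np.clip(a @ b, -1.0, 1.0)))


# Hypothetical example: points drawn uniformly on the unit sphere in R^3.
rng = np.random.default_rng(0)
sample = rng.normal(size=(2000, 3))
sample /= np.linalg.norm(sample, axis=1, keepdims=True)

north = np.array([0.0, 0.0, 1.0])
equator = np.array([1.0, 0.0, 0.0])
south = np.array([0.0, 0.0, -1.0])

f_equator = empirical_mdf(north, equator, sample, geodesic)  # ball = upper hemisphere
f_south = empirical_mdf(north, south, sample, geodesic)      # ball = whole sphere
```

Only pairwise distances are used, so the same function applies to any data type for which a metric is available.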
With the rapid development of data collection and aggregation technologies in many scientific disciplines, it is becoming increasingly common to conduct large-scale or online regression to analyze real-world data and unveil real-world evidence. In such applications, it is often numerically challenging or sometimes infeasible to store the entire dataset in memory. Consequently, classical batch-based estimation methods that involve the entire dataset are less attractive or no longer applicable. Instead, recursive estimation methods such as stochastic gradient descent that process data points sequentially are more appealing, exhibiting both numerical convenience and memory efficiency. In this paper, for scalable estimation of large or online survival data, we propose a stochastic gradient descent method that recursively updates the estimates in an online manner as data points arrive sequentially in streams. Theoretical results such as asymptotic normality and estimation efficiency are established to justify its validity. Furthermore, to quantify the uncertainty associated with the proposed stochastic gradient descent estimator and facilitate statistical inference, we develop a scalable resampling strategy that specifically caters to the large-scale or online setting. Simulation studies and a real data application are also provided to assess its performance and illustrate its practical utility.
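The recursive update can be sketched as follows. The abstract does not state which survival model is used, so this example assumes an exponential proportional hazards model with right censoring, chosen because its log-likelihood is a sum over observations: each observation contributes delta * x'beta - t * exp(x'beta). The step-size schedule, the iterate averaging, and all names are hypothetical, and the resampling-based inference step is not shown.

```python
import numpy as np


def simulate_stream(n, beta_true, rng, t_max=2.0):
    """Yield (x, observed time, event indicator), censoring at t_max."""
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0, size=beta_true.size)
        t_event = rng.exponential(1.0 / np.exp(x @ beta_true))
        yield x, min(t_event, t_max), float(t_event <= t_max)


def sgd_exponential_survival(stream, dim, c=0.5, k0=10, alpha=0.6, burn_in=0):
    """One pass of stochastic gradient ascent; each point is used once and discarded."""
    beta = np.zeros(dim)
    beta_bar = np.zeros(dim)  # running average of the iterates after burn-in
    m = 0
    for k, (x, t, delta) in enumerate(stream, start=1):
        # Gradient of the per-observation log-likelihood under the assumed model.
        grad = (delta - t * np.exp(x @ beta)) * x
        beta = beta + c / (k0 + k) ** alpha * grad
        if k > burn_in:
            m += 1
            beta_bar += (beta - beta_bar) / m
    return beta, beta_bar


rng = np.random.default_rng(1)
beta_true = np.array([0.5, -0.5])
stream = simulate_stream(20_000, beta_true, rng)
beta_last, beta_avg = sgd_exponential_survival(stream, dim=2, burn_in=10_000)
```

Memory use is constant in the stream length, since only the current iterate and its running average are stored.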
The celebrated Bernstein-von Mises theorem ensures that credible regions from a Bayesian posterior are well calibrated when the model is correctly specified, in the frequentist sense that their coverage probabilities tend to the nominal values as data accrue. However, this conventional Bayesian framework is known to lack robustness when the model is misspecified or only partly specified, such as in quantile regression, risk-minimization-based supervised/unsupervised learning, and robust estimation. To overcome this difficulty, we propose a new Bayesian inferential approach that substitutes the (misspecified or partly specified) likelihoods with proper exponentially tilted empirical likelihoods plus a regularization term. Our surrogate empirical likelihood is carefully constructed by using the first-order optimality condition of the empirical risk minimization as the moment condition. We show that the Bayesian posterior obtained by combining this surrogate empirical likelihood and the prior is asymptotically close to a normal distribution centered at the empirical risk minimizer with a covariance matrix taking an appropriate sandwich form. Consequently, the resulting Bayesian credible regions are automatically calibrated to deliver valid uncertainty quantification. Computationally, the proposed method can be easily implemented by Markov chain Monte Carlo sampling algorithms. Our numerical results show that the proposed method tends to be more accurate than existing state-of-the-art competitors.
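To make the sandwich form concrete, here is a minimal sketch that computes the standard frequentist sandwich covariance of an empirical risk minimizer, assuming squared-error loss as the simplest example. It only shows the target covariance that the posterior is said to match, and does not implement the proposed empirical-likelihood posterior. The function name and the simulated heteroscedastic data are hypothetical.

```python
import numpy as np


def erm_sandwich_covariance(X, y):
    """Least-squares ERM and its sandwich covariance H^{-1} S H^{-1} / n."""
    n = X.shape[0]
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ theta_hat
    # For the loss 0.5 * (y - x'theta)^2 the per-observation gradient is
    # -resid * x and the Hessian is x x'.
    H = X.T @ X / n                           # average Hessian
    S = (X * resid[:, None] ** 2).T @ X / n   # average outer product of gradients
    H_inv = np.linalg.inv(H)
    return theta_hat, H_inv @ S @ H_inv / n


# Hypothetical data with noise variance depending on the covariate, so the
# usual sigma^2 (X'X)^{-1} formula would be miscalibrated.
rng = np.random.default_rng(2)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([1.0, 2.0])
y = X @ theta_true + (0.5 + np.abs(X[:, 1])) * rng.normal(size=n)

theta_hat, cov = erm_sandwich_covariance(X, y)
std_err = np.sqrt(np.diag(cov))
```

When the working model is correct, S and H coincide up to a scale factor and the sandwich collapses to the classical covariance; under misspecification the two differ, which is why the sandwich form is needed for calibrated regions.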