It is well understood that Bayesian decision theory and average-case analysis are essentially identical. However, if one is interested in performing uncertainty quantification for a numerical task, it can be argued that standard approaches from the decision-theoretic framework are neither appropriate nor sufficient. Instead, we consider a particular optimality criterion from Bayesian experimental design and study the optimal information it implies in the numerical context. This information is demonstrated to differ, in general, from the information that would be used in an average-case-optimal numerical method. The explicit connection to Bayesian experimental design suggests several distinct regimes in which optimal probabilistic numerical methods can be developed.
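The experimental-design perspective can be made concrete on a toy problem. The sketch below scores candidate observation locations for a Bayesian linear model by a classical design criterion (the log-determinant of the posterior precision, a D-optimality-style quantity) and greedily selects design points. The model, feature map, and grid are illustrative assumptions, not the construction from the abstract above.

```python
import numpy as np

# Toy Bayesian linear model: y = features(x) @ w + noise, with prior w ~ N(0, I).
# A D-optimality-style criterion scores a candidate design X by the
# log-determinant of the posterior precision of w; for a Gaussian model this
# is a monotone function of the expected information gain about w.

def features(x):
    # simple polynomial features; purely illustrative
    return np.stack([np.ones_like(x), x, x**2], axis=-1)

def log_det_posterior_precision(X, noise_var=0.1):
    Phi = features(X)
    A = np.eye(Phi.shape[1]) + Phi.T @ Phi / noise_var  # posterior precision
    return np.linalg.slogdet(A)[1]

# Greedily pick design points from a candidate grid on [-1, 1].
candidates = np.linspace(-1.0, 1.0, 21)
design = []
for _ in range(3):
    scores = [log_det_posterior_precision(np.array(design + [c]))
              for c in candidates]
    design.append(candidates[int(np.argmax(scores))])

print(design)
```

For this feature map the first selected point maximises the prior predictive variance (an endpoint of the grid), and subsequent points are chosen to complement it; the same greedy loop applies to any design criterion that returns a scalar score.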
Bayesian quadrature (BQ) is a method for solving numerical integration problems in a Bayesian manner, which allows users to quantify their uncertainty about the solution. The standard approach to BQ is based on a Gaussian process (GP) approximation of the integrand. As a result, BQ is inherently limited to cases where GP approximations can be performed efficiently, which often rules out very high-dimensional or non-smooth target functions. This paper proposes to tackle this issue with a new Bayesian numerical integration algorithm based on Bayesian Additive Regression Trees (BART) priors, which we call BART-Int. BART priors are easy to tune and well-suited for discontinuous functions. We demonstrate that they also lend themselves naturally to a sequential design setting and that explicit convergence rates can be obtained in a variety of settings. The advantages and disadvantages of this new methodology are highlighted on a set of benchmark tests including the Genz functions, and on a Bayesian survey design problem.
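The standard GP-based baseline that BART-Int replaces can be sketched in a few lines: place a GP prior on the integrand, condition on a handful of evaluations, and read off the posterior mean and variance of the integral. The kernel, lengthscale, and quadrature grid below are illustrative choices, not those of the paper.

```python
import numpy as np

# Squared-exponential kernel between two 1-D point sets.
def k(a, b, ls=0.3):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ls**2)

f = lambda x: np.sin(2 * np.pi * x) + x      # true integral over [0, 1] is 0.5
nodes = np.linspace(0.05, 0.95, 8)           # evaluation nodes
y = f(nodes)

# Kernel integrals approximated on a dense grid over [0, 1].
grid = np.linspace(0.0, 1.0, 2001)
z = k(grid, nodes).mean(axis=0)              # z_i ≈ ∫ k(x, x_i) dx
zz = k(grid, grid).mean()                    # ≈ ∫∫ k(x, x') dx dx'

K = k(nodes, nodes) + 1e-10 * np.eye(len(nodes))  # jitter for stability
w = np.linalg.solve(K, z)                    # BQ weights
post_mean = w @ y                            # posterior mean of the integral
post_var = zz - z @ w                        # posterior variance of the integral
print(post_mean, post_var)
```

The posterior variance is what distinguishes BQ from a plain quadrature rule: it shrinks as nodes are added, which is also what makes sequential design natural in this setting.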
Massive datasets pose major memory and computation challenges for analysis. These challenges can be tackled by taking subsamples from the full data as a surrogate. For functional data, it is common to collect multiple measurements over their domains, which requires even more memory and computation time when the sample size is large. Computation becomes even more intensive when statistical inference is carried out through bootstrap samples. To the best of our knowledge, this article is the first to study subsampling methods for the functional linear model. We propose an optimal subsampling method based on the functional L-optimality criterion. When the response is a discrete or categorical variable, we further extend our proposed functional L-optimality subsampling (FLoS) method to the functional generalized linear model. We establish the asymptotic properties of the estimators obtained by the FLoS method. The finite-sample performance of the proposed FLoS method is investigated through extensive simulation studies. The FLoS method is further demonstrated by analyzing two large-scale datasets: global climate data and kidney transplant data. The results show that the FLoS method substantially outperforms the uniform subsampling approach and closely approximates the full-data results while dramatically reducing computation time and memory.
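The general recipe behind optimality-criterion subsampling can be illustrated on ordinary (non-functional) linear regression. The sketch below uses a pilot fit to form L-optimality-style probabilities proportional to |residual| times the covariate norm, a common form in the optimal-subsampling literature; it is a simplified stand-in for the paper's functional FLoS criterion, and all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r = 100_000, 5, 1_000
X = rng.normal(size=(n, p))
beta = np.arange(1, p + 1, dtype=float)
y = X @ beta + rng.normal(size=n)

# Step 1: pilot estimate from a small uniform subsample.
pilot_idx = rng.choice(n, 500, replace=False)
beta_pilot = np.linalg.lstsq(X[pilot_idx], y[pilot_idx], rcond=None)[0]

# Step 2: L-optimality-style subsampling probabilities
# pi_i ∝ |residual_i| * ||x_i||.
resid = np.abs(y - X @ beta_pilot)
pi = resid * np.linalg.norm(X, axis=1)
pi /= pi.sum()

# Step 3: draw r points with these probabilities and fit a weighted
# least-squares estimator with inverse-probability weights.
idx = rng.choice(n, r, replace=True, p=pi)
w = 1.0 / (pi[idx] * r)
Xw = X[idx] * np.sqrt(w)[:, None]
yw = y[idx] * np.sqrt(w)
beta_sub = np.linalg.lstsq(Xw, yw, rcond=None)[0]
print(np.round(beta_sub, 2))
```

The inverse-probability weights keep the subsample estimator asymptotically unbiased for the full-data estimator; the functional setting adds a basis-expansion step for the functional covariate but follows the same pilot-then-resample structure.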
We propose a novel generalisation of the Student-t Probabilistic Principal Component methodology which: (1) accounts for an asymmetric distribution of the observation data; (2) provides a framework for grouped and generalised multiple-degree-of-freedom structures, giving a more flexible approach to modelling groups of marginal tail dependence in the observation data; and (3) separates the tail effect of the error terms and factors. The new feature extraction methods are derived in an incomplete data setting to efficiently handle the presence of missing values in the observation vector. We discuss various special cases of the algorithm that result from simplified assumptions on the process generating the data. The applicability of the new framework is illustrated on a data set consisting of the cryptocurrencies with the highest market capitalisation.
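The idea of giving factors and errors separate tail behaviour can be seen in a minimal generative sketch of a Student-t factor model, using the standard scale-mixture representation of the t distribution. The dimensions, loadings, and degrees of freedom below are illustrative assumptions and do not reproduce the paper's grouped or asymmetric extensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 500, 6, 2
W = rng.normal(size=(p, k))              # factor loadings (illustrative)
nu_factor, nu_error = 3.0, 20.0          # separate degrees of freedom

# Student-t draws as a scale mixture of Gaussians: if g ~ chi2(nu)/nu and
# z ~ N(0, 1), then z / sqrt(g) is t-distributed with nu degrees of freedom.
g_f = rng.chisquare(nu_factor, size=n) / nu_factor
g_e = rng.chisquare(nu_error, size=n) / nu_error
F = rng.normal(size=(n, k)) / np.sqrt(g_f)[:, None]        # heavy-tailed factors
E = 0.1 * rng.normal(size=(n, p)) / np.sqrt(g_e)[:, None]  # lighter-tailed errors

X = F @ W.T + E                          # observations
print(X.shape)
```

Because the factor and error mixing variables `g_f` and `g_e` are drawn with different degrees of freedom, extreme co-movements across coordinates are driven by the factors while the idiosyncratic noise stays comparatively light-tailed, which is the separation property the abstract refers to.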
Climate change impact studies inform policymakers on the estimated damages of future climate change on economic, health and other outcomes. In most studies, an annual outcome variable is observed, e.g. agricultural yield, annual mortality or gross domestic product, along with a higher-frequency regressor, e.g. daily temperature. While applied researchers tend to consider multiple models to characterize the relationship between the outcome and the high-frequency regressor, a choice between the damage functions implied by the different models has to be made in order to inform policy. This paper formalizes the model selection problem in this empirical setting and provides conditions for the consistency of Monte Carlo cross-validation and generalized information criteria. A simulation study illustrates the theoretical results and points to the relevance of the signal-to-noise ratio for the finite-sample behavior of the model selection criteria. Two empirical applications with starkly different signal-to-noise ratios illustrate the practical implications of the formal analysis on model selection criteria provided in this paper.
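Monte Carlo cross-validation itself is simple to state: repeatedly split the data into random train/test halves and average the test error of each candidate specification. The sketch below compares polynomial damage functions on simulated data with a quadratic truth; the data-generating process and split sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
temp = rng.uniform(0, 35, size=n)                 # temperature regressor
y = 0.5 * temp - 0.01 * temp**2 + rng.normal(scale=1.0, size=n)

def design(t, degree):
    # polynomial design matrix, columns t^degree, ..., t, 1
    return np.vander(t, degree + 1)

def fit_predict(Xtr, ytr, Xte):
    coef = np.linalg.lstsq(Xtr, ytr, rcond=None)[0]
    return Xte @ coef

# Monte Carlo cross-validation: average test MSE over repeated random splits.
def mccv_mse(degree, splits=50, test_frac=0.25):
    errs = []
    for _ in range(splits):
        perm = rng.permutation(n)
        m = int(n * test_frac)
        te, tr = perm[:m], perm[m:]
        pred = fit_predict(design(temp[tr], degree), y[tr],
                           design(temp[te], degree))
        errs.append(np.mean((y[te] - pred)**2))
    return float(np.mean(errs))

scores = {d: mccv_mse(d) for d in (1, 2, 3)}
best = min(scores, key=scores.get)
print(best, scores)
```

With a strong quadratic signal relative to the noise, the quadratic specification clearly dominates the linear one; shrinking the signal-to-noise ratio blurs exactly this comparison, which is the finite-sample phenomenon the simulation study in the paper examines.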
Archetypal analysis is an unsupervised learning method for exploratory data analysis. One major challenge that limits the applicability of archetypal analysis in practice is the inherent computational complexity of the existing algorithms. In this paper, we provide a novel approximation approach to partially address this issue. Utilizing probabilistic ideas from high-dimensional geometry, we introduce two preprocessing techniques to reduce the dimension and representation cardinality of the data, respectively. We prove that, provided the data is approximately embedded in a low-dimensional linear subspace and the convex hull of the corresponding representations is well approximated by a polytope with a few vertices, our method can effectively reduce the scaling of archetypal analysis. Moreover, the solution of the reduced problem is near-optimal in terms of prediction errors. Our approach can be combined with other acceleration techniques to further mitigate the intrinsic complexity of archetypal analysis. We demonstrate the usefulness of our results by applying our method to summarize several moderately large-scale datasets.
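The dimension-reduction half of such preprocessing can be sketched with a Johnson-Lindenstrauss-style random projection, one standard probabilistic tool from high-dimensional geometry. The sketch below projects data that lies near a low-dimensional subspace and spot-checks that pairwise distances are roughly preserved; all sizes are illustrative, and the cardinality-reduction step for the convex hull is not shown.

```python
import numpy as np

rng = np.random.default_rng(3)
n, D, d = 2_000, 500, 100
# data approximately embedded in a 10-dimensional subspace of R^500
basis = rng.normal(size=(10, D))
X = rng.normal(size=(n, 10)) @ basis + 0.01 * rng.normal(size=(n, D))

# Random Gaussian projection; the 1/sqrt(d) scaling keeps squared distances
# unbiased, so geometry relevant to archetypal analysis is roughly preserved.
P = rng.normal(size=(D, d)) / np.sqrt(d)
Xr = X @ P

# Spot-check distance preservation on random pairs of points.
i, j = rng.choice(n, 200), rng.choice(n, 200)
orig = np.linalg.norm(X[i] - X[j], axis=1)
proj = np.linalg.norm(Xr[i] - Xr[j], axis=1)
ratio = proj / np.maximum(orig, 1e-12)
print(float(ratio.mean()))
```

Running archetypal analysis on `Xr` instead of `X` then operates on a 100-dimensional problem rather than a 500-dimensional one, which is the kind of scaling reduction the abstract describes.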