Stochastic kriging has been widely employed for simulation metamodeling to predict the response surface of a complex simulation model. However, its use is limited to cases where the design space is low-dimensional, because the number of design points required for stochastic kriging to produce accurate predictions generally grows exponentially in the dimension of the design space. The large sample size results in both a prohibitive sampling cost for running the simulation model and a severe computational challenge due to the need to invert large covariance matrices. Based on tensor Markov kernels and sparse grid experimental designs, we develop a novel methodology that dramatically alleviates the curse of dimensionality. We show that the sample complexity of the proposed methodology grows very mildly with the dimension, even under model misspecification. We also develop fast algorithms that compute stochastic kriging in its exact form without any approximation schemes. We demonstrate via extensive numerical experiments that our methodology can handle problems with a design space of more than 10,000 dimensions, improving both prediction accuracy and computational efficiency by orders of magnitude relative to typical alternative methods in practice.
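For readers unfamiliar with the predictor being discussed, the following is a minimal sketch of plain stochastic kriging with a tensor-product exponential ("Markov") kernel; it is not the paper's sparse-grid methodology or its fast exact algorithms, and all function names and parameter values are illustrative assumptions.

```python
# Minimal stochastic-kriging sketch: predict the response surface from noisy
# simulation output using a tensor-product exponential ("Markov") kernel.
# This is NOT the paper's sparse-grid construction; names are illustrative.
import numpy as np

def tensor_markov_kernel(X1, X2, theta):
    """Product over dimensions of exp(-theta_j * |x_j - x'_j|)."""
    diff = np.abs(X1[:, None, :] - X2[None, :, :])    # pairwise |x_j - x'_j|
    return np.exp(-(diff * theta).sum(axis=2))        # tensor (product) form

def sk_predict(X, ybar, noise_var, X0, theta, tau2=1.0):
    """Stochastic-kriging predictor at points X0.

    X         : (n, d) design points
    ybar      : (n,)   sample means of replicated simulation output
    noise_var : (n,)   estimated variances of those sample means
    """
    K = tau2 * tensor_markov_kernel(X, X, theta) + np.diag(noise_var)
    k0 = tau2 * tensor_markov_kernel(X0, X, theta)
    beta = ybar.mean()                                # constant trend term
    return beta + k0 @ np.linalg.solve(K, ybar - beta)

# Toy usage: a 3-dimensional design with 50 points and 20 replications each.
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 3))
reps = rng.normal(X.sum(axis=1)[:, None], 0.5, size=(50, 20))
yhat = sk_predict(X, reps.mean(axis=1), reps.var(axis=1) / 20,
                  rng.uniform(size=(5, 3)), theta=np.ones(3))
```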
Estimating copulas with discrete marginal distributions is challenging, especially in high dimensions, because computing the likelihood contribution of each observation requires evaluating $2^{J}$ terms, with $J$ the number of discrete variables. Currently, data augmentation methods are used to carry out inference for discrete copulas and, in practice, the computation becomes infeasible when $J$ is large. Our article proposes two new fast Bayesian approaches for estimating high-dimensional copulas with discrete margins, or a combination of discrete and continuous margins. Both methods are based on recent advances in Bayesian methodology that work with an unbiased estimate of the likelihood rather than the likelihood itself; our key observation is that the likelihood of a discrete copula can be estimated unbiasedly with much less computation than evaluating it exactly or than current simulation methods that augment the model with latent variables. The first approach builds on the pseudo-marginal method, which allows Markov chain Monte Carlo simulation from the posterior distribution using only an unbiased estimate of the likelihood. The second approach is based on a Variational Bayes approximation to the posterior and also uses an unbiased estimate of the likelihood. We show that Monte Carlo and randomised quasi-Monte Carlo methods can be used with both approaches to reduce the variability of the likelihood estimate, and hence enable Bayesian inference for large values of $J$ for some classes of copulas where the computation was previously too expensive. Our article also introduces a \emph{correlated quasi-random number pseudo-marginal} approach into the literature. The methodology is illustrated through several real and simulated data examples.
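As a concrete, if crude, illustration of the kind of unbiased likelihood estimate the abstract refers to, the sketch below estimates one observation's likelihood contribution under a Gaussian copula with discrete margins by Monte Carlo rather than by summing $2^{J}$ terms; the paper's estimators (and its randomised quasi-Monte Carlo and correlated pseudo-marginal machinery) are far more efficient, and all names here are illustrative assumptions.

```python
# Crude but unbiased Monte Carlo estimate of one observation's likelihood
# under a Gaussian copula with discrete margins: the exact value is a
# rectangle probability, i.e. a sum over 2^J terms. Illustrative only.
import numpy as np
from scipy.stats import norm, poisson

def rectangle_prob_mc(y, R, margin_cdf, n_draws=10_000, rng=None):
    """Unbiased estimate of P(a < Z <= b) with Z ~ N(0, R),
    a_j = Phi^{-1}(F_j(y_j - 1)) and b_j = Phi^{-1}(F_j(y_j))."""
    if rng is None:
        rng = np.random.default_rng()
    a = norm.ppf(margin_cdf(y - 1))                       # lower cut points
    b = norm.ppf(margin_cdf(y))                           # upper cut points
    Z = rng.multivariate_normal(np.zeros(len(y)), R, size=n_draws)
    return np.all((Z > a) & (Z <= b), axis=1).mean()      # unbiased indicator average

# Toy usage: J = 3 Poisson(2) margins, exchangeable copula correlation 0.4.
J, rho = 3, 0.4
R = rho * np.ones((J, J)) + (1 - rho) * np.eye(J)
phat = rectangle_prob_mc(np.array([1, 0, 2]), R, lambda t: poisson.cdf(t, 2))
```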
The ever-growing availability of large-scale and massive data has increased the computational cost of data analysis. One such case is univariate filtering, which typically involves fitting many univariate regression models and is an essential step in numerous variable selection algorithms for reducing the number of predictor variables. This paper demonstrates how to dramatically reduce that computational cost by employing the score test or the simple Pearson correlation (or the t-test for binary responses). Extensive Monte Carlo simulation studies demonstrate their advantages and disadvantages compared to the log-likelihood ratio test, and examples with real data illustrate the performance of the score test and the log-likelihood ratio test under realistic scenarios. Depending on the regression model used, the score test is 30 to 60,000 times faster than the log-likelihood ratio test and produces nearly the same results. Hence this paper strongly recommends substituting the log-likelihood ratio test with the score test when coping with large-scale, massive, or big data, or even with data whose sample size is on the order of a few tens of thousands or higher.
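To make the speed-up concrete, here is a hedged sketch of score-test-style univariate filtering done in a single matrix pass (for Gaussian and logistic regression the single-predictor score statistic reduces to $n r^{2}$, with $r$ the Pearson correlation), instead of fitting one likelihood-ratio-tested regression per predictor; the function and variable names are illustrative, not the paper's implementation.

```python
# Hedged sketch: univariate filtering via the score test / Pearson correlation
# instead of fitting a separate regression model per predictor.
import numpy as np
from scipy import stats

def correlation_screen(X, y, alpha=0.05):
    """Keep predictors whose univariate association with y is significant.

    For Gaussian and logistic regression with a single predictor, the score
    statistic equals n * r^2, asymptotically chi-squared(1) under the null.
    """
    n = len(y)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    pvals = stats.chi2.sf(n * r ** 2, df=1)
    return np.where(pvals < alpha)[0]

# Toy usage: 10,000 predictors screened with a single matrix pass.
rng = np.random.default_rng(1)
X = rng.normal(size=(1_000, 10_000))
y = X[:, 0] + rng.normal(size=1_000)
kept = correlation_screen(X, y)
```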
This article proposes a visualization method for multidimensional data based on: (i) animated functional Hypothetical Outcome Plots (f-HOPs); (ii) a 3-dimensional Kiviat plot; and (iii) data sonification. In an Uncertainty Quantification (UQ) framework, such an analysis, coupled with standard statistical tools such as Probability Density Functions (PDFs), can augment the understanding of how uncertainties in the numerical code inputs translate into uncertainties in the quantity of interest (QoI). In contrast with the static representations of most advanced techniques, such as the functional Highest Density Region (HDR) boxplot or the functional boxplot, f-HOPs is a dynamic visualization that enables practitioners to infer the dynamics of the physics and to see functional correlations that may exist. Because this technique can represent only the QoI, we propose a 3-dimensional version of the Kiviat plot to encode all input parameters. This new visualization takes advantage of information from f-HOPs through data sonification. Altogether, this allows large datasets with a high-dimensional parameter space and a functional QoI to be analysed on the same canvas. The proposed method is assessed, and its benefits demonstrated, on two related environmental datasets.
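A minimal sketch of the animated f-HOPs idea (without the Kiviat plot or sonification components) is given below, using matplotlib's animation API; the synthetic curves and all names are illustrative assumptions, not the article's datasets or code.

```python
# Minimal animated functional Hypothetical Outcome Plot (f-HOP): individual
# functional realisations of the QoI are shown one at a time rather than
# collapsed into a static summary. Purely illustrative.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)                                 # functional support
curves = np.sin(2 * np.pi * (t + rng.uniform(size=(50, 1))))   # 50 QoI realisations

fig, ax = plt.subplots()
ax.plot(t, curves.T, color="lightgray", lw=0.5)                # faint ensemble for context
line, = ax.plot(t, curves[0], color="crimson", lw=2)           # currently shown outcome

def show_outcome(i):
    line.set_ydata(curves[i])                                  # flash one realisation per frame
    ax.set_title(f"hypothetical outcome {i + 1}/{len(curves)}")
    return (line,)

anim = FuncAnimation(fig, show_outcome, frames=len(curves), interval=200)
plt.show()
```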
Complex phenomena in engineering and the sciences are often modeled with computationally intensive feed-forward simulations for which a tractable analytic likelihood does not exist. In these cases, it is sometimes necessary to estimate an approximate likelihood or fit a fast emulator model for efficient statistical inference; such surrogate models include Gaussian synthetic likelihoods and, more recently, neural density estimators such as autoregressive models and normalizing flows. To date, however, there is no consistent way of quantifying the quality of such a fit. Here we propose a statistical framework that can distinguish any misspecified model from the target likelihood, and that can, in addition, identify with statistical confidence the regions of parameter space, as well as of feature space, where the fit is inadequate. Our validation method applies to settings where simulations are extremely costly and generated in batches or ensembles at fixed locations in parameter space. At the heart of our approach is a two-sample test that quantifies the quality of the fit at fixed parameter values, and a global test that assesses goodness-of-fit across simulation parameters. While our general framework can incorporate any test statistic or distance metric, we specifically argue for a new two-sample test that can leverage any regression method to attain high power and provide diagnostics in complex data settings.
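As an illustration of the local (fixed-parameter) building block, the sketch below implements a generic classifier-based two-sample test with a permutation p-value: if the surrogate matches the simulator at a given parameter value, a classifier should not beat chance at telling their outputs apart. This is a stand-in under assumed names, not the paper's specific regression-based test statistic or its global test.

```python
# Hedged sketch of a two-sample test at a fixed parameter value: train a
# classifier to distinguish simulator output from surrogate output and
# calibrate its cross-validated accuracy with a permutation test.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def two_sample_test(x_sim, x_surrogate, n_perm=100, rng=None):
    """Permutation p-value of a classifier two-sample test."""
    if rng is None:
        rng = np.random.default_rng()
    X = np.vstack([x_sim, x_surrogate])
    labels = np.r_[np.ones(len(x_sim)), np.zeros(len(x_surrogate))]

    def cv_accuracy(y):
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        return cross_val_score(clf, X, y, cv=5).mean()

    observed = cv_accuracy(labels)
    null = [cv_accuracy(rng.permutation(labels)) for _ in range(n_perm)]
    return float(np.mean(np.array(null) >= observed))

# Toy usage: a surrogate with a slightly wrong spread at this parameter value.
rng = np.random.default_rng(2)
pval = two_sample_test(rng.normal(0.0, 1.0, size=(300, 2)),
                       rng.normal(0.0, 1.3, size=(300, 2)), n_perm=30, rng=rng)
```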
Simulation offers a simple and flexible way to estimate the power of a clinical trial when analytic formulae are not available. The computational burden of using simulation has, however, restricted its application to only the simplest of sample size determination problems, minimising a single parameter (the overall sample size) subject to power being above a target level. We describe a general framework for solving simulation-based sample size determination problems with several design parameters over which to optimise and several conflicting criteria to be minimised. The method is based on an established global optimisation algorithm widely used in the design and analysis of computer experiments, using a non-parametric regression model as an approximation of the true underlying power function. The method is flexible, can be used for almost any problem for which power can be estimated using simulation, and can be implemented using existing statistical software packages. We illustrate its application to three increasingly complicated sample size determination problems involving complex clustering structures, co-primary endpoints, and small sample considerations.
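The sketch below illustrates the basic mechanics on a deliberately simple one-parameter problem: estimate power by simulation at a few candidate sample sizes, smooth the noisy estimates with a nonparametric (here Gaussian-process) regression model, and read off the cheapest design that clears the power target. The paper's framework handles several design parameters and conflicting criteria via an established global optimisation algorithm; everything named here is an illustrative assumption.

```python
# Hedged sketch: simulation-based power estimation plus a nonparametric
# regression surrogate of the power surface over the design parameter.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

def simulate_power(n_per_arm, effect=0.4, n_sim=500, alpha=0.05, rng=None):
    """Monte Carlo power of a two-sample z-test (stand-in for a complex trial)."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.normal(effect, 1.0, size=(n_sim, n_per_arm)).mean(axis=1)
    y = rng.normal(0.0, 1.0, size=(n_sim, n_per_arm)).mean(axis=1)
    z = (x - y) / np.sqrt(2.0 / n_per_arm)
    return float(np.mean(np.abs(z) > norm.ppf(1 - alpha / 2)))

# Noisy power estimates at a coarse grid of candidate per-arm sample sizes.
rng = np.random.default_rng(3)
candidates = np.arange(20, 201, 20, dtype=float)
power_hat = np.array([simulate_power(int(n), rng=rng) for n in candidates])

# Smooth the estimates with a Gaussian-process regression surrogate.
gp = GaussianProcessRegressor(kernel=Matern(length_scale=50.0) + WhiteKernel(),
                              normalize_y=True)
gp.fit(candidates[:, None], power_hat)

# Cheapest design whose predicted power clears the 80% target.
fine = np.arange(20, 201, dtype=float)
n_star = int(fine[np.argmax(gp.predict(fine[:, None]) >= 0.8)])
```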