We propose and analyze a generalized splitting method to sample approximately from a distribution conditional on the occurrence of a rare event. This has important applications in a variety of contexts in operations research, engineering, and computational statistics. The method uses independent trials starting from a single particle. We exploit this independence to obtain asymptotic and non-asymptotic bounds on the total variation error of the sampler. Our main finding is that the approximation error depends crucially on the relative variability of the number of points produced by the splitting algorithm in one run, and that this relative variability can be readily estimated via simulation. We illustrate the relevance of the proposed method on an application in which one needs to sample (approximately) from an intractable posterior density in Bayesian inference.
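To make the splitting idea concrete, here is a minimal multilevel-splitting sketch for sampling a standard normal variable conditional on the rare event {X > 3}. The levels, splitting factor, number of starting particles, and Metropolis moves below are all illustrative choices of ours, not the tuned quantities or the specific kernel a generalized splitting implementation would use.

```python
import math
import random

rng = random.Random(7)
levels = [1.0, 2.0, 2.5, 3.0]   # illustrative intermediate thresholds
split_factor = 5                 # illustrative splitting factor

def mh_move(x, level, steps=10, scale=0.5):
    """Metropolis moves targeting N(0,1) restricted to {x > level}."""
    for _ in range(steps):
        prop = x + rng.gauss(0.0, scale)
        log_alpha = 0.5 * (x * x - prop * prop)  # N(0,1) density ratio
        if prop > level and rng.random() < math.exp(min(0.0, log_alpha)):
            x = prop
    return x

particles = [rng.gauss(0.0, 1.0) for _ in range(200)]
for level in levels:
    survivors = [p for p in particles if p > level]
    # Split each survivor and diversify the copies with MH moves that
    # keep them above the current level.
    particles = [mh_move(p, level) for p in survivors
                 for _ in range(split_factor)]

# `particles` now approximates draws from N(0,1) conditional on X > 3;
# its random size from run to run reflects the variability in the number
# of points produced, which drives the approximation error discussed above.
```

The random final population size is the quantity whose relative variability governs the total variation error of the sampler, and it can be estimated by repeating such independent runs.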
Under measurement constraints, responses are expensive to measure and initially unavailable for most records in the dataset, while the covariates are available for the entire dataset. Our goal is to sample a relatively small portion of the dataset on which the expensive responses will be measured, such that the resulting sampling estimator is statistically efficient. Measurement constraints require that the sampling probabilities depend on at most a very small set of the responses. A sampling procedure that uses responses only on a small pilot sample, if at all, is called response-free. We propose a response-free sampling procedure for generalized linear models (GLMs). Under the A-optimality criterion, i.e., the trace of the asymptotic variance, the resulting estimator is statistically efficient within a class of sampling estimators. We establish the unconditional asymptotic distribution of a general class of response-free sampling estimators. This result is novel compared with the existing conditional results obtained by conditioning on both covariates and responses. Under our unconditional framework, the subsamples are no longer independent, and new martingale techniques are developed for our asymptotic theory. We further derive the A-optimal response-free sampling distribution. Since this distribution depends on population-level quantities, we propose the Optimal Sampling Under Measurement Constraints (OSUMC) algorithm to approximate the theoretical optimal sampling. Finally, we conduct an intensive empirical study demonstrating the advantages of the OSUMC algorithm over existing methods from both statistical and computational perspectives.
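The response-free idea can be sketched with a toy inverse-probability-weighted estimator: sampling probabilities are computed from covariates alone, and a response is measured only after its unit is selected. The linear data-generating model and the covariate-proportional probabilities below are our own illustrative assumptions, not the paper's A-optimal OSUMC distribution.

```python
import random

rng = random.Random(42)
N = 50000
x = [rng.uniform(0.5, 2.0) for _ in range(N)]       # covariates: all available
y = [2.0 * xi + rng.gauss(0, 0.2) for xi in x]      # responses: expensive

n = 2000                                            # target subsample size
total_x = sum(x)
# Response-free inclusion probabilities: depend only on covariates.
pi = [n * xi / total_x for xi in x]

est = 0.0
measured = 0
for i in range(N):
    if rng.random() < pi[i]:        # select unit i; only now measure y[i]
        est += y[i] / pi[i]         # Horvitz-Thompson inverse-probability weight
        measured += 1
est /= N                            # estimate of the population mean of y
```

Only `measured` (about n) responses are ever observed, yet `est` is unbiased for the full-population mean of y, which is roughly 2.5 under the assumed model.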
Monte Carlo (MC) sampling methods are widely applied in Bayesian inference, system simulation, and optimization problems. Markov chain Monte Carlo (MCMC) algorithms are a well-known class of MC methods that generate a Markov chain with the desired invariant distribution. In this document, we focus on the Metropolis-Hastings (MH) sampler, which can be considered the atom of MCMC techniques, introducing the basic notions and its main properties. We describe in detail all the elements involved in the MH algorithm and its most relevant variants. Several improvements and recent extensions proposed in the literature are also briefly discussed, providing a quick but thorough overview of the current landscape of Metropolis-based sampling.
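As a minimal illustration of the sampler discussed above, here is a one-dimensional random-walk Metropolis-Hastings sketch. The symmetric Gaussian proposal (under which the Hastings correction cancels), the standard-normal target, and all tuning values are our illustrative choices.

```python
import math
import random

def metropolis_hastings(log_target, x0, n_samples, proposal_scale=1.0, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal.

    log_target: unnormalized log-density of the target distribution.
    Because the proposal is symmetric, the acceptance ratio reduces to
    the plain Metropolis ratio target(x') / target(x).
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        x_prop = x + rng.gauss(0.0, proposal_scale)      # propose a move
        log_alpha = log_target(x_prop) - log_target(x)   # log acceptance ratio
        if rng.random() < math.exp(min(0.0, log_alpha)): # accept/reject
            x = x_prop
        samples.append(x)
    return samples

# Example: sample from a standard normal target (log-density -x^2/2).
draws = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_samples=20000)
mean = sum(draws) / len(draws)
```

After burn-in, the empirical mean and variance of `draws` approach 0 and 1, the moments of the invariant distribution.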
Estimating the probability of rare failure events is an essential step in the reliability assessment of engineering systems. Computing this failure probability for complex non-linear systems is challenging, and has recently spurred the development of active-learning reliability methods. These methods approximate the limit-state function (LSF) using surrogate models trained with a sequentially enriched set of model evaluations. A recently proposed method called stochastic spectral embedding (SSE) aims to improve the local approximation accuracy of global, spectral surrogate modelling techniques by sequentially embedding local residual expansions in subdomains of the input space. In this work, we apply SSE to the LSF, giving rise to a stochastic spectral embedding-based reliability (SSER) method. The resulting partition of the input space decomposes the failure probability into a set of easy-to-compute domain-wise failure probabilities. We propose a set of modifications that tailor the algorithm to efficiently solve rare event estimation problems. These modifications include specialized refinement domain selection, partitioning and enrichment strategies. We showcase the algorithm's performance on four benchmark problems of various dimensionality and complexity in the LSF.
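The domain-wise decomposition can be sketched with a toy one-dimensional example: partition the input space, estimate the conditional failure probability in each subdomain, and recombine weighted by the subdomain probability mass. The limit-state function `g` and the uniform input below are our own illustrative assumptions, not one of the paper's benchmarks; failure is the event {g(x) <= 0}.

```python
import random

rng = random.Random(3)

def g(x):
    """Toy limit-state function: the system fails when x >= 2."""
    return 2.0 - x

# Partition the support [0, 4) of a uniform input X ~ U(0, 4).
domains = [(0.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
p_fail = 0.0
for lo, hi in domains:
    mass = (hi - lo) / 4.0                            # P(X in domain)
    xs = [rng.uniform(lo, hi) for _ in range(20000)]
    p_cond = sum(g(x) <= 0 for x in xs) / len(xs)     # P(failure | domain)
    p_fail += mass * p_cond                           # recombine

# For this toy problem the true failure probability is P(X >= 2) = 0.5.
```

In SSER the partition is produced adaptively by the embedding procedure and each domain-wise probability is computed from a local surrogate rather than by crude Monte Carlo, but the recombination step has this same form.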
Environmental variability often has substantial impacts on natural populations and communities through its effects on the performance of individuals. Because organisms' responses to environmental conditions are often nonlinear (e.g., decreasing performance on both sides of an optimal temperature), the mean response is often different from the response in the mean environment. Ye et al. (2020) proposed testing for the presence of such variance effects on individual or population growth rates by estimating the Jensen Effect, the difference in average growth rates under varying versus fixed environments, in functional single index models for environmental effects on growth. In this paper, we extend this analysis to the effects of environmental variance on reproduction and survival, which have count and binary outcomes. In the standard generalized linear models used to analyze such data, the direction of the Jensen Effect is tacitly assumed a priori by the model's link function. Here we extend the methods of Ye et al. (2020), using a generalized single index model to test whether this assumed direction is contradicted by the data. We show that our test has reasonable power under mild alternatives, but requires sample sizes larger than are often available. We demonstrate our methods on a long-term time series of plant ground cover on the Idaho steppe.
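The Jensen Effect itself is easy to demonstrate numerically: with a nonlinear performance curve, mean performance in a varying environment differs from performance in the mean environment. The hump-shaped thermal performance curve and the temperature distribution below are our own illustrative assumptions, not the models or data of the paper.

```python
import math
import random

def performance(temp, opt=20.0, width=5.0):
    """Hypothetical hump-shaped thermal performance curve:
    performance declines on both sides of the optimum temperature."""
    return math.exp(-((temp - opt) / width) ** 2)

rng = random.Random(1)
mean_temp = 20.0
temps = [rng.gauss(mean_temp, 4.0) for _ in range(100000)]  # varying environment

# Jensen Effect: average performance under a varying environment
# minus performance at the mean (fixed) environment.
avg_perf = sum(performance(t) for t in temps) / len(temps)
jensen_effect = avg_perf - performance(mean_temp)
# The curve is concave near its optimum, so variability lowers average
# performance here and the Jensen Effect is negative.
```

A convex response (e.g., the exponential inverse link of a log-linear GLM) would instead force a positive Jensen Effect, which is exactly the kind of link-imposed direction the proposed test interrogates.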
In prediction problems, it is common to model the data-generating process and then use a model-based procedure, such as a Bayesian predictive distribution, to quantify uncertainty about the next observation. However, if the posited model is misspecified, then its predictions may not be calibrated -- that is, the predictive distribution's quantiles may not be nominal frequentist prediction upper limits, even asymptotically. Rather than abandoning the comfort of a model-based formulation for a more complicated non-model-based approach, here we propose a strategy in which the data itself helps determine whether the assumed model-based solution should be adjusted to account for model misspecification. This is achieved through a generalized Bayes formulation in which a learning rate parameter is tuned, via the proposed generalized predictive calibration (GPrC) algorithm, to make the predictive distribution calibrated, even under model misspecification. Extensive numerical experiments are presented, under a variety of settings, demonstrating the proposed GPrC algorithm's validity, efficiency, and robustness.
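The role of the learning rate can be sketched in a conjugate toy model: in a generalized (tempered) Bayes posterior the likelihood is raised to a power eta, so eta directly scales how much weight the data receive relative to the prior. The normal-mean model, prior values, and data below are our own illustrative assumptions; GPrC's actual contribution is the algorithm for tuning eta to achieve calibration, which this sketch does not implement.

```python
def tempered_posterior(data, eta, mu0=0.0, tau0=10.0, sigma=1.0):
    """Generalized-Bayes posterior for a normal mean with known variance.

    The likelihood is tempered by the learning rate eta, so the data
    contribute precision eta * n / sigma^2 instead of n / sigma^2.
    Conjugate prior: mean ~ N(mu0, tau0^2).
    """
    n = len(data)
    prior_prec = 1.0 / tau0 ** 2
    lik_prec = eta * n / sigma ** 2                  # eta scales the data's weight
    post_prec = prior_prec + lik_prec
    xbar = sum(data) / n
    post_mean = (prior_prec * mu0 + lik_prec * xbar) / post_prec
    post_sd = post_prec ** -0.5
    return post_mean, post_sd

data = [2.1, 1.9, 2.3, 2.0, 1.7]
m1, s1 = tempered_posterior(data, eta=1.0)   # standard Bayes
m2, s2 = tempered_posterior(data, eta=0.5)   # downweighted likelihood
# A smaller eta downweights the (possibly misspecified) likelihood and
# widens the posterior: s2 > s1, yielding more conservative predictions.
```

Tuning eta downward inflates predictive uncertainty, which is one mechanism by which a miscalibrated model-based predictive distribution can be brought back to nominal frequentist coverage.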