Variable selection in ultra-high dimensional linear regression is often preceded by a screening step to significantly reduce the dimension. Here a Bayesian variable screening method (BITS) is developed. BITS can successfully integrate prior knowledge, if any, on effect sizes and on the number of true variables. BITS iteratively includes potential variables with the highest posterior probability, accounting for the variables already selected. It is implemented by a fast Cholesky update algorithm and is shown to have the screening consistency property. Although BITS is built on a model with Gaussian errors, screening consistency is proved to hold under more general tail conditions. The notion of posterior screening consistency allows the resulting model to provide a good starting point for further Bayesian variable selection methods. A new screening consistent stopping rule based on posterior probability is developed. Simulation studies and real data examples demonstrate scalability and strong screening performance.
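To make the iterative inclusion step concrete, the following minimal Python sketch illustrates the idea; it is a simplified stand-in, not the paper's implementation: the posterior probability score is replaced by the residual correlation criterion it reduces to under a Gaussian working model, and the fast Cholesky update is replaced by a plain least-squares refit.

    import numpy as np

    def bits_screen_sketch(X, y, max_size):
        # Simplified stand-in for BITS: score candidates by their
        # correlation with the residual of the current fit, include the
        # best one, and refit. A rank-one Cholesky update of X_S'X_S
        # would make the refit cheap; a full solve is used for clarity.
        X = X - X.mean(axis=0)          # center so no intercept is needed
        y = y - y.mean()
        selected, residual = [], y.copy()
        for _ in range(max_size):
            scores = np.abs(X.T @ residual)
            scores[selected] = -np.inf  # skip already-included columns
            selected.append(int(np.argmax(scores)))
            beta, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
            residual = y - X[:, selected] @ beta
        return selected

The stopping rule described above would replace the fixed max_size budget with a posterior-probability-based criterion.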
We develop a Bayesian variable selection method, called SVEN, based on a hierarchical Gaussian linear model with priors placed on the regression coefficients as well as on the model space. Sparsity is achieved by using degenerate spike priors on inactive variables, whereas Gaussian slab priors are placed on the coefficients of the important predictors, making the posterior probability of a model available in explicit form (up to a normalizing constant). Strong model selection consistency is shown to be attained when the number of predictors grows nearly exponentially with the sample size, and even when the norm of mean effects solely due to the unimportant variables diverges, which is a novel attractive feature. An appealing byproduct of SVEN is the construction of novel model-weight-adjusted prediction intervals. Embedding a unique model-based screening step and using fast Cholesky updates, SVEN produces a highly scalable computational framework to explore gigantic model spaces, rapidly identify regions of high posterior probability, and provide fast inference and prediction. A temperature schedule guided by our model selection consistency derivations is used to further mitigate multimodal posterior distributions. The performance of SVEN is demonstrated through a number of simulation experiments and a real data example from a genome-wide association study with over half a million markers.
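As a rough illustration of a model posterior that is explicit up to a normalizing constant, here is a hedged Python sketch; it assumes a known error variance, an independent Gaussian slab, and a Bernoulli inclusion prior, which are simplifications and not SVEN's actual hierarchical prior.

    import numpy as np
    from scipy.stats import multivariate_normal

    def log_model_posterior_sketch(X, y, gamma, tau2=1.0, sigma2=1.0, w=0.05):
        # gamma: list of column indices in the candidate model.
        # Integrating out the slab coefficients gives the closed form
        # y | gamma ~ N(0, sigma2*I + tau2*Xg Xg'); adding the Bernoulli
        # model prior yields the unnormalized log posterior of gamma.
        n, p = X.shape
        Xg = X[:, gamma]
        cov = sigma2 * np.eye(n) + tau2 * (Xg @ Xg.T)
        log_marginal = multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(y)
        log_prior = len(gamma) * np.log(w) + (p - len(gamma)) * np.log(1 - w)
        return log_marginal + log_prior

A stochastic search over models would compare such scores across neighboring models, with the temperature schedule flattening the score surface to help escape local modes.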
Structural breaks are commonly seen in applications. For the detection of change points in time, a research gap remains in the ultra-high dimensional setting, where the covariates may bear spurious correlations. In this paper, we propose a two-stage approach to detect change points in ultra-high dimension: we first propose the dynamic tilted current correlation screening method to reduce the input dimension, and then detect possible change points in the framework of group variable selection. Not only is the spurious correlation between ultra-high dimensional covariates taken into consideration in variable screening, but non-convex penalties are also studied for change point detection in the ultra-high dimension. Asymptotic properties are derived to guarantee the consistency of the selection procedure, and numerical investigations show the promising performance of the proposed approach.
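A skeletal Python sketch of the two-stage pipeline follows; both stages are stand-ins chosen for brevity: plain marginal correlation screening replaces the dynamic tilted current correlation screening, and an exhaustive scan for a single break time with segmentwise least squares replaces the non-convex group-penalized selection.

    import numpy as np

    def two_stage_cp_sketch(X, y, keep=20):
        # Stage 1 (stand-in): marginal correlation screening.
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
        kept = np.argsort(corr)[-keep:]
        Xs = X[:, kept]

        # Stage 2 (stand-in): scan candidate break times; coefficients
        # may differ before and after the break.
        def rss(A, b):
            beta, *_ = np.linalg.lstsq(A, b, rcond=None)
            return np.sum((b - A @ beta) ** 2)

        splits = range(keep + 1, len(y) - keep)  # keep both segments overdetermined
        t_hat = min(splits, key=lambda t: rss(Xs[:t], y[:t]) + rss(Xs[t:], y[t:]))
        return kept, t_hat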
It has become increasingly common to collect high-dimensional binary data; for example, with the emergence of new sampling techniques in ecology. In smaller dimensions, multivariate probit (MVP) models are routinely used for inference. However, algorithms for fitting such models face issues in scaling up to high dimensions due to the intractability of the likelihood, which involves an integral over a multivariate normal distribution with no analytic form. Although a variety of algorithms have been proposed to approximate this intractable integral, these approaches are difficult to implement and/or inaccurate in high dimensions. We propose a two-stage Bayesian approach for inference on model parameters that accounts for the propagation of uncertainty between the stages. We use the special structure of latent Gaussian models to reduce the highly expensive computation involved in joint parameter estimation, focusing inference instead on marginal distributions of the model parameters. This makes the method embarrassingly parallel at both stages. We illustrate performance in simulations and in applications to joint species distribution modeling in ecology.
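The embarrassingly parallel structure can be sketched in Python as below; the per-margin maximum-likelihood probit fit is a stand-in for the paper's per-margin Bayesian computation, and the second stage (combining the margins with uncertainty propagation) is omitted.

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor
    from scipy.optimize import minimize
    from scipy.stats import norm

    def _fit_probit_margin(args):
        # Maximum-likelihood probit for one binary outcome.
        X, yj = args
        def nll(beta):
            eta = X @ beta
            # log P(y=1) = log Phi(eta); log P(y=0) = log Phi(-eta)
            return -np.sum(norm.logcdf(np.where(yj == 1, eta, -eta)))
        return minimize(nll, np.zeros(X.shape[1]), method="BFGS").x

    def parallel_marginal_fits(X, Y, workers=4):
        # Stage 1: the q margins of the binary response matrix Y (n x q)
        # are fit independently, so the work parallelizes trivially.
        tasks = ((X, Y[:, j]) for j in range(Y.shape[1]))
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return np.vstack(list(pool.map(_fit_probit_margin, tasks)))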
We develop a fully Bayesian framework for function-on-scalars regression with many predictors. The functional data response is modeled nonparametrically using unknown basis functions, which produces a flexible and data-adaptive functional basis. We incorporate shrinkage priors that effectively remove unimportant scalar covariates from the model and reduce sensitivity to the number of (unknown) basis functions. For variable selection in functional regression, we propose a decision theoretic posterior summarization technique, which identifies a subset of covariates that retains nearly the predictive accuracy of the full model. Our approach is broadly applicable for Bayesian functional regression models and, unlike existing methods, provides joint rather than marginal selection of important predictor variables. Computationally scalable posterior inference is achieved using a Gibbs sampler with linear time complexity in the number of predictors. The resulting algorithm is empirically faster than existing frequentist and Bayesian techniques, and provides joint estimation of model parameters, prediction and imputation of functional trajectories, and uncertainty quantification via the posterior distribution. A simulation study demonstrates improvements in estimation accuracy, uncertainty quantification, and variable selection relative to existing alternatives. The methodology is applied to actigraphy data to investigate the association between intraday physical activity and responses to a sleep questionnaire.
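The decision theoretic summarization step can be illustrated with a hedged Python sketch; it assumes the full model's posterior predictive mean has been collapsed to one fitted value per subject (the functional case would apply the same idea across basis coefficients), and it greedily grows the covariate subset until the sparse summary is within a (1 + eps) factor of the full model's loss.

    import numpy as np

    def summary_selection_sketch(Z, fit_full, eps=0.05):
        # Z: n x p scalar covariates; fit_full: the full model's
        # posterior predictive mean of the response.
        n, p = Z.shape
        def loss(active):
            if not active:
                return np.sum(fit_full ** 2)
            beta, *_ = np.linalg.lstsq(Z[:, active], fit_full, rcond=None)
            return np.sum((fit_full - Z[:, active] @ beta) ** 2)
        full_loss = loss(list(range(p)))
        active, remaining = [], set(range(p))
        while remaining and loss(active) > (1 + eps) * full_loss:
            j = min(remaining, key=lambda k: loss(active + [k]))
            active.append(j)
            remaining.remove(j)
        return active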
We propose a model of brain atrophy as a function of high-dimensional genetic information and low-dimensional covariates such as gender, age, APOE gene, and disease status. A high-dimensional nonparametric single-index Bayesian model is proposed to capture this relationship, with a B-spline series prior on the unknown functions and a Dirichlet process scale mixture of centered normals prior on the distributions of the random effects. The posterior contraction rate, without the random effect, is established for a fixed number of regions and time points with increasing sample size. We implement efficient computation through a Hamiltonian Monte Carlo (HMC) algorithm. The performance of the proposed Bayesian method is compared with that of corresponding linear-model estimators using the horseshoe prior, LASSO, and SCAD penalization on the high-dimensional covariates. The proposed Bayesian method is applied to a dataset of brain region volumes recorded over multiple visits of 748 individuals, using 620,901 SNPs and 6 other covariates per individual, to identify factors associated with brain atrophy.
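One concrete building block is the B-spline series expansion of the unknown link function; the Python sketch below constructs the basis matrix evaluated at the single-index values, and covers only that one ingredient, not the full posterior computation.

    import numpy as np
    from scipy.interpolate import BSpline

    def bspline_design(u, n_basis=8, degree=3):
        # Clamped knot vector over the range of the index values u,
        # with equally spaced interior knots.
        lo, hi = u.min(), u.max()
        interior = np.linspace(lo, hi, n_basis - degree + 1)[1:-1]
        t = np.r_[[lo] * (degree + 1), interior, [hi] * (degree + 1)]
        # Column j evaluates the j-th basis function B_j at u.
        return np.column_stack([
            BSpline(t, np.eye(n_basis)[j], degree)(u) for j in range(n_basis)
        ])

    # For a fixed index direction theta, the regression mean is
    # bspline_design(X @ theta) @ beta; a sampler such as HMC would
    # update theta and the basis coefficients beta jointly.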