No Arabic abstract
In many applications, the dataset under investigation exhibits heterogeneous regimes that are more appropriately modeled using piece-wise linear models for each of the data segments separated by change-points. Although there have been much work on change point linear regression for the low dimensional case, high-dimensional change point regression is severely underdeveloped. Motivated by the analysis of Minnesota House Price Index data, we propose a fully Bayesian framework for fitting changing linear regression models in high-dimensional settings. Using segment-specific shrinkage and diffusion priors, we deliver full posterior inference for the change points and simultaneously obtain posterior probabilities of variable selection in each segment via an efficient Gibbs sampler. Additionally, our method can detect an unknown number of change points and accommodate different variable selection constraints like grouping or partial selection. We substantiate the accuracy of our method using simulation experiments for a wide range of scenarios. We apply our approach for a macro-economic analysis of Minnesota house price index data. The results strongly favor the change point model over a homogeneous (no change point) high-dimensional regression model.
We study high-dimensional Bayesian linear regression with product priors. Using the nascent theory of non-linear large deviations (Chatterjee and Dembo,2016), we derive sufficient conditions for the leading-order correctness of the naive mean-field approximation to the log-normalizing constant of the posterior distribution. Subsequently, assuming a true linear model for the observed data, we derive a limiting infinite dimensional variational formula for the log normalizing constant of the posterior. Furthermore, we establish that under an additional separation condition, the variational problem has a unique optimizer, and this optimizer governs the probabilistic properties of the posterior distribution. We provide intuitive sufficient conditions for the validity of this separation condition. Finally, we illustrate our results on concrete examples with specific design matrices.
Spike-and-slab priors are popular Bayesian solutions for high-dimensional linear regression problems. Previous theoretical studies on spike-and-slab methods focus on specific prior formulations and use prior-dependent conditions and analyses, and thus can not be generalized directly. In this paper, we propose a class of generic spike-and-slab priors and develop a unified framework to rigorously assess their theoretical properties. Technically, we provide general conditions under which generic spike-and-slab priors can achieve the nearly-optimal posterior contraction rate and the model selection consistency. Our results include those of Narisetty and He (2014) and Castillo et al. (2015) as special cases.
We study high-dimensional regression with missing entries in the covariates. A common strategy in practice is to emph{impute} the missing entries with an appropriate substitute and then implement a standard statistical procedure acting as if the covariates were fully observed. Recent literature on this subject proposes instead to design a specific, often complicated or non-convex, algorithm tailored to the case of missing covariates. We investigate a simpler approach where we fill-in the missing entries with their conditional mean given the observed covariates. We show that this imputation scheme coupled with standard off-the-shelf procedures such as the LASSO and square-root LASSO retains the minimax estimation rate in the random-design setting where the covariates are i.i.d. sub-Gaussian. We further show that the square-root LASSO remains emph{pivotal} in this setting. It is often the case that the conditional expectation cannot be computed exactly and must be approximated from data. We study two cases where the covariates either follow an autoregressive (AR) process, or are jointly Gaussian with sparse precision matrix. We propose tractable estimators for the conditional expectation and then perform linear regression via LASSO, and show similar estimation rates in both cases. We complement our theoretical results with simulations on synthetic and semi-synthetic examples, illustrating not only the sharpness of our bounds, but also the broader utility of this strategy beyond our theoretical assumptions.
In this paper we develop an online statistical inference approach for high-dimensional generalized linear models with streaming data for real-time estimation and inference. We propose an online debiased lasso (ODL) method to accommodate the special structure of streaming data. ODL differs from offline debiased lasso in two important aspects. First, in computing the estimate at the current stage, it only uses summary statistics of the historical data. Second, in addition to debiasing an online lasso estimator, ODL corrects an approximation error term arising from nonlinear online updating with streaming data. We show that the proposed online debiased estimators for the GLMs are consistent and asymptotically normal. This result provides a theoretical basis for carrying out real-time interim statistical inference with streaming data. Extensive numerical experiments are conducted to evaluate the performance of the proposed ODL method. These experiments demonstrate the effectiveness of our algorithm and support the theoretical results. A streaming dataset from the National Automotive Sampling System-Crashworthiness Data System is analyzed to illustrate the application of the proposed method.
Consider the problem of estimating parameters $X^n in mathbb{R}^n $, generated by a stationary process, from $m$ response variables $Y^m = AX^n+Z^m$, under the assumption that the distribution of $X^n$ is known. This is the most general version of the Bayesian linear regression problem. The lack of computationally feasible algorithms that can employ generic prior distributions and provide a good estimate of $X^n$ has limited the set of distributions researchers use to model the data. In this paper, a new scheme called Q-MAP is proposed. The new method has the following properties: (i) It has similarities to the popular MAP estimation under the noiseless setting. (ii) In the noiseless setting, it achieves the asymptotically optimal performance when $X^n$ has independent and identically distributed components. (iii) It scales favorably with the dimensions of the problem and therefore is applicable to high-dimensional setups. (iv) The solution of the Q-MAP optimization can be found via a proposed iterative algorithm which is provably robust to the error (noise) in the response variables.