No Arabic abstract
Regression models with crossed random effect errors can be very expensive to compute. The cost of both generalized least squares and Gibbs sampling can easily grow as $N^{3/2}$ (or worse) for $N$ observations. Papaspiliopoulos et al. (2020) present a collapsed Gibbs sampler that costs $O(N)$, but under an extremely stringent sampling model. We propose a backfitting algorithm to compute a generalized least squares estimate and prove that it costs $O(N)$. A critical part of the proof is in ensuring that the number of iterations required is $O(1)$ which follows from keeping a certain matrix norm below $1-delta$ for some $delta>0$. Our conditions are greatly relaxed compared to those for the collapsed Gibbs sampler, though still strict. Empirically, the backfitting algorithm has a norm below $1-delta$ under conditions that are less strict than those in our assumptions. We illustrate the new algorithm on a ratings data set from Stitch Fix.
The cost of both generalized least squares (GLS) and Gibbs sampling in a crossed random effects model can easily grow faster than $N^{3/2}$ for $N$ observations. Ghosh et al. (2020) develop a backfitting algorithm that reduces the cost to $O(N)$. Here we extend that method to a generalized linear mixed model for logistic regression. We use backfitting within an iteratively reweighted penalized least square algorithm. The specific approach is a version of penalized quasi-likelihood due to Schall (1991). A straightforward version of Schalls algorithm would also cost more than $N^{3/2}$ because it requires the trace of the inverse of a large matrix. We approximate that quantity at cost $O(N)$ and prove that this substitution makes an asymptotically negligible difference. Our backfitting algorithm also collapses the fixed effect with one random effect at a time in a way that is analogous to the collapsed Gibbs sampler of Papaspiliopoulos et al. (2020). We use a symmetric operator that facilitates efficient covariance computation. We illustrate our method on a real dataset from Stitch Fix. By properly accounting for crossed random effects we show that a naive logistic regression could underestimate sampling variances by several hundred fold.
To fast approximate maximum likelihood estimators with massive data, this paper studies the Optimal Subsampling Method under the A-optimality Criterion (OSMAC) for generalized linear models. The consistency and asymptotic normality of the estimator from a general subsampling algorithm are established, and optimal subsampling probabilities under the A- and L-optimality criteria are derived. Furthermore, using Frobenius norm matrix concentration inequalities, finite sample properties of the subsample estimator based on optimal subsampling probabilities are also derived. Since the optimal subsampling probabilities depend on the full data estimate, an adaptive two-step algorithm is developed. Asymptotic normality and optimality of the estimator from this adaptive algorithm are established. The proposed methods are illustrated and evaluated through numerical experiments on simulated and real datasets.
We develop a novel decouple-recouple dynamic predictive strategy and contribute to the literature on forecasting and economic decision making in a data-rich environment. Under this framework, clusters of predictors generate different latent states in the form of predictive densities that are later synthesized within an implied time-varying latent factor model. As a result, the latent inter-dependencies across predictive densities and biases are sequentially learned and corrected. Unlike sparse modeling and variable selection procedures, we do not assume a priori that there is a given subset of active predictors, which characterize the predictive density of a quantity of interest. We test our procedure by investigating the predictive content of a large set of financial ratios and macroeconomic variables on both the equity premium across different industries and the inflation rate in the U.S., two contexts of topical interest in finance and macroeconomics. We find that our predictive synthesis framework generates both statistically and economically significant out-of-sample benefits while maintaining interpretability of the forecasting variables. In addition, the main empirical results highlight that our proposed framework outperforms both LASSO-type shrinkage regressions, factor based dimension reduction, sequential variable selection, and equal-weighted linear pooling methodologies.
In this paper, we analyze the convergence rate of a collapsed Gibbs sampler for crossed random effects models. Our results apply to a substantially larger range of models than previous works, including models that incorporate missingness mechanism and unbalanced level data. The theoretical tools involved in our analysis include a connection between relaxation time and autoregression matrix, concentration inequalities, and random matrix theory.
This work relates the framework of model-based clustering for spatial functional data where the data are surfaces. We first introduce a Bayesian spatial spline regression model with mixed-effects (BSSR) for modeling spatial function data. The BSSR model is based on Nodal basis functions for spatial regression and accommodates both common mean behavior for the data through a fixed-effects part, and variability inter-individuals thanks to a random-effects part. Then, in order to model populations of spatial functional data issued from heterogeneous groups, we integrate the BSSR model into a mixture framework. The resulting model is a Bayesian mixture of spatial spline regressions with mixed-effects (BMSSR) used for density estimation and model-based surface clustering. The models, through their Bayesian formulation, allow to integrate possible prior knowledge on the data structure and constitute a good alternative to recent mixture of spatial spline regressions model estimated in a maximum likelihood framework via the expectation-maximization (EM) algorithm. The Bayesian model inference is performed by Markov Chain Monte Carlo (MCMC) sampling. We derive two Gibbs sampler to infer the BSSR and the BMSSR models and apply them on simulated surfaces and a real problem of handwritten digit recognition using the MNIST data set. The obtained results highlight the potential benefit of the proposed Bayesian approaches for modeling surfaces possibly dispersed in particular in clusters.