This paper investigates the integration of gradient boosted decision trees and varying coefficient models. We introduce the tree boosted varying coefficient framework, which justifies using boosted decision trees as nonparametric effect modifiers in varying coefficient models. This framework requires no structural assumptions on the space containing the varying coefficient covariates, is easy to implement, and balances model complexity against interpretability. To provide statistical guarantees, we prove the asymptotic consistency of the proposed method in the regression setting with $L^2$ loss. We further conduct a thorough empirical study to show that the proposed method provides accurate predictions as well as intelligible visual explanations.
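To make the idea concrete, the following is a minimal sketch of boosting trees as effect modifiers under $L^2$ loss, not the authors' implementation. It assumes a single effect-modifier covariate `z`, depth-one trees (stumps), and a fixed learning rate; the function names `fit_stump`, `fit_tbvc`, and `predict_beta`, the simulated data, and all tuning values are hypothetical. In each round, every coefficient function $\beta_j(z)$ is updated by a stump fitted to the negative gradient of the squared error with respect to $\beta_j(z_i)$, which is the residual times $x_{ij}$.

```python
import random

def fit_stump(z, g, min_leaf=20):
    """Least-squares regression stump on a 1-D feature z with targets g.
    Returns (split, left_value, right_value); split is None if no split is valid."""
    n = len(z)
    order = sorted(range(n), key=lambda i: z[i])
    total = sum(g)
    best_gain, best = float("-inf"), (None, total / n, total / n)
    left_sum = 0.0
    for k in range(1, n):
        left_sum += g[order[k - 1]]
        if k < min_leaf or n - k < min_leaf or z[order[k]] == z[order[k - 1]]:
            continue
        right_sum = total - left_sum
        # Minimising squared error is equivalent to maximising this quantity.
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if gain > best_gain:
            split = 0.5 * (z[order[k - 1]] + z[order[k]])
            best_gain, best = gain, (split, left_sum / k, right_sum / (n - k))
    return best

def stump_predict(stump, zi):
    split, left, right = stump
    return left if split is None or zi <= split else right

def fit_tbvc(z, X, y, rounds=300, lr=0.1):
    """Boost one stump ensemble per coefficient function beta_j(z)."""
    p = len(X[0])
    ensembles = [[] for _ in range(p)]
    beta = [[0.0] * p for _ in z]  # current beta_j(z_i) for every observation
    for _ in range(rounds):
        resid = [yi - sum(b * x for b, x in zip(bi, xi))
                 for yi, bi, xi in zip(y, beta, X)]
        for j in range(p):
            # Negative gradient of 0.5 * (y_i - x_i' beta(z_i))^2 w.r.t. beta_j(z_i).
            g = [r * xi[j] for r, xi in zip(resid, X)]
            stump = fit_stump(z, g)
            ensembles[j].append(stump)
            for i, zi in enumerate(z):
                beta[i][j] += lr * stump_predict(stump, zi)
    return ensembles, lr

def predict_beta(model, zi):
    """Evaluate the fitted coefficient vector beta(z) at a new point zi."""
    ensembles, lr = model
    return [lr * sum(stump_predict(s, zi) for s in ens) for ens in ensembles]

# Simulated data (assumption): coefficients jump at z = 0.5.
rng = random.Random(0)
n = 400
z = [rng.random() for _ in range(n)]
X = [[1.0, rng.uniform(-1.0, 1.0)] for _ in range(n)]

def true_beta(zi):
    return [1.0, -2.0] if zi < 0.5 else [3.0, 2.0]

y = [sum(b * x for b, x in zip(true_beta(zi), xi)) + rng.gauss(0.0, 0.1)
     for zi, xi in zip(z, X)]

model = fit_tbvc(z, X, y)
beta_low = predict_beta(model, 0.25)   # should be close to [1, -2]
beta_high = predict_beta(model, 0.75)  # should be close to [3, 2]
```

Because the fitted model remains linear in `X` at any fixed `z`, the coefficient vector returned by `predict_beta` can be read off or plotted against `z`, which is where the interpretability claimed for the framework comes from.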
Nonparametric varying coefficient (NVC) models are useful for modeling time-varying effects on responses that are measured repeatedly. In this paper, we introduce the nonparametric varying coefficient spike-and-slab lasso (NVC-SSL) for Bayesian estimation and variable selection in NVC models. The NVC-SSL simultaneously selects and estimates the significant varying coefficients, while also accounting for temporal correlations. Our model can be implemented using a computationally efficient expectation-maximization (EM) algorithm. We also employ a simple method to make our model robust to misspecification of the temporal correlation structure. In contrast to frequentist approaches, little is known about the large-sample properties for Bayesian NVC models when the dimension of the covariates $p$ grows much faster than sample size $n$. In this paper, we derive posterior contraction rates for the NVC-SSL model when $p \gg n$ under both correct specification and misspecification of the temporal correlation structure. Thus, our results are derived under weaker assumptions than those seen in other high-dimensional NVC models which assume independent and identically distributed (iid) random errors. Finally, we illustrate our methodology through simulation studies and data analysis. Our method is implemented in the publicly available R package NVCSSL.
In this paper, we propose a varying coefficient panel data model with unobservable multiple interactive fixed effects that are correlated with the regressors. We approximate each coefficient function by B-splines, and propose a robust nonlinear iteration scheme based on the least squares method to estimate the coefficient functions of interest. We also establish the asymptotic theory of the resulting estimators under certain regularity assumptions, including consistency, the convergence rate, and the asymptotic distribution. Furthermore, we develop a least squares dummy variable method to study an important special case of the proposed model: the varying coefficient panel data model with additive fixed effects. To construct pointwise confidence intervals for the coefficient functions, a residual-based block bootstrap method is proposed to reduce the computational burden and to avoid accumulated errors. Simulation studies and a real data analysis are also carried out to assess the performance of our proposed methods.
Bayesian quadrature (BQ) is a method for solving numerical integration problems in a Bayesian manner, which allows users to quantify their uncertainty about the solution. The standard approach to BQ is based on a Gaussian process (GP) approximation of the integrand. As a result, BQ is inherently limited to cases where GP approximations can be done in an efficient manner, thus often prohibiting very high-dimensional or non-smooth target functions. This paper proposes to tackle this issue with a new Bayesian numerical integration algorithm based on Bayesian Additive Regression Trees (BART) priors, which we call BART-Int. BART priors are easy to tune and well-suited for discontinuous functions. We demonstrate that they also lend themselves naturally to a sequential design setting and that explicit convergence rates can be obtained in a variety of settings. The advantages and disadvantages of this new methodology are highlighted on a set of benchmark tests including the Genz functions, and on a Bayesian survey design problem.
High-dimensional graphical models are often estimated using regularization that is aimed at reducing the number of edges in a network. In this work, we show how even simpler networks can be produced by aggregating the nodes of the graphical model. We develop a new convex regularized method, called the tree-aggregated graphical lasso or tag-lasso, that estimates graphical models that are both edge-sparse and node-aggregated. The aggregation is performed in a data-driven fashion by leveraging side information in the form of a tree that encodes node similarity and facilitates the interpretation of the resulting aggregated nodes. We provide an efficient implementation of the tag-lasso by using the locally adaptive alternating direction method of multipliers and illustrate our proposal's practical advantages in simulation and in applications in finance and biology.
Modern microbiome compositional data are often high-dimensional and exhibit complex dependency among microbial taxa. However, existing approaches to analyzing microbiome compositional data either do not adequately account for the complex dependency or lack scalability to high dimensions, which presents challenges in appropriately incorporating the random effects in microbiome compositions in the resulting statistical analysis. We introduce a generative model called the logistic-tree normal (LTN) model to address this need. The LTN marries two popular classes of models -- the log-ratio normal (LN) and the Dirichlet-tree (DT) -- and inherits key benefits of each. LN models are flexible in characterizing covariance among taxa but lack scalability to higher dimensions; DT avoids this issue through a tree-based binomial decomposition but incurs restrictive covariance. The LTN incorporates the tree-based decomposition as the DT does, but it jointly models the corresponding binomial probabilities using a (multivariate) logistic-normal distribution as in LN models. It therefore allows rich covariance structures, as in LN models, along with computational efficiency realized through a Pólya-Gamma augmentation on the binomial models at the tree nodes. Accordingly, Bayesian inference on LTN can readily proceed by Gibbs sampling. The LTN also allows common techniques for effective inference on high-dimensional data -- such as those based on sparsity and low-rank assumptions in the covariance structure -- to be readily incorporated. Depending on the goal of the analysis, LTN can be used either as a standalone model or embedded into more sophisticated hierarchical models. We demonstrate its use in estimating taxa covariance and in mixed-effects modeling. Finally, we carry out an extensive case study using an LTN-based mixed-effects model to analyze a longitudinal dataset from the DIABIMMUNE project.
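The tree-based decomposition described above can be illustrated with a small sketch, offered under stated assumptions rather than as the authors' code. It assumes a hypothetical four-taxon binary tree (`TREE`), a hypothetical helper `leaf_probs`, and log-odds drawn with a shared random factor so that the internal nodes are correlated, standing in for a general multivariate normal. Each internal node carries a branching probability given by the logistic transform of its log-odds, and a taxon's relative abundance is the product of the branching probabilities along its root-to-leaf path.

```python
import math
import random

# Hypothetical taxonomy: the root splits {A, B} from {C, D}.
# Internal nodes are (name, left_child, right_child); leaves are strings.
TREE = ("root", ("left", "A", "B"), ("right", "C", "D"))

def leaf_probs(node, log_odds, prob=1.0, out=None):
    """Map node-level log-odds to leaf (taxon) probabilities.

    log_odds[name] is the log-odds of branching to the left child at that node.
    """
    if out is None:
        out = {}
    if isinstance(node, str):  # reached a leaf: record the path product
        out[node] = prob
        return out
    name, left, right = node
    theta = 1.0 / (1.0 + math.exp(-log_odds[name]))  # logistic transform
    leaf_probs(left, log_odds, prob * theta, out)
    leaf_probs(right, log_odds, prob * (1.0 - theta), out)
    return out

# Correlated normal log-odds via a shared factor (an illustrative assumption).
rng = random.Random(1)
shared = rng.gauss(0.0, 1.0)
log_odds = {name: 0.5 * shared + rng.gauss(0.0, 1.0)
            for name in ("root", "left", "right")}
probs = leaf_probs(TREE, log_odds)

# With all log-odds equal to zero, every split is even.
uniform = leaf_probs(TREE, {"root": 0.0, "left": 0.0, "right": 0.0})
```

Whatever the log-odds, the leaf probabilities sum to one by construction, so the normal distribution on the node-level log-odds induces a valid distribution on compositions while each node remains a simple binomial model.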