Assessing variable activity for Bayesian regression trees

Added by Akira Horiguchi

Publication date 2020

fields Mathematical Statistics

Research language English

Authors Akira Horiguchi Department of Statistics

Methodology

Bayesian Additive Regression Trees (BART) are non-parametric models that can capture complex exogenous variable effects. In any regression problem, it is often of interest to learn which variables are most active. Variable activity in BART is usually measured by counting the number of times a tree splits for each variable. Such one-way counts have the advantage of fast computations. Despite their convenience, one-way counts have several issues. They are statistically unjustified, cannot distinguish between main effects and interaction effects, and become inflated when measuring interaction effects. An alternative method well-established in the literature is Sobol indices, a variance-based global sensitivity analysis technique. However, these indices often require Monte Carlo integration, which can be computationally expensive. This paper provides analytic expressions for Sobol indices for BART posterior samples. These expressions are easy to interpret and are computationally feasible. Furthermore, we will show a fascinating connection between first-order (main-effects) Sobol indices and one-way counts. We also introduce a novel ranking method, and use this to demonstrate that the proposed indices preserve the Sobol-based rank order of variable importance. Finally, we compare these methods using analytic test functions and the En-ROADS climate impacts simulator.
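For context on the Monte Carlo cost mentioned in the abstract, the following is a minimal sketch of a generic pick-freeze estimator of first-order Sobol indices. It is not the paper's analytic method. The function `first_order_sobol`, the toy model `toy`, and the sample size are assumptions chosen for illustration, and the inputs are assumed independent and uniform on [0, 1].

```python
import numpy as np

def first_order_sobol(f, d, n=100_000, seed=0):
    """Monte Carlo (pick-freeze) estimate of first-order Sobol indices.

    Assumes d independent inputs, each uniform on [0, 1]. `f` maps an
    (n, d) array to a length-n array. Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    A = rng.random((n, d))          # first independent sample
    B = rng.random((n, d))          # second independent sample
    fA, fB = f(A), f(B)
    total_var = np.var(np.concatenate([fA, fB]))
    S = np.empty(d)
    for i in range(d):
        ABi = A.copy()
        ABi[:, i] = B[:, i]         # A with column i taken from B
        # Estimates Var(E[f | X_i]); each input needs n extra evaluations
        S[i] = np.mean(fB * (f(ABi) - fA)) / total_var
    return S

def toy(x):
    # Additive toy model: x1 + 2*x2, with a third inactive input
    return x[:, 0] + 2.0 * x[:, 1]

indices = first_order_sobol(toy, d=3)
print(indices)  # analytic values for this toy model are 0.2, 0.8, 0.0
```

The estimator needs n(d + 2) model evaluations in total, which is the cost that closed-form expressions avoid.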

BART: Bayesian additive regression trees

Hugh A. Chipman, Edward I. George, Robert E. McCulloch 2010

We develop a Bayesian sum-of-trees model where each tree is constrained by a regularization prior to be a weak learner, and fitting and inference are accomplished via an iterative Bayesian backfitting MCMC algorithm that generates samples from a posterior. Effectively, BART is a nonparametric Bayesian regression approach which uses dimensionally adaptive random basis elements. Motivated by ensemble methods in general, and boosting algorithms in particular, BART is defined by a statistical model: a prior and a likelihood. This approach enables full posterior inference including point and interval estimates of the unknown regression function as well as the marginal effects of potential predictors. By keeping track of predictor inclusion frequencies, BART can also be used for model-free variable selection. BART's many features are illustrated with a bake-off against competing methods on 42 different data sets, with a simulation experiment and on a drug discovery classification problem.
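The predictor inclusion frequencies mentioned above can be sketched as follows. This is a minimal illustration, not the API of any BART package: the nested-list layout of `draws` (posterior draws, then trees, then split-variable indices) and the function name `inclusion_proportions` are hypothetical.

```python
import numpy as np

def inclusion_proportions(draws, n_vars):
    """Average per-draw share of splitting rules that use each predictor.

    `draws` is a hypothetical layout: a list of posterior draws, each a
    list of trees, each tree a list of split-variable indices.
    """
    props = []
    for trees in draws:
        counts = np.zeros(n_vars)
        for split_vars in trees:
            for v in split_vars:
                counts[v] += 1
        total = counts.sum()
        if total > 0:               # skip draws whose trees have no splits
            props.append(counts / total)
    if not props:
        return np.zeros(n_vars)
    return np.mean(props, axis=0)

# Two posterior draws, two trees each, three candidate predictors
draws = [
    [[0, 1], [0]],
    [[0], [2, 0]],
]
props = inclusion_proportions(draws, n_vars=3)
print(props)
```

Predictors with consistently high proportions are the candidates for selection. These counts do not separate main effects from interactions.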

Methodology Applications Machine Learning

On Soft Bayesian Additive Regression Trees and asynchronous longitudinal regression analysis

Hao Ran, Yang Bai 2021

In many longitudinal studies, the covariate and response are intermittently observed at irregular, mismatched and subject-specific times. How to handle such data when the covariate and response are observed asynchronously is a frequently raised problem. Bayesian Additive Regression Trees (BART) is a Bayesian nonparametric approach that has been shown to be competitive with the best modern predictive methods, such as random forests and boosted decision trees. The sum-of-trees structure combined with a Bayesian inferential framework provides an accurate and robust statistical method. A BART variant, soft Bayesian Additive Regression Trees (SBART), constructed using randomized decision trees, was developed and shown to have substantial theoretical and practical benefits. In this paper, we propose a weighted SBART model for asynchronous longitudinal data. In comparison to other methods, the proposed method is valid under few assumptions on the covariate process. Extensive simulation studies provide numerical support for this solution, and data from an HIV study are used to illustrate our methodology.

Methodology Applications

Nonparametric competing risks analysis using Bayesian Additive Regression Trees (BART)

Rodney Sparapani, Brent R. Logan, Robert E. McCulloch 2018

Many time-to-event studies are complicated by the presence of competing risks. Such data are often analyzed using Cox models for the cause-specific hazard function or Fine-Gray models for the subdistribution hazard. In practice, regression relationships in competing risks data with either strategy are often complex and may include nonlinear functions of covariates, interactions, high-dimensional parameter spaces and nonproportional cause-specific or subdistribution hazards. Model misspecification can lead to poor predictive performance. To address these issues, we propose a novel approach to flexible prediction modeling of competing risks data using Bayesian Additive Regression Trees (BART). We study the simulation performance in two-sample scenarios as well as a complex regression setting, and benchmark its performance against standard regression techniques as well as random survival forests. We illustrate the use of the proposed method on a recently published study of patients undergoing hematopoietic stem cell transplantation.

Methodology Applications

Bayesian Variable Selection for Linear Regression with the $\kappa$-$G$ Priors

Zichen Ma, Ernest Fokoue 2015

In this paper, we introduce a new methodology for Bayesian variable selection in linear regression that is independent of the traditional indicator method. A diagonal matrix $\mathbf{G}$ is introduced to the prior of the coefficient vector $\boldsymbol{\beta}$, with each diagonal entry $g_j$, bounded between $0$ and $1$, serving as a stabilizer of the corresponding $\beta_j$. Mathematically, a promising variable has a $g_j$ value that is close to $0$, whereas the value of $g_j$ corresponding to an unpromising variable is close to $1$. This property is proven in this paper under orthogonality, together with other asymptotic properties. Computationally, the sample path of each $g_j$ is obtained through a Metropolis-within-Gibbs sampling method. We also give two simulations to verify the capability of this methodology in variable selection.

Methodology

Bayesian sparse multiple regression for simultaneous rank reduction and variable selection

Antik Chakraborty, Anirban Bhattacharya, Bani K. Mallick 2016

We develop a Bayesian methodology aimed at simultaneously estimating low-rank and row-sparse matrices in a high-dimensional multiple-response linear regression model. We consider a carefully devised shrinkage prior on the matrix of regression coefficients which obviates the need to specify a prior on the rank, and shrinks the regression matrix towards low-rank and row-sparse structures. We provide theoretical support to the proposed methodology by proving minimax optimality of the posterior mean under the prediction risk in ultra-high dimensional settings where the number of predictors can grow sub-exponentially relative to the sample size. A one-step post-processing scheme induced by group lasso penalties on the rows of the estimated coefficient matrix is proposed for variable selection, with default choices of tuning parameters. We additionally provide an estimate of the rank using a novel optimization function achieving dimension reduction in the covariate space. We exhibit the performance of the proposed methodology in an extensive simulation study and a real data example.

Methodology Statistics Theory