Considerable effort has been directed to developing asymptotically minimax procedures in problems of recovering functions and densities. These methods often rely on somewhat arbitrary and restrictive assumptions such as isotropy or spatial homogeneity. This work enhances theoretical understanding of Bayesian forests (including BART) under substantially relaxed smoothness assumptions. In particular, we provide a comprehensive study of asymptotic optimality and posterior contraction of Bayesian forests when the regression function has anisotropic smoothness that possibly varies over the function domain. We introduce a new class of sparse piecewise heterogeneous anisotropic Hölder functions and derive their minimax rate of estimation in high-dimensional scenarios under the $L_2$ loss. Next, we find that the default Bayesian CART prior, coupled with a subset selection prior for sparse estimation in high-dimensional scenarios, adapts to unknown heterogeneous smoothness and sparsity. These results show that Bayesian forests are uniquely suited for more general estimation problems that would render other default machine learning tools, such as Gaussian processes, suboptimal. Beyond nonparametric regression, we also show that Bayesian forests can be successfully applied to many other problems, including density estimation and binary classification.
Since their inception in the 1980s, regression trees have been one of the more widely used non-parametric prediction methods. Tree-structured methods yield a histogram reconstruction of the regression surface, where the bins correspond to terminal nodes of recursive partitioning. Trees are powerful, yet susceptible to over-fitting. Strategies against overfitting have traditionally relied on pruning greedily grown trees. The Bayesian framework offers an alternative remedy against overfitting through priors. Roughly speaking, a good prior charges smaller trees where overfitting does not occur. While the consistency of random histograms, trees and their ensembles has been studied quite extensively, the theoretical understanding of their Bayesian counterparts has been missing. In this paper, we take a step towards understanding why and when Bayesian trees and their ensembles do not overfit. To address this question, we study the speed at which the posterior concentrates around the true smooth regression function. We propose a spike-and-tree variant of the popular Bayesian CART prior and establish new theoretical results showing that regression trees (and their ensembles) (a) are capable of recovering smooth regression surfaces, achieving optimal rates up to a log factor, (b) can adapt to the unknown level of smoothness and (c) can perform effective dimension reduction when $p>n$. These results provide a piece of missing theoretical evidence explaining why Bayesian trees (and additive variants thereof) have worked so well in practice.
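The "histogram reconstruction" view of trees can be illustrated with a minimal sketch (not the paper's Bayesian CART or spike-and-tree prior): a single greedy split of a one-dimensional sample into two bins, each predicted by its mean. The data and helper below are purely illustrative.

```python
# Illustrative sketch: a one-split regression tree as a histogram estimator.
# Terminal nodes correspond to bins; the fitted surface is piecewise constant
# (the bin means). Data and split criterion are illustrative, not from the paper.

def best_split(x, y):
    """Return the split point minimizing the total within-bin squared error."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    best_sse, best_point = float("inf"), None
    for k in range(1, len(xs)):
        left, right = ys[:k], ys[k:]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if sse < best_sse:
            best_sse, best_point = sse, (xs[k - 1] + xs[k]) / 2
    return best_point

# A sample with two clearly separated levels around x = 0.5.
x = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
y = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]
s = best_split(x, y)  # the split lands between the two clusters

def fit(t):
    """Piecewise-constant prediction: the mean of the bin containing t."""
    left = [v for u, v in zip(x, y) if u < s]
    right = [v for u, v in zip(x, y) if u >= s]
    return sum(left) / len(left) if t < s else sum(right) / len(right)
```

Growing deeper trees repeats this split recursively inside each bin; pruning (or, in the Bayesian view, a prior charging small trees) controls how fine the histogram becomes.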
Few methods in Bayesian non-parametric statistics and machine learning have received as much attention as Bayesian Additive Regression Trees (BART). While BART is now routinely used for prediction tasks, its theoretical properties began to be understood only very recently. In this work, we continue the theoretical investigation of BART initiated by Rockova and van der Pas (2017). In particular, we study the Bernstein-von Mises (BvM) phenomenon (i.e. asymptotic normality) for smooth linear functionals of the regression surface within the framework of non-parametric regression with fixed covariates. As with other adaptive priors, the BvM phenomenon may fail when the regularities of the functional and the truth are not compatible. To overcome the curse of adaptivity under hierarchical priors, we impose a self-similarity assumption to ensure convergence towards a single Gaussian distribution as opposed to a Gaussian mixture. Similar qualitative restrictions on the functional parameter are known to be necessary for adaptive inference. Many machine learning methods lack coherent probabilistic mechanisms for gauging uncertainty. BART readily provides such quantification via posterior credible sets. The BvM theorem implies that the credible sets are also confidence regions with the same asymptotic coverage. This paper presents the first asymptotic normality result for BART priors, providing another piece of evidence that BART is a valid tool from a frequentist point of view.
We prove uniform consistency of Random Survival Forests (RSF), a newly introduced forest ensemble learner for analysis of right-censored survival data. Consistency is proven under general splitting rules, bootstrapping, and random selection of variables -- that is, under true implementation of the methodology. A key assumption made is that all variables are factors. Although this assumes that the feature space has finite cardinality, in practice the space can be extremely large -- indeed, current computational procedures do not properly deal with this setting. An indirect consequence of this work is the introduction of new computational methodology for dealing with factors with an unlimited number of labels.
For estimating a lower bounded location or mean parameter for a symmetric and log-concave density, we investigate the frequentist performance of the $100(1-\alpha)\%$ Bayesian HPD credible set associated with priors which are truncations of flat priors onto the restricted parameter space. Various new properties are obtained. Namely, we identify precisely where the minimum coverage is obtained and we show that this minimum coverage is bounded between $1-\frac{3\alpha}{2}$ and $1-\frac{3\alpha}{2}+\frac{\alpha^2}{1+\alpha}$; with the lower bound $1-\frac{3\alpha}{2}$ improving (for $\alpha \leq 1/3$) on the previously established ([9]; [8]) lower bound $\frac{1-\alpha}{1+\alpha}$. Several illustrative examples are given.
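The stated bounds are easy to evaluate numerically. The arithmetic below (purely illustrative, not from the paper's proofs) compares the new lower bound $1-\frac{3\alpha}{2}$ with the earlier bound $\frac{1-\alpha}{1+\alpha}$ at the conventional level $\alpha = 0.05$.

```python
# Illustrative check of the coverage bounds at alpha = 0.05 (value chosen
# for illustration). The new lower bound should dominate the old one for
# alpha <= 1/3, and the upper bound exceeds the lower by alpha^2/(1+alpha).
alpha = 0.05
new_lower = 1 - 3 * alpha / 2                 # 0.925
upper = new_lower + alpha**2 / (1 + alpha)    # about 0.9274
old_lower = (1 - alpha) / (1 + alpha)         # about 0.9048
```

So at $\alpha = 0.05$ the minimum coverage of the nominal 95% HPD set is guaranteed to lie between roughly 0.925 and 0.927, noticeably tighter than the earlier guarantee of about 0.905.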
This paper deals with a new Bayesian approach to the standard one-sample $z$- and $t$-tests. More specifically, let $x_1,\ldots,x_n$ be an independent random sample from a normal distribution with mean $\mu$ and variance $\sigma^2$. The goal is to test the null hypothesis $\mathcal{H}_0: \mu=\mu_1$ against all possible alternatives. The approach is based on using the well-known formula of the Kullback-Leibler divergence between two normal distributions (sampling and hypothesized distributions selected in an appropriate way). The change in this distance from prior to posterior is assessed through the relative belief ratio (a measure of evidence). Eliciting the prior, checking for prior-data conflict and bias are also considered. Many theoretical properties of the procedure have been developed. Besides its simplicity, and unlike the classical approach, the new approach possesses attractive and distinctive features such as giving evidence in favor of the null hypothesis. It also avoids several undesirable paradoxes, such as Lindley's paradox, that may be encountered by some existing Bayesian methods. The use of the approach has been illustrated through several examples.
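The "well-known formula" referred to above is the standard closed form for the Kullback-Leibler divergence between two univariate normals. The sketch below implements only that formula (not the paper's full relative belief procedure); the example values are illustrative.

```python
import math

def kl_normal(mu1, s1, mu2, s2):
    """Standard closed form: KL( N(mu1, s1^2) || N(mu2, s2^2) ).

    log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2.
    """
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

# Identical distributions have zero divergence; a unit mean shift at
# unit variance gives KL = 1/2 (illustrative values).
d0 = kl_normal(0.0, 1.0, 0.0, 1.0)
d1 = kl_normal(1.0, 1.0, 0.0, 1.0)
```

In the paper's setting, this divergence between suitably chosen sampling and hypothesized normals plays the role of the distance whose prior-to-posterior change the relative belief ratio measures.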