In multivariate data analysis, it is often important to estimate a graph characterizing dependence among $p$ variables. A popular strategy uses the non-zero entries in a $p \times p$ covariance or precision matrix, typically requiring restrictive modeling assumptions for accurate graph recovery. To improve model robustness, we instead focus on estimating the \emph{backbone} of the dependence graph. We use a spanning tree likelihood, based on a minimalist graphical model that is purposely overly simplified. Taking a Bayesian approach, we place a prior on the space of trees and quantify uncertainty in the graphical model. In both theory and experiments, we show that this model does not require the population graph to be a spanning tree or the covariance to satisfy assumptions beyond positive-definiteness. The model accurately recovers the backbone of the population graph at a rate competitive with existing approaches but with better robustness. We show combinatorial properties of the spanning tree, which may be of independent interest, and develop an efficient Gibbs sampler for Bayesian inference. Analyzing electroencephalography data using a Hidden Markov Model with each latent state modeled by a spanning tree, we show that results are much more interpretable compared with popular alternatives.
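As a point of reference, one standard way to write a spanning-tree graphical model likelihood (not necessarily the exact parameterization used in this paper) is the tree-structured factorization: for a spanning tree $T$ on the $p$ variables with edge set $E(T)$,
\[
p(x_1,\dots,x_p \mid T) \;=\; \prod_{v=1}^{p} p(x_v) \prod_{(u,v)\in E(T)} \frac{p(x_u,x_v)}{p(x_u)\,p(x_v)},
\]
so the joint law is built from univariate and bivariate margins only, and a prior over trees $T$ induces posterior uncertainty about which edges form the backbone.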
We present new short proofs of known spanning tree enumeration formulae for threshold and Ferrers graphs by showing that the Laplacian matrices of such graphs admit triangular rank-one perturbations. We then characterize the set of graphs whose Laplacian matrices admit triangular rank-one perturbations as the class of special 2-threshold graphs, introduced by Hung, Kloks, and Villaamil. Our work introduces (1) a new characterization of special 2-threshold graphs that generalizes the characterization of threshold graphs in terms of isolated and dominating vertices, and (2) a spanning tree enumeration formula for special 2-threshold graphs that reduces to the aforementioned formulae for threshold and Ferrers graphs. We consider both unweighted and weighted spanning tree enumeration.
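For reference, the Laplacian route to spanning tree enumeration is Kirchhoff's Matrix-Tree theorem: for a connected graph $G$ on $n$ vertices whose Laplacian $L(G)$ has nonzero eigenvalues $\lambda_1,\dots,\lambda_{n-1}$, the number of spanning trees is
\[
\tau(G) \;=\; \frac{1}{n}\prod_{i=1}^{n-1}\lambda_i,
\]
equivalently any cofactor of $L(G)$; in the weighted setting the same quantity counts spanning trees weighted by the product of their edge weights. A rank-one perturbation bringing $L(G)$ to triangular form is useful precisely because the determinant of a triangular matrix is the product of its diagonal entries, which makes such cofactor computations explicit.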
Bayesian methods, whether based on Bayes factors or BIC, are now widely used for model selection. One property that might reasonably be demanded of any model selection method is that if a model $M_1$ is preferred to a model $M_0$ when these two models are expressed as members of one model class $\mathbb{M}$, then this preference is preserved when they are embedded in a different class $\mathbb{M}'$. However, we illustrate in this paper that with the usual implementation of these common Bayesian procedures this property does not hold true even approximately. We therefore contend that to use these methods it is first necessary for there to exist a natural embedding class. We argue that in any context like the one illustrated in our running example, Bayesian model selection of binary phylogenetic trees, there is no such embedding.
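To fix notation (these are the standard definitions, not anything specific to this paper): with data $y$, the Bayes factor and BIC for models $M_0$ and $M_1$ are
\[
\mathrm{BF}_{10} = \frac{p(y \mid M_1)}{p(y \mid M_0)},
\qquad
\mathrm{BIC}_{M} = -2\log\hat{L}_{M} + k_{M}\log n,
\]
where $\hat{L}_M$ is the maximized likelihood, $k_M$ the number of free parameters, and $n$ the sample size, with the familiar large-sample link $2\log\mathrm{BF}_{10} \approx \mathrm{BIC}_{0} - \mathrm{BIC}_{1}$. The marginal likelihoods $p(y \mid M_i)$ depend on the priors induced by the embedding class, which is the source of the sensitivity described above.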
In applied multivariate statistics, estimating the number of latent dimensions or the number of clusters is a fundamental and recurring problem. One common diagnostic is the scree plot, which shows the largest eigenvalues of the data matrix: the user searches for a gap or elbow in the decreasing eigenvalues. Unfortunately, these patterns can hide beneath the bias of the sample eigenvalues. This methodological problem is conceptually difficult because, in many situations, there is only enough signal to detect a subset of the $k$ population dimensions/eigenvectors. In this situation, one could argue that the correct choice of $k$ is the number of detectable dimensions. We alleviate these problems with cross-validated eigenvalues. Under a large class of random graph models, without any parametric assumptions, we provide a p-value for each sample eigenvector that tests the null hypothesis that this sample eigenvector is orthogonal to (i.e., uncorrelated with) the true latent dimensions. This approach naturally adapts to problems where some dimensions are not statistically detectable. In scenarios where all $k$ dimensions can be estimated, we prove that our procedure consistently estimates $k$. In simulations and a data example, the proposed estimator compares favorably to alternative approaches in both computational and statistical performance.
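Schematically, and only as an illustration of the idea rather than the paper's exact test statistic: if the adjacency matrix is split into independent halves $A^{\mathrm{train}}$ and $A^{\mathrm{test}}$, and $\hat{u}_j$ is the $j$-th sample eigenvector of $A^{\mathrm{train}}$, then a cross-validated eigenvalue of the form
\[
\hat{\lambda}^{\mathrm{cv}}_{j} = \hat{u}_j^{\top} A^{\mathrm{test}} \hat{u}_j
\]
measures how strongly $\hat{u}_j$ aligns with signal in the held-out half; under the null hypothesis that $\hat{u}_j$ is orthogonal to the latent dimensions, a suitably centred and scaled version of this statistic is approximately normal, which yields the p-value.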
Bayesian quadrature (BQ) is a method for solving numerical integration problems in a Bayesian manner, which allows users to quantify their uncertainty about the solution. The standard approach to BQ is based on a Gaussian process (GP) approximation of the integrand. As a result, BQ is inherently limited to cases where GP approximations can be done efficiently, which often rules out very high-dimensional or non-smooth target functions. This paper proposes to tackle this issue with a new Bayesian numerical integration algorithm based on Bayesian Additive Regression Trees (BART) priors, which we call BART-Int. BART priors are easy to tune and well-suited for discontinuous functions. We demonstrate that they also lend themselves naturally to a sequential design setting and that explicit convergence rates can be obtained in a variety of settings. The advantages and disadvantages of this new methodology are highlighted on a set of benchmark tests, including the Genz functions, and on a Bayesian survey design problem.
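For reference, the standard GP-based BQ formulae that BART-Int moves away from: with prior $f \sim \mathcal{GP}(0,k)$, evaluations $\mathbf{f} = (f(x_1),\dots,f(x_n))^{\top}$, kernel matrix $K_{ij} = k(x_i,x_j)$ and kernel mean $z_i = \int k(x,x_i)\,\mathrm{d}\pi(x)$, the posterior on $\Pi[f] = \int f \,\mathrm{d}\pi$ is Gaussian with
\[
\mathbb{E}\big[\Pi[f]\mid\mathbf{f}\big] = z^{\top}K^{-1}\mathbf{f},
\qquad
\mathbb{V}\big[\Pi[f]\mid\mathbf{f}\big] = \iint k(x,x')\,\mathrm{d}\pi(x)\,\mathrm{d}\pi(x') - z^{\top}K^{-1}z.
\]
Replacing the GP with a BART prior means the posterior on $\Pi[f]$ is instead induced by integrating posterior draws of the sum-of-trees function against $\pi$.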
Estimation of the long-term health effects of air pollution is a challenging task, especially when modelling small-area disease incidence data in an ecological study design. The challenge comes from the unobserved underlying spatial correlation structure in these data, which is typically accounted for using random effects modelled by a globally smooth conditional autoregressive model. These smooth random effects confound the effects of air pollution, which are also globally smooth. To avoid this collinearity, a Bayesian localised conditional autoregressive model is developed for the random effects. This localised model is spatially flexible, in the sense that it is able not only to model step changes in the random effects surface but also to capture areas of spatial smoothness in the study region. This methodological development allows us to improve the estimation performance of the covariate effects, compared to using traditional conditional autoregressive models. These results are established using a simulation study, and are then illustrated with our motivating study on air pollution and respiratory ill health in Greater Glasgow, Scotland in 2010. The model shows substantial health effects of particulate matter air pollution and income deprivation, effects which are consistently attenuated by the currently available globally smooth models.
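For concreteness, one common globally smooth specification of the kind referred to here is the intrinsic CAR prior (given as background; the localised model developed in the paper departs from it): with neighbourhood weights $w_{ij}$ over the small areas, the random effects $\phi_i$ have full conditionals
\[
\phi_i \mid \phi_{-i} \;\sim\; \mathrm{N}\!\left(\frac{\sum_{j} w_{ij}\phi_j}{\sum_{j} w_{ij}},\; \frac{\tau^2}{\sum_{j} w_{ij}}\right),
\]
so each area's effect is shrunk towards the average of its neighbours; it is this global smoothing that can become collinear with a spatially smooth pollution covariate, and a localised alternative relaxes it, for example by allowing some neighbourhood relations to drop out of the smoothing.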