No Arabic abstract
We introduce a new method of Bayesian wavelet shrinkage for reconstructing a signal when we observe a noisy version. Rather than making the common assumption that the wavelet coefficients of the signal are independent, we allow for the possibility that they are locally correlated in both location (time) and scale (frequency). This leads us to a prior structure which is analytically intractable, but it is possible to draw independent samples from a close approximation to the posterior distribution by an approach based on Coupling From The Past.
We consider perfect simulation algorithms for locally stable point processes based on dominated coupling from the past, and apply these methods in two different contexts. A new version of the algorithm is developed which is feasible for processes which are neither purely attractive nor purely repulsive. Such processes include multiscale area-interaction processes, which are capable of modelling point patterns whose clustering structure varies across scales. The other topic considered is nonparametric regression using wavelets, where we use a suitable area-interaction process on the discrete space of indices of wavelet coefficients to model the notion that if one wavelet coefficient is non-zero then it is more likely that neighbouring coefficients will be also. A method based on perfect simulation within this model shows promising results compared to the standard methods which threshold coefficients independently.
Clustering methods have led to a number of important discoveries in bioinformatics and beyond. A major challenge in their use is determining which clusters represent important underlying structure, as opposed to spurious sampling artifacts. This challenge is especially serious, and very few methods are available when the data are very high in dimension. Statistical Significance of Clustering (SigClust) is a recently developed cluster evaluation tool for high dimensional low sample size data. An important component of the SigClust approach is the very definition of a single cluster as a subset of data sampled from a multivariate Gaussian distribution. The implementation of SigClust requires the estimation of the eigenvalues of the covariance matrix for the null multivariate Gaussian distribution. We show that the original eigenvalue estimation can lead to a test that suffers from severe inflation of type-I error, in the important case where there are huge single spikes in the eigenvalues. This paper addresses this critical challenge using a novel likelihood based soft thresholding approach to estimate these eigenvalues which leads to a much improved SigClust. These major improvements in SigClust performance are shown by both theoretical work and an extensive simulation study. Applications to some cancer genomic data further demonstrate the usefulness of these improvements.
A large number of statistical models are doubly-intractable: the likelihood normalising term, which is a function of the model parameters, is intractable, as well as the marginal likelihood (model evidence). This means that standard inference techniques to sample from the posterior, such as Markov chain Monte Carlo (MCMC), cannot be used. Examples include, but are not confined to, massive Gaussian Markov random fields, autologistic models and Exponential random graph models. A number of approximate schemes based on MCMC techniques, Approximate Bayesian computation (ABC) or analytic approximations to the posterior have been suggested, and these are reviewed here. Exact MCMC schemes, which can be applied to a subset of doubly-intractable distributions, have also been developed and are described in this paper. As yet, no general method exists which can be applied to all classes of models with doubly-intractable posteriors. In addition, taking inspiration from the Physics literature, we study an alternative method based on representing the intractable likelihood as an infinite series. Unbiased estimates of the likelihood can then be obtained by finite time stochastic truncation of the series via Russian Roulette sampling, although the estimates are not necessarily positive. Results from the Quantum Chromodynamics literature are exploited to allow the use of possibly negative estimates in a pseudo-marginal MCMC scheme such that expectations with respect to the posterior distribution are preserved. The methodology is reviewed on well-known examples such as the parameters in Ising models, the posterior for Fisher-Bingham distributions on the $d$-Sphere and a large-scale Gaussian Markov Random Field model describing the Ozone Column data. This leads to a critical assessment of the strengths and weaknesses of the methodology with pointers to ongoing research.
We use the theory of normal variance-mean mixtures to derive a data augmentation scheme for models that include gamma functions. Our methodology applies to many situations in statistics and machine learning, including Multinomial-Dirichlet distributions, Negative binomial regression, Poisson-Gamma hierarchical models, Extreme value models, to name but a few. All of those models include a gamma function which does not admit a natural conjugate prior distribution providing a significant challenge to inference and prediction. To provide a data augmentation strategy, we construct and develop the theory of the class of Exponential Reciprocal Gamma distributions. This allows scalable EM and MCMC algorithms to be developed. We illustrate our methodology on a number of examples, including gamma shape inference, negative binomial regression and Dirichlet allocation. Finally, we conclude with directions for future research.
Many studies have reported associations between later-life cognition and socioeconomic position in childhood, young adulthood, and mid-life. However, the vast majority of these studies are unable to quantify how these associations vary over time and with respect to several demographic factors. Varying coefficient (VC) models, which treat the covariate effects in a linear model as nonparametric functions of additional effect modifiers, offer an appealing way to overcome these limitations. Unfortunately, state-of-the-art VC modeling methods require computationally prohibitive parameter tuning or make restrictive assumptions about the functional form of the covariate effects. In response, we propose VCBART, which estimates the covariate effects in a VC model using Bayesian Additive Regression Trees. With simple default hyperparameter settings, VCBART outperforms existing methods in terms of covariate effect estimation and prediction. Using VCBART, we predict the cognitive trajectories of 4,167 subjects from the Health and Retirement Study using multiple measures of socioeconomic position and physical health. We find that socioeconomic position in childhood and young adulthood have small effects that do not vary with age. In contrast, the effects of measures of mid-life physical health tend to vary with respect to age, race, and marital status. An R package implementing VC-BART is available at https://github.com/skdeshpande91/VCBART