We study the problem of aggregation under the squared loss in the regression model with deterministic design. We obtain sharp PAC-Bayesian risk bounds for aggregates defined via exponential weights, under general assumptions on the distribution of errors and on the functions to aggregate. We then apply these results to derive sparsity oracle inequalities.
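As a concrete illustration (not the paper's exact estimator), here is a minimal sketch of exponentially weighted aggregation under squared loss; the temperature beta and the toy predictors are illustrative assumptions of ours.

```python
import numpy as np

def exponential_weights(preds, y, beta=1.0):
    """Aggregate candidate predictors via exponential weights under squared loss.

    preds : (M, n) array, rows are the M candidate predictors evaluated
            at the n design points; y : (n,) observed responses.
    beta  : temperature; larger beta concentrates weight on low-risk predictors.
    """
    risks = np.mean((preds - y) ** 2, axis=1)   # empirical squared risk of each predictor
    logw = -beta * risks
    logw -= logw.max()                          # stabilize the softmax
    w = np.exp(logw)
    w /= w.sum()
    return w @ preds, w                         # aggregate predictor and its weights

# Toy usage: three candidate predictors of a noisy signal on a fixed design
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(50)
preds = np.stack([np.sin(2 * np.pi * x), np.cos(2 * np.pi * x), x])
agg, w = exponential_weights(preds, y, beta=5.0)
print(np.round(w, 3))  # most weight falls on the well-specified predictor
```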
Shrinkage priors are becoming increasingly popular in Bayesian modeling of high-dimensional sparse problems due to their computational efficiency. Recent work shows that a polynomially decaying prior leads to satisfactory posterior asymptotics under regression models. The existing literature has investigated how the global shrinkage parameter, i.e., the scale parameter, of a heavy-tailed prior affects the posterior contraction. In this work, we explore how the shape of the prior, or more specifically the polynomial order of the prior tail, affects the posterior. We discover that, under the sparse normal means model, the polynomial order does affect the multiplicative constant of the posterior contraction rate. More importantly, if the polynomial order is sufficiently close to 1, it induces optimal Bayesian posterior convergence, in the sense that the posterior contraction rate is sharply minimax, i.e., not only its order but also its multiplicative constant is optimal. This Bayesian sharp minimaxity holds when the global shrinkage parameter follows a deterministic choice that depends on the unknown sparsity $s$. We therefore further propose a Beta-prior modeling, so that our sharply minimax Bayesian procedure is adaptive to the unknown $s$. Our theoretical discoveries are supported by simulation studies.
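A minimal numerical sketch of the heavy-tailed shrinkage at play, assuming an illustrative prior density $\pi(\theta) \propto (1+|\theta|/\tau)^{-(1+\alpha)}$ with polynomial order $\alpha$ and global scale $\tau$; this specific form and the parameter values are our assumptions, not the paper's construction.

```python
import numpy as np

def posterior_mean_polytail(x, alpha=1.1, tau=0.01):
    """Posterior mean of theta given x ~ N(theta, 1), under an illustrative
    polynomially decaying prior pi(theta) ~ (1 + |theta|/tau)^(-(1+alpha)),
    computed by quadrature on a grid.

    alpha : polynomial order of the prior tail (values near 1 mimic the
            heavy-tailed regime discussed above); tau : global scale.
    """
    grid = np.linspace(-20.0, 20.0, 20001)
    logp = -(1 + alpha) * np.log1p(np.abs(grid) / tau) - 0.5 * (x - grid) ** 2
    w = np.exp(logp - logp.max())
    return np.sum(grid * w) / np.sum(w)

# Heavy shrinkage of small observations, almost none for large ones
for x in (0.5, 3.0, 8.0):
    print(x, round(posterior_mean_polytail(x), 3))
```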
We establish exponential bounds for the hypergeometric distribution which include a finite-sampling correction factor but are otherwise analogous to bounds for the binomial distribution due to Leon and Perron (2003) and Talagrand (1994). We also establish a convex ordering for sampling without replacement from populations of real numbers between zero and one: a population consisting only of zeros and ones (which hence yields a hypergeometric distribution in the upper bound) gives the extreme case.
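The paper's sharper constants are not reproduced here; as a numerical sanity check, the sketch below compares the exact hypergeometric upper tail with a classical Hoeffding/Serfling-style bound exp(-2nt^2/(1-f)), whose finite-sampling correction f = (n-1)/N plays the role described above.

```python
import numpy as np
from scipy.stats import hypergeom

# Population of size N with K ones; sample n without replacement.
N, K, n = 1000, 300, 100
mu = n * K / N                 # mean of the hypergeometric count
f = (n - 1) / N                # classical finite-sampling correction factor
for t in (0.05, 0.10, 0.15):
    exact = hypergeom.sf(mu + n * t - 1, N, K, n)   # P(X >= mu + n*t)
    bound = np.exp(-2 * n * t**2 / (1 - f))
    print(f"t={t:.2f}  exact={exact:.3e}  bound={bound:.3e}")
```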
Datasets displaying temporal dependencies abound in science and engineering applications, with Markov models representing a simplified and popular view of the temporal dependence structure. In this paper, we consider Bayesian settings that place prior distributions over the parameters of the transition kernel of a Markov model, and we seek to characterize the resulting, typically intractable, posterior distributions. We present a PAC-Bayesian analysis of variational Bayes (VB) approximations to tempered Bayesian posterior distributions, bounding the model risk of the VB approximations. Tempered posteriors are known to be robust to model misspecification, and their variational approximations do not suffer the usual overconfidence of variational approximations. Our results tie the risk bounds to the mixing and ergodic properties of the Markov data-generating model. We illustrate the PAC-Bayes bounds through a number of example Markov models, and we also consider the situation where the Markov model is misspecified.
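As a minimal illustration of tempering for Markov data (a deliberately conjugate special case, not the paper's general VB setup): with independent Dirichlet priors on the rows of a finite-state transition kernel, the tempered posterior, with the likelihood raised to a power alpha, is again Dirichlet, so a mean-field Dirichlet variational family recovers it exactly here.

```python
import numpy as np

def tempered_posterior(chain, n_states, alpha=0.8, prior=1.0):
    """Tempered posterior over the rows of a Markov transition kernel.

    With independent Dirichlet(prior) rows and the likelihood raised to
    the power alpha (the temperature), each row's tempered posterior is
    again Dirichlet with parameters prior + alpha * transition counts.
    """
    counts = np.zeros((n_states, n_states))
    for s, t in zip(chain[:-1], chain[1:]):
        counts[s, t] += 1.0
    return prior + alpha * counts              # Dirichlet parameters, one row per state

rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1], [0.3, 0.7]])         # true transition kernel
chain = [0]
for _ in range(500):
    chain.append(rng.choice(2, p=P[chain[-1]]))
post = tempered_posterior(chain, n_states=2)
print(post / post.sum(axis=1, keepdims=True))  # posterior-mean estimate of P
```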
This paper considers Bayesian multiple testing under sparsity for polynomial-tailed distributions satisfying a monotone likelihood ratio property. Included in this class of distributions are the Student's t, the Pareto, and many other distributions. We prove some general asymptotic optimality results under fixed and random thresholding. As examples of these general results, we establish the Bayesian asymptotic optimality of several multiple testing procedures from the literature for appropriately chosen false discovery rate levels. We also show by simulation that the Benjamini-Hochberg procedure with a false discovery rate level different from the asymptotically optimal one can lead to high Bayes risk.
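For reference, a minimal implementation of the Benjamini-Hochberg step-up procedure (a standard textbook version; the FDR level q and the toy sparse-means data are illustrative choices, not the paper's simulation design).

```python
import numpy as np
from scipy.stats import norm

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure at FDR level q: reject the
    hypotheses with the k smallest p-values, where k is the largest i
    such that p_(i) <= i*q/m; returns a boolean rejection mask."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        reject[order[:k + 1]] = True
    return reject

# Sparse means: 10 signals among 990 nulls, two-sided normal p-values
rng = np.random.default_rng(2)
z = np.concatenate([rng.normal(4.0, 1.0, 10), rng.normal(0.0, 1.0, 990)])
pvals = 2 * norm.sf(np.abs(z))
print(benjamini_hochberg(pvals, q=0.1).sum(), "rejections")
```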
Gaussian Processes (GPs) are a generic modelling tool for supervised learning. While they have been successfully applied to large datasets, their use in safety-critical applications is hindered by the lack of good performance guarantees. To address this, we propose a method to learn GPs and their sparse approximations by directly optimizing a PAC-Bayesian bound on their generalization performance, instead of maximizing the marginal likelihood. Besides its theoretical appeal, we find in our evaluation that our learning method is robust and yields significantly better generalization guarantees than other common GP approaches on several regression benchmark datasets.
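The paper's exact training objective is not reproduced here, but a common ingredient of such guarantees is inverting the PAC-Bayes-kl inequality; below is a minimal sketch, assuming the usual form kl(R_hat || R) <= (KL(Q||P) + ln(2*sqrt(n)/delta))/n for a [0,1]-bounded loss.

```python
import numpy as np
from scipy.optimize import brentq

def kl_bernoulli(q, p):
    """Binary KL divergence kl(q || p)."""
    tiny = 1e-12
    q, p = np.clip(q, tiny, 1 - tiny), np.clip(p, tiny, 1 - tiny)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def pac_bayes_kl_bound(emp_risk, kl_post_prior, n, delta=0.05):
    """Upper-invert kl(emp_risk || true_risk) <= eps, with
    eps = (KL(Q||P) + ln(2*sqrt(n)/delta)) / n, via root finding;
    returns the resulting bound on the true risk."""
    eps = (kl_post_prior + np.log(2 * np.sqrt(n) / delta)) / n
    if kl_bernoulli(emp_risk, 1 - 1e-9) <= eps:
        return 1.0   # the inequality is vacuous at this sample size
    return brentq(lambda r: kl_bernoulli(emp_risk, r) - eps, emp_risk, 1 - 1e-9)

# E.g. empirical risk 0.10, KL(Q||P) = 20 nats, n = 5000 training points
print(round(pac_bayes_kl_bound(0.10, 20.0, 5000), 4))
```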