This paper introduces new efficient algorithms for two problems: sampling conditional on vertex degrees in unweighted graphs, and sampling conditional on vertex strengths in weighted graphs. The algorithms can sample conditional on the presence or absence of an arbitrary number of edges. The resulting conditional distributions provide the basis for exact tests. Existing samplers based on MCMC or sequential importance sampling are generally not scalable; their efficiency degrades in sparse graphs. MCMC methods usually require explicit computation of a Markov basis to navigate the complex state space, which is computationally intensive even for small graphs. We use state-dependent kernel selection to develop new MCMC samplers that do not require a Markov basis and are efficient in both sparse and dense graphs. The key idea is to select the Markov kernel adaptively, on the basis of the current state of the chain. We apply our methods to testing hypotheses on a real network and a contingency table. The algorithms appear orders of magnitude more efficient than existing methods in the test cases considered.
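For concreteness, below is a minimal sketch of the classic double edge swap, the degree-preserving move that samplers of this kind build on. The function name and the plain-set graph representation are illustrative assumptions; the paper's state-dependent kernel selection is considerably more elaborate than this single fixed kernel.

```python
import random

def double_edge_swap(edges, n_attempts=1):
    """Degree-preserving swaps (u,v),(x,y) -> (u,x),(v,y).

    edges: set of frozenset({u, v}) pairs representing a simple
    undirected graph; modified in place and returned.
    """
    for _ in range(n_attempts):
        (u, v), (x, y) = (tuple(e) for e in random.sample(list(edges), 2))
        # Reject proposals that would create self-loops or multi-edges,
        # so the chain stays on simple graphs with the same degree sequence.
        if len({u, v, x, y}) < 4:
            continue
        e1, e2 = frozenset({u, x}), frozenset({v, y})
        if e1 in edges or e2 in edges:
            continue
        edges -= {frozenset({u, v}), frozenset({x, y})}
        edges |= {e1, e2}
    return edges
```

Run inside a Metropolis loop, rejected proposals leave the state unchanged; the resulting chain is symmetric and, when irreducible, targets the uniform distribution over simple graphs with the given degree sequence.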
We establish verifiable conditions under which Metropolis-Hastings (MH) algorithms with position-dependent proposal covariance matrices will or will not have a geometric rate of convergence. Some diffusion-based MH algorithms, such as the Metropolis-adjusted Langevin algorithm (MALA) and preconditioned MALA (PCMALA), have position-independent proposal variance, whereas for other variants, such as manifold MALA (MMALA), the proposal covariance matrix changes in every iteration. We thus provide conditions for geometric ergodicity of different variants of Langevin algorithms. These conditions are verified in the context of conditional simulation from the two most popular generalized linear mixed models (GLMMs), namely the binomial GLMM with logit link and the Poisson GLMM with log link. Empirical comparisons in the framework of some spatial GLMMs show that the computationally less expensive PCMALA, with an appropriately chosen preconditioning matrix, may outperform MMALA.
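As a point of reference, here is a minimal sketch of one PCMALA step with a fixed preconditioning matrix M. The functions log_pi and grad_log_pi, the step size h, and M are user-supplied placeholders; this is not the paper's MMALA construction, whose covariance changes with position.

```python
import numpy as np

def pcmala_step(x, log_pi, grad_log_pi, M, h, rng):
    """One MH step with Langevin proposal N(x + (h/2) M grad, h M)."""
    L = np.linalg.cholesky(M)          # in practice, precompute once
    Minv = np.linalg.inv(M)
    mean_x = x + 0.5 * h * M @ grad_log_pi(x)
    y = mean_x + np.sqrt(h) * L @ rng.standard_normal(x.size)
    mean_y = y + 0.5 * h * M @ grad_log_pi(y)

    def log_q(a, mean):  # log N(mean, h M) density at a, up to a constant
        d = a - mean
        return -0.5 * d @ Minv @ d / h

    # Normalizing constants cancel since both directions share covariance h M.
    log_alpha = (log_pi(y) + log_q(x, mean_y)
                 - log_pi(x) - log_q(y, mean_x))
    return y if np.log(rng.uniform()) < log_alpha else x
```

Taking M as the identity recovers plain MALA; choosing M from pilot-run curvature information is the kind of preconditioning the empirical comparison refers to.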
Monte Carlo (MC) sampling methods are widely applied in Bayesian inference, system simulation, and optimization problems. Markov chain Monte Carlo (MCMC) algorithms are a well-known class of MC methods which generate a Markov chain with the desired invariant distribution. In this document, we focus on the Metropolis-Hastings (MH) sampler, which can be considered the atom of MCMC techniques, introducing the basic notions and their different properties. We describe in detail all the elements involved in the MH algorithm and its most relevant variants. Several improvements and recent extensions proposed in the literature are also briefly discussed, providing a quick but thorough overview of the current landscape of Metropolis-based sampling.
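To fix notation, the sketch below implements the textbook random-walk MH sampler for a target known up to a normalizing constant; the function and parameter names are placeholders, not an interface defined in the document.

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples, scale=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    lp = log_target(x)
    samples = np.empty((n_samples, x.size))
    for i in range(n_samples):
        y = x + scale * rng.standard_normal(x.size)  # symmetric proposal
        lp_y = log_target(y)
        if np.log(rng.uniform()) < lp_y - lp:        # MH accept/reject
            x, lp = y, lp_y
        samples[i] = x                               # keep x on rejection
    return samples

# Example: sample from a standard normal target.
draws = metropolis_hastings(lambda x: -0.5 * x @ x, x0=[0.0], n_samples=5000)
```

Because the Gaussian random-walk proposal is symmetric, the proposal densities cancel in the acceptance ratio; the variants the document surveys differ mainly in how this proposal is constructed.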
Thompson sampling is a heuristic algorithm for the multi-armed bandit problem with a long tradition in machine learning. The algorithm has a Bayesian spirit in the sense that it selects arms based on posterior samples of the reward probability of each arm. By forging a connection between combinatorial binary bandits and spike-and-slab variable selection, we propose a stochastic optimization approach to subset selection called Thompson Variable Selection (TVS). TVS is a framework for interpretable machine learning that does not rely on the underlying model being linear. It brings together Bayesian reinforcement learning and machine learning in order to extend the reach of Bayesian subset selection to non-parametric models and large datasets with very many predictors and/or very many observations. Depending on the choice of reward, TVS can be deployed in offline as well as online setups with streaming data batches. Tailoring multi-play bandits to variable selection, we provide regret bounds without necessarily assuming that the arm mean rewards are unrelated. We show very strong empirical performance on both simulated and real data. Unlike deterministic optimization methods for spike-and-slab variable selection, the stochastic nature of TVS makes it less prone to convergence to local optima and thereby more robust.
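For the bandit ingredient, the following is a minimal sketch of Thompson sampling for a Bernoulli multi-armed bandit with independent Beta(1, 1) priors. The `pull` callback is a placeholder for the environment; TVS itself layers spike-and-slab posteriors and multi-play arms on top of this basic loop.

```python
import numpy as np

def thompson_sampling(pull, n_arms, n_rounds, seed=0):
    rng = np.random.default_rng(seed)
    successes = np.ones(n_arms)  # Beta(1, 1) prior pseudo-counts
    failures = np.ones(n_arms)
    for _ in range(n_rounds):
        theta = rng.beta(successes, failures)  # one posterior draw per arm
        arm = int(np.argmax(theta))            # play the apparently best arm
        reward = pull(arm)                     # observe a 0/1 reward
        successes[arm] += reward
        failures[arm] += 1 - reward
    return successes, failures

# Example: three arms with true success probabilities 0.2, 0.5, 0.7.
env = np.random.default_rng(1)
true_p = [0.2, 0.5, 0.7]
s, f = thompson_sampling(lambda a: env.binomial(1, true_p[a]), 3, 2000)
print(s / (s + f))  # posterior mean reward per arm
```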
Determining the number G of components in a finite mixture distribution is an important and difficult inference issue. It is a most important question because statistical inference about the resulting model is highly sensitive to the value of G: selecting an erroneous value of G may produce a poor density estimate. It is also a most difficult question from a theoretical perspective, as it relates to unidentifiability issues of the mixture model, and a most relevant question from a practical viewpoint, since the meaning of the number of components G is strongly related to the modelling purpose of a mixture distribution. We distinguish in this chapter between selecting G as a density estimation problem in Section 2 and selecting G in a model-based clustering framework in Section 3. Both sections discuss frequentist as well as Bayesian approaches. We present some of the Bayesian solutions to the different interpretations of picking the right number of components in a mixture, before concluding on the ill-posed nature of the question.
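As a simple illustration of the density-estimation view, the sketch below fits Gaussian mixtures over a grid of G and picks the BIC-minimizing value, using scikit-learn's GaussianMixture. This is one frequentist criterion among those discussed, not a substitute for the Bayesian solutions the chapter presents; the simulated data are purely illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two-component univariate mixture as toy data.
X = np.concatenate([rng.normal(-2.0, 1.0, 300),
                    rng.normal(3.0, 0.5, 200)]).reshape(-1, 1)

# Fit a mixture for each candidate G and score it by BIC.
bics = {g: GaussianMixture(n_components=g, random_state=0).fit(X).bic(X)
        for g in range(1, 6)}
best_G = min(bics, key=bics.get)
print(best_G, bics)
```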
Under measurement constraints, responses are expensive to measure and initially unavailable for most records in the dataset, while the covariates are available for the entire dataset. Our goal is to sample a relatively small portion of the dataset on which the expensive responses will be measured, such that the resulting sampling estimator is statistically efficient. Measurement constraints require that the sampling probabilities depend on at most a very small set of the responses. A sampling procedure that uses responses on at most a small pilot sample will be called response-free. We propose a response-free sampling procedure (OSUMC) for generalized linear models (GLMs). Under the A-optimality criterion, i.e., the trace of the asymptotic variance matrix, the resulting estimator is statistically efficient within a class of sampling estimators. We establish the unconditional asymptotic distribution of a general class of response-free sampling estimators. This result is novel compared with the existing conditional results obtained by conditioning on both covariates and responses. Under our unconditional framework, the subsamples are no longer independent, and new martingale techniques are developed for our asymptotic theory. We further derive the A-optimal response-free sampling distribution. Since this distribution depends on population-level quantities, we propose the Optimal Sampling Under Measurement Constraints (OSUMC) algorithm to approximate the theoretical optimal sampling scheme. Finally, we conduct an extensive empirical study to demonstrate the advantages of the OSUMC algorithm over existing methods from both statistical and computational perspectives.
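To illustrate the two-stage pipeline, here is a minimal sketch of response-free subsampling for logistic regression: a small uniform pilot sample is the only place responses enter the sampling probabilities, which otherwise depend on covariates alone. The probability formula below is a pilot-based heuristic in the spirit of A-optimality, not the paper's exact OSUMC weights; all names and sizes are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, n_pilot, n_sub = 10000, 5, 200, 500
X = rng.standard_normal((n, d))
beta_true = rng.standard_normal(d)
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))  # "expensive" responses

# Stage 1: measure responses on a small uniform pilot sample only.
pilot = rng.choice(n, n_pilot, replace=False)
beta_pilot = LogisticRegression(C=1e6).fit(X[pilot], y[pilot]).coef_.ravel()
# (large C ~ effectively unpenalized fit)

# Stage 2: response-free probabilities from covariates and the pilot fit:
# GLM variance weight sqrt(p(1-p)) times the covariate norm.
p = 1 / (1 + np.exp(-X @ beta_pilot))
scores = np.sqrt(p * (1 - p)) * np.linalg.norm(X, axis=1)
probs = scores / scores.sum()

# Measure responses only on the subsample; fit with inverse-probability weights.
idx = rng.choice(n, n_sub, replace=True, p=probs)
est = LogisticRegression(C=1e6).fit(X[idx], y[idx],
                                    sample_weight=1 / probs[idx])
print(est.coef_.ravel())
```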