Genetic sequence data are well described by hidden Markov models (HMMs) in which latent states correspond to clusters of similar mutation patterns. Theory from statistical genetics suggests that these HMMs are nonhomogeneous (their transition probabilities vary along the chromosome) and have large support for self-transitions. We develop a new nonparametric model of genetic sequence data, based on the hierarchical Dirichlet process, which supports these self-transitions and nonhomogeneity. Our model provides a parameterization of the genetic process that is more parsimonious than other, more general nonparametric models that have previously been applied to population genetics. We provide truncation-free MCMC inference for our model using a new auxiliary sampling scheme for Bayesian nonparametric HMMs. In a series of experiments on male X chromosome data from the Thousand Genomes Project, and on data simulated from a population bottleneck, we show the benefits of our model over the popular finite model fastPHASE, which can itself be seen as a parametric truncation of our model. We find that the number of HMM states found by our model is correlated with the time to the most recent common ancestor in population bottlenecks. This work demonstrates the flexibility of Bayesian nonparametrics applied to large and complex genetic data.
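To fix ideas, the following is a minimal generative sketch of a hierarchical Dirichlet process HMM with a self-transition bias, using a finite stick-breaking truncation purely for illustration. The hyperparameters (gamma, alpha, kappa) and the truncation level are illustrative choices, not the parameterization used in the paper, whose model is also nonhomogeneous along the chromosome; that structure is not reproduced here.

```python
# A minimal sketch of a "sticky" HDP-HMM prior, truncated for illustration.
# kappa adds extra mass to self-transitions, encouraging long state runs.
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(gamma, K):
    """Truncated GEM(gamma) stick-breaking weights."""
    v = rng.beta(1.0, gamma, size=K)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return w / w.sum()

def sticky_hdp_hmm_sample(T, gamma=5.0, alpha=1.0, kappa=10.0, K_trunc=30):
    """Draw a latent state path of length T from a (truncated) sticky HDP-HMM."""
    beta = stick_breaking(gamma, K_trunc)            # global state weights
    pi = np.empty((K_trunc, K_trunc))
    for k in range(K_trunc):
        # per-state transition distribution, biased towards self-transition k
        pi[k] = rng.dirichlet(alpha * beta + kappa * (np.arange(K_trunc) == k))
    z = np.empty(T, dtype=int)
    z[0] = rng.choice(K_trunc, p=beta)
    for t in range(1, T):
        z[t] = rng.choice(K_trunc, p=pi[z[t - 1]])
    return z

path = sticky_hdp_hmm_sample(T=1000)
print("states used:", len(np.unique(path)))
```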
Phylodynamics focuses on the problem of reconstructing past population size dynamics from current genetic samples taken from the population of interest. This technique has been extensively used in many areas of biology, but it is particularly useful for studying the spread of quickly evolving infectious disease agents, e.g., the influenza virus. Phylodynamic inference uses a coalescent model that defines a probability density for the genealogy of randomly sampled individuals from the population. When such a genealogy is assumed known, the coalescent model, equipped with a Gaussian process prior on the population size trajectory, allows for nonparametric Bayesian estimation of population size dynamics. While this approach is quite powerful, large data sets collected during infectious disease surveillance challenge the state of the art of Bayesian phylodynamics and demand computationally more efficient inference frameworks. To satisfy this demand, we provide a computationally efficient Bayesian inference framework based on Hamiltonian Monte Carlo for coalescent process models. Moreover, we show that by splitting the Hamiltonian function we can further improve the efficiency of this approach. Using several simulated and real datasets, we show that our method provides accurate estimates of population size dynamics and is substantially faster than alternative methods based on the elliptical slice sampler and the Metropolis-adjusted Langevin algorithm.
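As a point of reference, here is a minimal sketch of a single Hamiltonian Monte Carlo update for a vector of log-population-size values, assuming user-supplied callables `log_post` (coalescent log-likelihood plus Gaussian-process log-prior) and `grad_log_post` for its gradient; both names are placeholders. The step size and path length are arbitrary, and the Hamiltonian-splitting refinement described in the abstract is not shown.

```python
# One HMC proposal with leapfrog integration and a Metropolis correction.
import numpy as np

rng = np.random.default_rng(1)

def hmc_step(theta, log_post, grad_log_post, eps=0.05, n_leapfrog=20):
    """Return the next state of the chain given the current state `theta`."""
    p0 = rng.standard_normal(theta.shape)           # resample momentum
    q, p = theta.copy(), p0.copy()
    p += 0.5 * eps * grad_log_post(q)               # half step for momentum
    for _ in range(n_leapfrog - 1):
        q += eps * p                                # full step for position
        p += eps * grad_log_post(q)                 # full step for momentum
    q += eps * p
    p += 0.5 * eps * grad_log_post(q)               # final half step
    # accept/reject on the joint (position, momentum) energy
    log_accept = (log_post(q) - 0.5 * p @ p) - (log_post(theta) - 0.5 * p0 @ p0)
    return q if np.log(rng.uniform()) < log_accept else theta

# toy check on a standard normal target (not a coalescent posterior)
theta = np.zeros(3)
for _ in range(500):
    theta = hmc_step(theta, lambda q: -0.5 * q @ q, lambda q: -q)
```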
Full likelihood inference under Kingman's coalescent is a computationally challenging problem to which importance sampling (IS) and the product of approximate conditionals (PAC) method have been applied successfully. Both methods can be expressed in terms of families of intractable conditional sampling distributions (CSDs), and rely on principled approximations for accurate inference. Recently, more general $\Lambda$- and $\Xi$-coalescents have been observed to provide better modelling fits to some genetic data sets. We derive families of approximate CSDs for finite-sites $\Lambda$- and $\Xi$-coalescents, and use them to obtain approximately optimal IS and PAC algorithms for $\Lambda$-coalescents, yielding substantial gains in efficiency over existing methods.
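For readers unfamiliar with PAC likelihoods, the sketch below shows the generic construction: a pseudo-likelihood built as a product of conditional sampling densities, averaged over random orderings of the sampled haplotypes. The callable `csd` is a placeholder for an approximate CSD; the finite-sites $\Lambda$- and $\Xi$-coalescent CSDs derived in the paper are not reproduced here.

```python
# Generic PAC pseudo-likelihood, averaged over random haplotype orderings.
import numpy as np

rng = np.random.default_rng(2)

def pac_log_likelihood(haplotypes, csd, n_orderings=10):
    """log PAC pseudo-likelihood.

    haplotypes : list of sequences
    csd(h, conditioned) : approximate probability of haplotype h given the
                          list `conditioned` of previously sampled haplotypes
                          (the marginal when `conditioned` is empty)
    """
    n = len(haplotypes)
    log_vals = []
    for _ in range(n_orderings):
        order = rng.permutation(n)
        log_l = 0.0
        for i, idx in enumerate(order):
            conditioned = [haplotypes[j] for j in order[:i]]
            log_l += np.log(csd(haplotypes[idx], conditioned))
        log_vals.append(log_l)
    # average the likelihoods (not the logs) in a numerically stable way
    return np.logaddexp.reduce(np.array(log_vals)) - np.log(n_orderings)
```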
Multiple imputation has become one of the most popular approaches for handling missing data in statistical analyses. Part of this success is due to Rubin's simple combination rules. These give frequentist-valid inferences when the imputation and analysis procedures are so-called congenial and the complete-data analysis is valid, but may not otherwise. Roughly speaking, congeniality corresponds to whether the imputation and analysis models make different assumptions about the data. In practice, imputation and analysis procedures are often not congenial, such that tests may not have the correct size and confidence interval coverage may deviate from the advertised level. We examine a number of recent proposals which combine bootstrapping with multiple imputation, and determine which are valid under uncongeniality and model misspecification. Imputation followed by bootstrapping generally does not result in valid variance estimates under uncongeniality or misspecification, whereas bootstrapping followed by imputation does. We recommend a particular computationally efficient variant of bootstrapping followed by imputation.
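The following is a minimal sketch of the bootstrap-then-impute recipe in general form: resample the incomplete data, multiply impute each bootstrap sample, analyse each imputed copy, and take the bootstrap variance of the per-replicate pooled estimates. The functions `impute(data, rng)` and `analyse(data)` are placeholders for a user's own imputation and analysis procedures, and the settings shown are not the specific computationally efficient variant recommended in the paper.

```python
# Bootstrap followed by multiple imputation, with bootstrap-based variance.
import numpy as np

def bootstrap_then_impute(data, impute, analyse, n_boot=200, n_imp=2, seed=0):
    """data: NumPy array with one row per subject (missing values allowed)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        boot = data[rng.integers(0, n, size=n)]          # resample rows
        # pool analysis estimates across imputed copies of this resample
        est = np.mean([analyse(impute(boot, rng)) for _ in range(n_imp)], axis=0)
        estimates.append(est)
    estimates = np.asarray(estimates)
    point = estimates.mean(axis=0)
    var = estimates.var(axis=0, ddof=1)                  # bootstrap variance
    ci = np.percentile(estimates, [2.5, 97.5], axis=0)   # percentile interval
    return point, var, ci
```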
Data from NASA's Orbiting Carbon Observatory-2 (OCO-2) satellite are essential to many carbon management strategies. A retrieval algorithm is used to estimate CO2 concentration from the radiance data measured by OCO-2. However, due to factors such as cloud cover and cosmic rays, the spatial coverage of the retrieval algorithm is limited in some areas of critical importance for carbon cycle science. Mixed land/water pixels along the coastline are also excluded from the retrieval processing due to the lack of valid ancillary variables, including land fraction. We propose an approach to modeling spatial spectral data that addresses these two problems through radiance imputation and land fraction estimation. The spectral observations are modeled as spatially indexed functional data with footprint-specific parameters and are reduced to much lower dimensions by functional principal component analysis. The principal component scores are modeled as random fields to account for spatial dependence, and missing spectral observations are imputed by kriging the principal component scores. The proposed method is shown to impute spectral radiance with high accuracy for observations over the Pacific Ocean. An unmixing approach based on this model provides substantially more accurate land fraction estimates in our validation study along the coastline of Greece.
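A minimal sketch of the impute-by-kriging-scores idea follows: reduce complete spectra with an ordinary PCA (a stand-in for the functional PCA and footprint-specific modelling in the paper), treat each leading score as a spatial field with an exponential covariance, krige the scores at footprints with missing spectra, and reconstruct the radiance. The kernel, range parameter, nugget, and number of components are illustrative assumptions, not the paper's estimates.

```python
# Impute missing spectra by kriging leading principal-component scores.
import numpy as np

def krige(coords_obs, y_obs, coords_new, length_scale=1.0, nugget=1e-6):
    """Simple kriging of a zero-mean field with an exponential covariance."""
    def cov(a, b):
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        return np.exp(-d / length_scale)
    K = cov(coords_obs, coords_obs) + nugget * np.eye(len(coords_obs))
    k_star = cov(coords_new, coords_obs)
    return k_star @ np.linalg.solve(K, y_obs)

def impute_spectra(spectra_obs, coords_obs, coords_mis, n_pc=4):
    """spectra_obs: (n_obs, n_channels); coords_*: (n, 2) footprint locations."""
    mean = spectra_obs.mean(axis=0)
    centred = spectra_obs - mean
    # PCA via SVD: rows are footprints, columns are spectral channels
    U, s, Vt = np.linalg.svd(centred, full_matrices=False)
    scores_obs = centred @ Vt[:n_pc].T                   # observed PC scores
    scores_mis = np.column_stack([
        krige(coords_obs, scores_obs[:, j], coords_mis) for j in range(n_pc)
    ])
    return mean + scores_mis @ Vt[:n_pc]                 # reconstructed spectra
```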
Distribution network operators (DNOs) are increasingly concerned about the impact of low carbon technologies on the low voltage (LV) networks. More advanced metering infrastructures provide numerous opportunities for more accurate load flow analysis of the LV networks. However, such data may not be readily available to DNOs and are, in any case, likely to be expensive. Modelling tools are therefore required which can provide realistic, yet accurate, load profiles as input for a network modelling tool, without needing access to large amounts of monitored customer data. In this paper we outline some simple methods for accurately modelling a large number of unmonitored residential customers at the LV level. We do this by a process we call buddying, which models unmonitored customers by assigning them load profiles from a limited sample of monitored customers who have smart meters. The presented method therefore requires access to only a relatively small amount of domestic customer data. The method is efficiently optimised using a genetic algorithm to minimise a weighted cost function that balances matching the substation data against matching the individual mean daily demands, which also allows us to show the effectiveness of substation monitoring in LV network modelling. Using real LV network modelling, we show that our methods perform significantly better than a comparative Monte Carlo approach, and we provide a description of the peak demand behaviour.
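To illustrate, here is a minimal sketch of buddying as a genetic algorithm: each unmonitored customer is assigned one monitored smart-meter profile, and the assignment is evolved to trade off matching the substation profile against matching each customer's estimated mean demand. The population size, mutation rate, crossover scheme, and cost weighting `w` are illustrative choices, not the settings used in the paper.

```python
# Buddying unmonitored customers to monitored load profiles via a simple GA.
import numpy as np

rng = np.random.default_rng(3)

def cost(assign, profiles, substation, target_means, w=0.5):
    """Weighted cost: substation mismatch vs per-customer mean-demand mismatch."""
    agg = profiles[assign].sum(axis=0)                       # aggregated load
    sub_err = np.abs(agg - substation).sum()
    mean_err = np.abs(profiles[assign].mean(axis=1) - target_means).sum()
    return w * sub_err + (1.0 - w) * mean_err

def buddy_ga(profiles, substation, target_means, n_customers,
             pop_size=50, n_gen=200, mut_rate=0.05):
    """profiles: (n_monitored, n_timesteps); returns one profile index per customer."""
    n_profiles = len(profiles)
    pop = rng.integers(0, n_profiles, size=(pop_size, n_customers))
    for _ in range(n_gen):
        fit = np.array([cost(a, profiles, substation, target_means) for a in pop])
        parents = pop[np.argsort(fit)[: pop_size // 2]]      # keep the better half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(0, len(parents), size=2)]
            cut = rng.integers(1, n_customers)
            child = np.concatenate([a[:cut], b[cut:]])       # one-point crossover
            flip = rng.random(n_customers) < mut_rate        # random re-assignment
            child[flip] = rng.integers(0, n_profiles, size=flip.sum())
            children.append(child)
        pop = np.vstack([parents, children])
    return min(pop, key=lambda a: cost(a, profiles, substation, target_means))
```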