Genetic sequence data are well described by hidden Markov models (HMMs) in which latent states correspond to clusters of similar mutation patterns. Theory from statistical genetics suggests that these HMMs are nonhomogeneous (their transition probabilities vary along the chromosome) and have large support for self-transitions. We develop a new nonparametric model of genetic sequence data, based on the hierarchical Dirichlet process, which supports these self-transitions and nonhomogeneity. Our model provides a parameterization of the genetic process that is more parsimonious than other, more general nonparametric models that have previously been applied to population genetics. We provide truncation-free MCMC inference for our model using a new auxiliary sampling scheme for Bayesian nonparametric HMMs. In a series of experiments on male X chromosome data from the Thousand Genomes Project, and on data simulated from a population bottleneck, we show the benefits of our model over the popular finite model fastPHASE, which can itself be seen as a parametric truncation of our model. We find that the number of HMM states found by our model is correlated with the time to the most recent common ancestor in population bottlenecks. This work demonstrates the flexibility of Bayesian nonparametrics applied to large and complex genetic data.
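To make the two structural properties concrete, the following is a minimal sketch of a finite, nonhomogeneous HMM transition matrix with strong self-transitions. It assumes a fastPHASE-style form in which the chain keeps its state with probability exp(-rate × distance) and otherwise redraws a state from a set of mixing weights. It is not the hierarchical Dirichlet process construction itself, and the names `transition_matrix`, `jump_rate`, `distance`, and `mixing_weights` are hypothetical.

```python
import math


def transition_matrix(jump_rate, distance, mixing_weights):
    """Return a row-stochastic transition matrix between two adjacent sites.

    Assumed form (illustrative only): with probability
    exp(-jump_rate * distance) the chain keeps its current state; otherwise
    it redraws a state from mixing_weights, which may pick the same state.
    """
    stay = math.exp(-jump_rate * distance)
    k = len(mixing_weights)
    return [
        [(stay if i == j else 0.0) + (1.0 - stay) * mixing_weights[j]
         for j in range(k)]
        for i in range(k)
    ]


# Hypothetical mixing weights over three latent clusters.
weights = [0.5, 0.3, 0.2]

# Nonhomogeneity: the matrix depends on the distance between sites.
near = transition_matrix(jump_rate=0.1, distance=1.0, mixing_weights=weights)
far = transition_matrix(jump_rate=0.1, distance=50.0, mixing_weights=weights)
```

For closely spaced sites the diagonal dominates, so self-transitions carry most of the probability mass. As the distance grows, each row approaches the mixing weights.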
Phylodynamics focuses on the problem of reconstructing past population size dynamics from current genetic samples taken from the population of interest. This technique has been extensively used in many areas of biology, but is particularly useful for
Full likelihood inference under Kingman's coalescent is a computationally challenging problem to which importance sampling (IS) and the product of approximate conditionals (PAC) method have been applied successfully. Both methods can be expressed in t
Multiple imputation has become one of the most popular approaches for handling missing data in statistical analyses. Part of this success is due to Rubin's simple combination rules. These give frequentist valid inferences when the imputation and analy
Data from NASA's Orbiting Carbon Observatory-2 (OCO-2) satellite is essential to many carbon management strategies. A retrieval algorithm is used to estimate CO2 concentration using the radiance data measured by OCO-2. However, due to factors such as
Distribution network operators (DNOs) are increasingly concerned about the impact of low carbon technologies on the low voltage (LV) networks. More advanced metering infrastructures provide numerous opportunities for more accurate load flow analysis