Coalescent theory combined with statistical modeling allows us to estimate effective population size fluctuations from molecular sequences of individuals sampled from a population of interest. When sequences are sampled serially through time and the distribution of the sampling times depends on the effective population size, explicit statistical modeling of sampling times improves population size estimation. Previous work assumed that the genealogy relating sampled sequences is known and modeled sampling times as an inhomogeneous Poisson process with log-intensity equal to a linear function of the log-transformed effective population size. We improve this approach in two ways. First, we extend the method to allow for joint Bayesian estimation of the genealogy, effective population size trajectory, and other model parameters. Next, we improve the sampling time model by incorporating additional sources of information in the form of time-varying covariates. We validate our new modeling framework using a simulation study and apply our new methodology to analyses of population dynamics of seasonal influenza and to the recent Ebola virus outbreak in West Africa.
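The sampling-time model described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the coefficient names `beta0`, `beta1`, and `deltas`, the function names, and the exponential-growth trajectory are all assumptions introduced here. The log-intensity is taken to be linear in log N_e(t), with optional time-varying covariate terms, and sampling times are drawn by thinning a homogeneous Poisson process.

```python
import math
import random


def log_sampling_intensity(log_ne, beta0, beta1, covariates=(), deltas=()):
    """Log-intensity: linear in log Ne(t), plus optional covariate terms.

    beta0, beta1 and deltas are hypothetical coefficient names.
    """
    return beta0 + beta1 * log_ne + sum(d * x for d, x in zip(deltas, covariates))


def simulate_sampling_times(ne_fn, beta0, beta1, t_max, lam_max, rng):
    """Draw sampling times on (0, t_max] by thinning a rate-lam_max Poisson process."""
    times = []
    t = 0.0
    while True:
        t += rng.expovariate(lam_max)  # next candidate event time
        if t > t_max:
            return times
        lam = math.exp(log_sampling_intensity(math.log(ne_fn(t)), beta0, beta1))
        if lam > lam_max:
            raise ValueError("lam_max must bound the intensity on [0, t_max]")
        if rng.random() < lam / lam_max:  # accept with probability lam(t) / lam_max
            times.append(t)


# Hypothetical trajectory: exponential growth, Ne(t) = 10 * exp(0.5 * t).
rng = random.Random(1)
ne = lambda t: 10.0 * math.exp(0.5 * t)
times = simulate_sampling_times(ne, beta0=0.0, beta1=1.0, t_max=4.0, lam_max=80.0, rng=rng)
```

With `beta1 > 0`, more samples are drawn when the population is larger, which is why the sampling times themselves carry information about the trajectory.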
SARS-CoV-2, like any other virus, continues to mutate as it spreads, according to an evolutionary process. Unlike for any other virus, several million SARS-CoV-2 sequences are already available in public databases such as GISAID. This amount of data has the potential to uncover the evolutionary dynamics of a virus like never before. However, a million sequences is already several orders of magnitude beyond what can be processed by the traditional methods designed to reconstruct a virus's evolutionary history, such as those that build a phylogenetic tree. Hence, new and scalable methods will need to be devised in order to make use of the ever-increasing number of viral sequences being collected. Since identifying variants is an important part of understanding the evolution of a virus, in this paper we propose an approach based on clustering sequences to identify the current major SARS-CoV-2 variants. Using $k$-mer based feature vector generation and efficient feature selection methods, our approach is effective in identifying variants, as well as being efficient and scalable to millions of sequences. Such a clustering method allows us to show the relative proportion of each variant over time, giving the rate of spread of each variant in different locations, which is important for vaccine development and distribution. We also compute the importance of each amino acid position of the spike protein in identifying a given variant in terms of information gain. Positions of high variant-specific importance tend to agree with those reported by the USA's Centers for Disease Control and Prevention (CDC), further supporting our approach.
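The two ingredients named above, $k$-mer feature vectors and per-position information gain, can be illustrated with a small sketch. The function names, the toy sequences, and the plain count-based encoding are assumptions made here for illustration; the paper's actual feature selection and clustering pipeline is not reproduced.

```python
import math
from collections import Counter, defaultdict
from itertools import product


def kmer_vector(seq, k, alphabet):
    """Count every length-k substring of seq, in a fixed alphabet order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(seqs, labels, pos):
    """H(variant) - H(variant | residue at position pos), for aligned sequences."""
    groups = defaultdict(list)
    for s, y in zip(seqs, labels):
        groups[s[pos]].append(y)
    cond = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - cond


# Toy aligned sequences with hypothetical variant labels.
seqs = ["AK", "AR", "AK", "AR"]
labels = ["x", "y", "x", "y"]
vec = kmer_vector("ACGA", 2, "ACGT")
```

In this toy example position 0 is identical across variants and carries no information, while position 1 separates the two labels perfectly. The vector length grows as the alphabet size raised to the power `k`, which is one reason feature selection matters at scale.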
We propose a method that uses genetic data to test for the occurrence of a recent range expansion and to infer the location of the origin of the expansion. We introduce a statistic for pairs of populations, $\psi$ (the directionality index), that detects asymmetries in the two-dimensional allele frequency spectrum caused by the series of founder events that happen during an expansion. Such asymmetry arises because low-frequency alleles tend to be lost during founder events, thus creating clines in the frequencies of surviving low-frequency alleles. Using simulations, we further show that $\psi$ is more powerful for detecting range expansions than both $F_{ST}$ and clines in heterozygosity. We illustrate the utility of $\psi$ by applying it to a data set from modern humans and show how we can include more complicated scenarios such as multiple expansion origins or barriers to migration in the model.
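The abstract does not state the formula for the directionality index, so the sketch below is only one simple way to measure asymmetry in a two-dimensional allele frequency spectrum, not necessarily the authors' definition. It assumes equal sample sizes in both populations, a square count matrix `sfs[i][j]` (number of sites with `i` derived copies in population 1 and `j` in population 2), and restricts attention to alleles present in both populations; the function name is hypothetical.

```python
def directionality_sketch(sfs):
    """Mean derived-allele frequency difference over alleles shared by both populations.

    sfs[i][j] counts sites with i derived copies in population 1 and j in
    population 2, out of n sampled copies each (so sfs is (n+1) x (n+1)).
    Illustrative asymmetry measure only; the normalisation is an assumption.
    """
    n = len(sfs) - 1
    weighted = 0.0
    total = 0
    for i in range(1, n + 1):      # skip alleles absent from population 1
        for j in range(1, n + 1):  # skip alleles absent from population 2
            weighted += sfs[i][j] * (i - j) / n
            total += sfs[i][j]
    if total == 0:
        raise ValueError("no alleles shared by both populations")
    return weighted / total


# A symmetric spectrum gives zero; excess frequency in one population shifts the sign.
symmetric = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]
skewed = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]
```

Under this sign convention a negative value means shared alleles sit at higher frequency in population 2; how that maps onto expansion direction depends on the model in the paper.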
In this paper, new techniques that allow conditional entropy to estimate the combinatorics of symbols are applied to animal communication studies to estimate communication repertoire size. By using the conditional entropy estimates at multiple orders, the paper estimates the total repertoire sizes for animal communication across bottlenose dolphins, humpback whales, and several species of birds for N-grams of length one to three. In addition to discussing the impact of this method on studies of animal communication complexity, the reliability of these estimates is compared to other methods through simulation. While entropy does undercount the total repertoire size due to rare N-grams, it gives a more accurate picture of the most frequently used repertoire than repertoire size alone.
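As a rough illustration of the idea, an entropy-based repertoire estimate can be read as an effective number of N-grams, 2 raised to the N-gram entropy in bits. The sketch below uses simple plug-in frequency estimates; the function names and the choice of estimator are assumptions, and the paper's own estimator may differ.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """All overlapping N-grams of a symbol sequence."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def block_entropy(seq, n):
    """Plug-in Shannon entropy (bits) of the N-gram distribution."""
    counts = Counter(ngrams(seq, n))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(seq, n):
    """Entropy of the n-th symbol given the previous n - 1 (order 1 is unconditional)."""
    if n == 1:
        return block_entropy(seq, 1)
    return block_entropy(seq, n) - block_entropy(seq, n - 1)


def effective_repertoire(seq, n):
    """Effective number of N-grams implied by the entropy: 2 ** H_N."""
    return 2.0 ** block_entropy(seq, n)
```

For a sequence such as `"AAAB"`, the effective unigram repertoire is below the two distinct symbols observed, because the rare symbol contributes little entropy. This mirrors the undercounting of rare N-grams noted above.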
There is an urgent and well-recognized need to extend genetic studies to diverse populations, but several obstacles continue to be prohibitive, including (but not limited to) the difficulty of recruiting individuals from diverse populations in large numbers and the lack of representation in available genomic references. These obstacles notwithstanding, studying multiple diverse populations would provide informative, population-specific insights. Using Native Hawaiians as an example of an understudied population with a unique evolutionary history, I will argue that by developing key genomic resources and integrating evolutionary thinking into genetic epidemiology, we will have the opportunity to efficiently advance our knowledge of genetic risk factors, reduce health disparities, and improve healthcare in this underserved population.