Freshwater Unionid bivalves have recently faced ecological upheaval through pollution, barriers to dispersal, human harvesting, and changes in fish-host prevalence. Currently, over 70% of species are threatened, endangered or extinct. To characterize the genetic response to these recent selective pressures, we collected population genetic data for one successful bivalve species, Megalonaias nervosa. We identify megabase-sized regions that are nearly monomorphic across the population, a signal of strong, recent selection reshaping genetic diversity. These signatures of selection encompass a total of 73 Mb, a greater response to selection than is commonly seen in population genetic models. We observe 102 duplicate genes with high dN/dS on terminal branches among regions with sweeps, suggesting that gene duplication is a causative mechanism of recent adaptation in M. nervosa. Genes in sweeps reflect functional classes known to be important for Unionid survival, including anticoagulation genes important for fish host parasitization, detoxification genes, mitochondria management, and shell formation. We identify selective sweeps in regions with no known functional impacts, suggesting mechanisms of adaptation that deserve greater attention in future work on species survival. In contrast, polymorphic transposable element (TE) insertions appear to be detrimental and underrepresented among regions with sweeps. TE site frequency spectra are skewed toward singleton variants, and TEs among regions with sweeps are present only at low frequency. Our work suggests that duplicate genes are an essential source of genetic novelty that has helped this species succeed in environments where others have struggled. These results suggest that gene duplications deserve greater attention in non-model population genomics, especially in species that have recently faced sudden environmental challenges.
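To make the site frequency spectrum (SFS) statement in the abstract above concrete, the following minimal sketch tabulates an unfolded SFS from presence/absence calls and reports the share of singletons. The function names, the 0/1 input layout, and the toy data are assumptions made for illustration; they are not taken from the study.

```python
from collections import Counter


def site_frequency_spectrum(calls_per_variant, n_chromosomes):
    """Return the unfolded SFS as a list.

    calls_per_variant: one list of 0/1 calls per variant (hypothetical layout),
    with one call per sampled chromosome.
    Entry i-1 of the result is the number of variants carried by exactly i
    chromosomes, for i = 1 .. n_chromosomes - 1 (segregating sites only).
    """
    counts = Counter(sum(calls) for calls in calls_per_variant)
    return [counts.get(i, 0) for i in range(1, n_chromosomes)]


def singleton_fraction(sfs):
    """Fraction of segregating variants that appear on a single chromosome."""
    total = sum(sfs)
    return sfs[0] / total if total else 0.0


# Toy data (made up): five insertions scored on six chromosomes.
toy_calls = [
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
]
toy_sfs = site_frequency_spectrum(toy_calls, n_chromosomes=6)
toy_singletons = singleton_fraction(toy_sfs)
```

In this toy example three of the five insertions are singletons, so the spectrum is skewed toward the lowest frequency class, which is the qualitative pattern the abstract describes for TEs.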
The recent genealogical history of human populations is a complex mosaic formed by individual migration, large-scale population movements, and other demographic events. Population genomics datasets can provide a window into this recent history, as rare traces of recent shared genetic ancestry are detectable due to long segments of shared genomic material. We make use of genomic data for 2,257 Europeans (the POPRES dataset) to conduct one of the first surveys of recent genealogical ancestry over the past three thousand years at a continental scale. We detected 1.9 million shared genomic segments, and used the lengths of these to infer the distribution of shared ancestors across time and geography. We find that a pair of modern Europeans living in neighboring populations share around 10-50 genetic common ancestors from the last 1500 years, and upwards of 500 genetic ancestors from the previous 1000 years. These numbers drop off exponentially with geographic distance, but since genetic ancestry is rare, individuals from opposite ends of Europe are still expected to share millions of common genealogical ancestors over the last 1000 years. There is substantial regional variation in the number of shared genetic ancestors: especially high numbers of common ancestors between many eastern populations likely date to the Slavic and/or Hunnic expansions, while much lower levels of common ancestry in the Italian and Iberian peninsulas may indicate weaker demographic effects of Germanic expansions into these areas and/or more stably structured populations. Recent shared ancestry in modern Europeans is ubiquitous, and clearly shows the impact of both small-scale migration and large historical events. Population genomic datasets have considerable power to uncover recent demographic history, and will allow a much fuller picture of the close genealogical kinship of individuals across the world.
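The link between segment length and ancestor age used in the abstract above can be illustrated with a common textbook approximation. It is not the inference procedure of the study: a segment inherited by two people from an ancestor `n` generations back has passed through roughly `2n` meioses, and its genetic length is treated here as exponentially distributed with mean `100 / (2n)` centimorgans. The function names and the exponential assumption are choices made for this sketch.

```python
import math


def mean_segment_length_cm(generations):
    """Approximate mean length (cM) of a segment shared through one ancestor.

    Assumes 2 * generations meioses on the path joining the two descendants
    and one expected crossover per Morgan per meiosis.
    """
    meioses = 2 * generations
    return 100.0 / meioses


def prob_segment_longer_than(length_cm, generations):
    """P(segment length > length_cm) under the exponential approximation."""
    meioses = 2 * generations
    return math.exp(-meioses * length_cm / 100.0)


# Older ancestors leave shorter segments, so long segments point to recent ancestry.
recent = prob_segment_longer_than(5.0, generations=10)
ancient = prob_segment_longer_than(5.0, generations=100)
```

Under these assumptions a shared segment of several centimorgans is far more likely to descend from an ancestor a few hundred years back than from one a few thousand years back, which is why segment lengths carry information about when common ancestors lived.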
Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (D4M), an associative array environment for MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and statistical properties, the method leverages big data techniques and an Apache Accumulo database to accelerate computations one hundredfold over other methods. Comparisons of the D4M method with the current gold standard for sequence analysis, BLAST, show that the two are comparable in the alignments they find. This paper presents an overview of the D4M genetic sequence algorithm and statistical comparisons with BLAST.
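The abstract above does not spell out the algorithm, so the following is only a rough sketch of the general idea of matching sequences through an associative array keyed by subsequences. It uses plain Python dictionaries rather than D4M or Accumulo, and the function names, the word length `k`, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Map each k-mer to the names of the reference sequences containing it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def shared_kmer_counts(query, index, k):
    """Count, per reference, how many distinct query k-mers it shares."""
    hits = Counter()
    for kmer in kmers(query, k):
        # .get avoids creating empty entries in the defaultdict on a miss.
        for name in index.get(kmer, ()):
            hits[name] += 1
    return hits


# Toy data (made up): two short references and one query.
references = {"refA": "ACGTACGTGA", "refB": "TTTTGGGGCC"}
index = build_index(references, k=4)
hits = shared_kmer_counts("CGTACGTG", index, k=4)
```

Counting shared k-mers this way is equivalent to multiplying a query-by-k-mer incidence matrix with a k-mer-by-reference incidence matrix, which is the kind of operation an associative array environment is designed to express compactly.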
In this work it is shown that the 20 canonical amino acids (AAs) within the genetic code appear to be a whole system with a strict distinction in the Genetic Code Table (GCT) into several different quanta: 20, 23, 61 amino acid molecules. This distinction among molecules is accompanied by specific balanced distinctions in atom number and/or nucleon number within those molecules. In this second version two appendices are added, along with a new version of the Periodic system of numbers, whose first version is given in arXiv:1107.1998 [q-bio.OT].
In most of the recent immunological literature the differences across antigen receptor populations are examined via non-parametric statistical measures of species overlap and diversity borrowed from ecological studies. While this approach is robust in a wide range of situations, it seems to provide little insight into the underlying clonal size distribution and the overall mechanism differentiating the receptor populations. As a possible alternative, the current paper presents a parametric method which adjusts for data under-sampling and provides a unifying approach to the simultaneous comparison of multiple receptor groups by means of modern statistical tools of unsupervised learning. The parametric model is based on a flexible multivariate Poisson-lognormal distribution and is seen to be a natural generalization of the univariate Poisson-lognormal models used in ecological studies of biodiversity patterns. The procedure for evaluating model fit is described along with the public domain software developed to perform the necessary diagnostics. The model-driven analysis is seen to compare favorably with traditional methods when applied to data from T-cell receptors in transgenic mouse populations.
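As a small illustration of the univariate building block mentioned in the abstract above, the sketch below evaluates a Poisson-lognormal probability mass function by numerical integration: a count is Poisson with rate `exp(z)`, where `z` is normal with mean `mu` and standard deviation `sigma`. The multivariate model and the under-sampling adjustment of the paper are not reproduced here; the function name, the midpoint-rule integration, and the parameter values are assumptions made for this example.

```python
import math


def poisson_lognormal_pmf(k, mu, sigma, n_steps=1000):
    """P(X = k) when X | z ~ Poisson(exp(z)) and z ~ Normal(mu, sigma^2).

    The integral over z is approximated with a midpoint rule on
    [mu - 8 * sigma, mu + 8 * sigma]; mass outside that range is negligible.
    """
    lo = mu - 8.0 * sigma
    dz = 16.0 * sigma / n_steps
    total = 0.0
    for i in range(n_steps):
        z = lo + (i + 0.5) * dz
        # Work on the log scale to avoid overflow for large counts.
        log_poisson = k * z - math.exp(z) - math.lgamma(k + 1)
        log_normal = (-0.5 * ((z - mu) / sigma) ** 2
                      - math.log(sigma * math.sqrt(2.0 * math.pi)))
        total += math.exp(log_poisson + log_normal) * dz
    return total


# Sanity checks on illustrative parameters: the probabilities should sum to
# about 1 and the mean should be close to exp(mu + sigma^2 / 2).
mu, sigma = 0.0, 0.5
pmf = [poisson_lognormal_pmf(k, mu, sigma) for k in range(100)]
total_mass = sum(pmf)
mean_count = sum(k * p for k, p in enumerate(pmf))
```

The heavier right tail of this distribution, compared with a plain Poisson of the same mean, is what makes it a natural candidate for clone size data in which a few clones are very abundant and most are rare.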
Genetic studies of human traits have revolutionized our understanding of the variation between individuals, and opened the door for numerous breakthroughs in biology, medicine and other scientific fields. And yet, the ultimate promise of this area of research is still not fully realized. In this review, we highlight the major open problems that need to be solved to improve our understanding of the genetic variation underlying human traits, and by discussing these challenges provide a primer to the field. Our focus is on concrete analytical problems, both conceptual and technical in nature. We cover general issues in genetic studies such as population structure, epistasis and gene-environment interactions, data-related issues such as ethnic diversity and rare genetic variants, and specific challenges related to heritability estimates, genetic association studies and polygenic risk scores. We emphasize the interconnectedness of these open problems and suggest promising avenues to address them.