To identify genetic changes underlying dog domestication and reconstruct their early evolutionary history, we analyzed novel high-quality genome sequences of three gray wolves, one from each of three putative centers of dog domestication, two ancient dog lineages (Basenji and Dingo), and a golden jackal as an outgroup. We find that dogs and wolves diverged through a dynamic process involving population bottlenecks in both lineages and post-divergence gene flow, which confounds previous inferences of dog origins. In dogs, the domestication bottleneck was severe, involving a 17- to 49-fold reduction in population size, a much stronger bottleneck than estimated previously from less intensive sequencing efforts. A sharp bottleneck in wolves occurred soon after their divergence from dogs, implying that the pool of diversity from which dogs arose was far larger than that represented by modern wolf populations. Conditional on mutation rate, we narrow the plausible range for the date of initial dog domestication to an interval from 11 to 16 thousand years ago. This period predates the rise of agriculture, implying that the earliest dogs arose alongside hunter-gatherers rather than agriculturists. Regarding the geographic origin of dogs, we find that, surprisingly, none of the extant wolf lineages from putative domestication centers is more closely related to dogs than the others, and the sampled wolves instead form a sister monophyletic clade. This result, in combination with our finding of dog-wolf admixture during the process of domestication, suggests that a re-evaluation of past hypotheses of dog origin is necessary. Finally, we also detect signatures of selection, including evidence for selection on genes implicated in morphology, metabolism, and neural development. Uniquely, we find support for selective sweeps at regulatory sites, suggesting that gene regulatory changes played a critical role in dog domestication.
Efficient text indexing data structures have enabled large-scale genomic sequence analysis and are used to help solve problems ranging from assembly to read mapping. However, these data structures typically assume that the underlying reference text is static and will not change over the course of the queries being made. Some progress has been made in exploring how certain text indices, like the suffix array, may be updated, rather than rebuilt from scratch, when the underlying reference changes. Yet these update operations can be complex, difficult to implement, and subject to fairly pessimistic worst-case bounds. We present a novel data structure, SkipPatch, for maintaining a k-mer-based index over a dynamically changing genome. SkipPatch pairs a hash-based k-mer index with an indexable skip list that is used to efficiently maintain the set of edits that have been applied to the original genome. SkipPatch is fast in practice, significantly outperforming the dynamic extended suffix array in terms of update and query speed.
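
To make the idea of a dynamically updated k-mer index concrete, here is a minimal Python sketch. It is not SkipPatch itself: it handles only single-base substitutions, which do not shift coordinates, so it needs no skip list to track edits. The class name `KmerIndex` and its methods are hypothetical names chosen for this illustration. It shows only that a substitution at one position touches at most k entries of a hash-based index, so the index can be patched locally instead of rebuilt.

```python
from collections import defaultdict


class KmerIndex:
    """Toy hash-based k-mer index supporting in-place substitutions.

    Hypothetical illustration only; insertions and deletions (which shift
    coordinates and motivate SkipPatch's skip list) are not handled.
    """

    def __init__(self, genome, k):
        self.k = k
        self.genome = list(genome)
        self.index = defaultdict(set)  # k-mer -> set of start positions
        for pos in range(len(genome) - k + 1):
            self.index[genome[pos:pos + k]].add(pos)

    def _kmer_at(self, pos):
        return "".join(self.genome[pos:pos + self.k])

    def _affected(self, pos):
        # Start positions of every k-mer that overlaps position `pos`.
        first = max(0, pos - self.k + 1)
        last = min(pos, len(self.genome) - self.k)
        return range(first, last + 1)

    def substitute(self, pos, base):
        """Replace one base and patch only the k-mers that overlap it."""
        # Remove the stale entries for the overlapping k-mers.
        for start in self._affected(pos):
            kmer = self._kmer_at(start)
            self.index[kmer].discard(start)
            if not self.index[kmer]:
                del self.index[kmer]
        # Apply the edit, then index the new overlapping k-mers.
        self.genome[pos] = base
        for start in self._affected(pos):
            self.index[self._kmer_at(start)].add(start)

    def find(self, kmer):
        """Return the sorted start positions of `kmer` in the current genome."""
        return sorted(self.index.get(kmer, ()))


idx = KmerIndex("ACGTACGT", 3)
print(idx.find("ACG"))   # occurrences before the edit
idx.substitute(4, "T")   # genome becomes ACGTTCGT
print(idx.find("ACG"))   # occurrences after the edit
```

Supporting insertions and deletions would additionally require translating between original and edited coordinates, which is the role the abstract assigns to the indexable skip list.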
Recent genetic studies and whole-genome sequencing projects have greatly improved our understanding of human variation and clinically actionable genetic information. Smaller ethnic populations, however, remain underrepresented in both individual and large-scale sequencing efforts and hence present an opportunity to discover new variants of biomedical and demographic significance. This report describes the sequencing and analysis of a genome obtained from an individual of Serbian origin, introducing tens of thousands of previously unknown variants to the currently available pool. Ancestry analysis places this individual in close proximity to the Central and Eastern European populations, i.e., closest to Croatian, Bulgarian, and Hungarian individuals and, in terms of other Europeans, furthest from Ashkenazi Jewish, Spanish, Sicilian, and Baltic individuals. Our analysis confirmed gene flow between Neanderthal and ancestral pan-European populations, with similar contributions to the Serbian genome as those observed in other European groups. Finally, to assess the burden of potentially disease-causing or clinically relevant variation in the sequenced genome, we used manually curated genotype-phenotype association databases and variant-effect predictors. We identified several variants that have previously been associated with severe early-onset disease that is not evident in the proband, as well as variants that could yet prove to be clinically relevant to the proband over the next decades. The presence of numerous private and low-frequency variants, along with the observed and predicted disease-causing mutations in this genome, exemplifies some of the global challenges of genome interpretation, especially in the context of understudied ethnic groups.
We report a droplet microfluidic method to target and sort individual cells directly from complex microbiome samples, and to prepare these cells for bulk whole-genome sequencing without cultivation. We characterize this approach by recovering bacteria spiked into human stool samples at a ratio as low as 1:250 and by successfully enriching endogenous Bacteroides vulgatus to the level required for de novo assembly of high-quality genomes. While microbiome strains are increasingly in demand for biomedical applications, the vast majority of species and strains are uncultivated and without reference genomes. We address this shortcoming by encapsulating complex microbiome samples directly into microfluidic droplets and amplifying a target-specific genomic fragment using a custom molecular TaqMan probe. We separate the positive droplets by droplet sorting, selectively enriching single cells of the target strain. Finally, we present a protocol to purify the genomic DNA while specifically removing amplicons and cell debris for high-quality genome sequencing.
With the development of high-throughput sequencing technology, it has become possible to directly analyze mutation distribution in a genome-wide fashion, dissociating mutation rate measurements from the traditional underlying assumptions. Here, we sequenced several genomes of Escherichia coli from colonies obtained after chemical mutagenesis and observed a strikingly nonrandom distribution of the induced mutations. These include long stretches of exclusively G to A or C to T transitions along the genome and orders-of-magnitude intra- and inter-genomic differences in mutation density. Whereas most of these observations can be explained by the known features of enzymatic processes, others could reflect stochasticity in the molecular processes at the single-cell level. Our results demonstrate how analysis of the molecular records left in the genomes of the descendants of an individual mutagenized cell allows for genome-scale observations of fixation and segregation of mutations, as well as recombination events, in the single genome of their progenitor.
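
One way to look for the stretches of exclusively G to A or C to T transitions described above is to sort the called mutations by position and group consecutive ones that share the same substitution type. The Python sketch below does this under stated assumptions: the input format (tuples of position, reference base, alternate base), the function name `transition_runs`, the `min_len` threshold, and the toy data are all hypothetical and are not taken from the study.

```python
from itertools import groupby


def transition_runs(mutations, min_len=3):
    """Find maximal runs of consecutive mutations with one substitution type.

    `mutations` is an iterable of (position, ref_base, alt_base) tuples.
    Returns a list of (change, first_position, last_position, count) for
    every run containing at least `min_len` mutations.
    """
    ordered = sorted(mutations)  # sort by genomic position
    runs = []
    # groupby merges only adjacent items, so each group is one unbroken run.
    for (ref, alt), group in groupby(ordered, key=lambda m: (m[1], m[2])):
        group = list(group)
        if len(group) >= min_len:
            runs.append((f"{ref}>{alt}", group[0][0], group[-1][0], len(group)))
    return runs


# Toy data (hypothetical positions, for illustration only).
toy = [
    (100, "G", "A"), (250, "G", "A"), (400, "G", "A"),
    (900, "C", "T"), (1200, "C", "T"), (1500, "C", "T"), (1700, "C", "T"),
    (2000, "G", "A"),
]
for run in transition_runs(toy):
    print(run)
```

On real data, the run-length threshold would need to be set against the run lengths expected by chance given the overall proportions of the two transition types.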
Data on the number of Open Reading Frames (ORFs) coded by genomes from the three domains of life show some notable general features, including essential differences between Prokaryotes and Eukaryotes: the number of ORFs grows linearly with total genome size for the former, but only logarithmically for the latter. Assuming that the (protein) coding and non-coding fractions of the genome must have different dynamics and that the non-coding fraction must be controlled by a variety of (unspecified) probability distribution functions, we are able to predict that the number of ORFs for Eukaryotes follows a Benford distribution and has a specific logarithmic form. Using the data for 1000+ genomes available to us in early 2010, we find excellent fits to the data over several orders of magnitude, in the linear regime for the Prokaryote data, and the full non-linear form for the Eukaryote data. In their region of overlap the salient features are statistically congruent, which allows us to interpret the difference between Prokaryotes and Eukaryotes as the manifestation of the increased demand in the biological functions required for the larger Eukaryotes, estimate some minimal genome sizes, and predict a maximal Prokaryote genome size on the order of 8-12 megabasepairs. These results naturally allow a mathematical interpretation in terms of maximal entropy and, therefore, most efficient information transmission.
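
For readers unfamiliar with the Benford distribution, the Python sketch below computes the standard first-digit form of Benford's law, P(d) = log10(1 + 1/d) for d = 1..9, and compares it with the observed leading-digit frequencies of a sample. This illustrates only the textbook first-digit law; the specific logarithmic form that the abstract predicts for ORF counts is not given here and is not reproduced. The function names are hypothetical, and the sample (powers of two) is a stand-in, not ORF data.

```python
import math
from collections import Counter


def benford_expected():
    """Expected leading-digit probabilities under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(n):
    """Leading decimal digit of a nonzero integer."""
    return int(str(abs(n))[0])


def first_digit_frequencies(values):
    """Observed leading-digit frequencies for a collection of integers."""
    counts = Counter(first_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Stand-in sample: powers of two, a classic sequence that tracks Benford's law.
sample = [2 ** n for n in range(1000)]
expected = benford_expected()
observed = first_digit_frequencies(sample)
for d in range(1, 10):
    print(d, round(expected[d], 4), round(observed[d], 4))
```

With real ORF counts in place of `sample`, the same comparison could be followed by a goodness-of-fit test, though the choice of test is left open here.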