We calculate the mutual information function for each of the 24 chromosomes in the human genome. The same correlation pattern is observed regardless of the individual functional features of each chromosome. Moreover, correlations at different length scales are detected, depicting a multifractal scenario. This fact suggests a unique mechanism of structural evolution. We propose that such a mechanism could be an expansion-modification dynamical system.
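For readers unfamiliar with the mutual information function, the sketch below shows one common way to compute it for a symbol sequence: the mutual information, in bits, between symbols separated by a distance k. The function name `mutual_information` and the estimation of probabilities from raw pair counts are assumptions made for illustration; the abstract does not state how the authors estimated it.

```python
from collections import Counter
from math import log2


def mutual_information(seq, k):
    """Mutual information (bits) between symbols k positions apart in seq.

    Probabilities are plain frequency estimates from the observed pairs;
    this is an illustrative choice, not necessarily the authors' estimator.
    """
    if k < 1 or k >= len(seq):
        raise ValueError("k must satisfy 1 <= k < len(seq)")
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    n = len(pairs)
    joint = Counter(pairs)                  # counts of (x_i, x_{i+k})
    left = Counter(a for a, _ in pairs)     # marginal counts of x_i
    right = Counter(b for _, b in pairs)    # marginal counts of x_{i+k}
    mi = 0.0
    for (a, b), c in joint.items():
        # p_ab / (p_a * p_b) simplifies to c * n / (left[a] * right[b])
        mi += (c / n) * log2(c * n / (left[a] * right[b]))
    return mi
```

Evaluating `mutual_information(seq, k)` over a range of k values gives the function of distance whose decay can then be inspected for correlations at several length scales.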
Next-generation sequencing technology enables routine detection of bacterial pathogens for clinical diagnostics and genetic research. Whole genome sequencing has been important in the epidemiologic analysis of bacterial pathogens. However, few whole genome sequencing-based genotyping pipelines are available for practical applications. Here, we present a whole genome sequencing-based single nucleotide polymorphism (SNP) genotyping method and apply it to the evolutionary analysis of methicillin-resistant Staphylococcus aureus. The SNP genotyping method calls genome variants using next-generation sequencing reads of whole genomes and calculates the pair-wise Jaccard distances of the genome variants. The method may reveal the high-resolution whole genome SNP profiles and the structural variants of different isolates of methicillin-resistant S. aureus (MRSA) and methicillin-susceptible S. aureus (MSSA) strains. Phylogenetic analysis of whole genomes and particular regions can be used to monitor and track the evolution and transmission dynamics of bacterial pathogens. The computer programs implementing the whole genome sequencing-based SNP genotyping method are publicly available at https://github.com/cyinbox/NGS.
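The pairwise Jaccard distance step can be sketched as follows. This is a minimal illustration, not the code from the repository above: it assumes each isolate's variant calls have already been reduced to a set of hypothetical (position, reference base, alternate base) tuples, and the names `jaccard_distance` and `pairwise_distances` are invented here.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance 1 - |A & B| / |A | B| between two variant sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty variant sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(variants_by_isolate):
    """Map each unordered pair of isolate names to its Jaccard distance."""
    names = sorted(variants_by_isolate)
    return {
        (x, y): jaccard_distance(variants_by_isolate[x], variants_by_isolate[y])
        for x, y in combinations(names, 2)
    }


# Hypothetical variant calls, for illustration only.
example = {
    "isolate_A": {(100, "A", "G"), (200, "C", "T")},
    "isolate_B": {(100, "A", "G"), (300, "G", "A")},
}
distances = pairwise_distances(example)
```

A distance matrix built this way could then be passed to any standard tree-building routine for the phylogenetic analysis.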
The increased affordability of whole genome sequencing has motivated its use for phenotypic studies. We address the problem of learning interpretable models for discrete phenotypes from whole genomes. We propose a general approach that relies on the Set Covering Machine and a k-mer representation of the genomes. We show results for the problem of predicting the resistance of Pseudomonas aeruginosa, an important human pathogen, against four antibiotics. Our results demonstrate that extremely sparse, biologically relevant models can be learnt using this approach.
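A k-mer representation of this kind can be sketched as a presence/absence matrix over all k-mers seen in a collection of genomes. The helper names `kmers` and `presence_matrix` are hypothetical, and training the Set Covering Machine on these features is not shown.

```python
def kmers(genome, k):
    """Return the set of all length-k substrings of a genome string."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def presence_matrix(genomes, k):
    """Build a binary genome-by-k-mer matrix.

    Returns (vocabulary, matrix), where matrix[i][j] is 1 if genome i
    contains vocabulary[j] and 0 otherwise.
    """
    kmer_sets = [kmers(g, k) for g in genomes]
    vocabulary = sorted(set().union(*kmer_sets))
    matrix = [[int(km in s) for km in vocabulary] for s in kmer_sets]
    return vocabulary, matrix


# Toy sequences, for illustration only.
vocabulary, matrix = presence_matrix(["ACGT", "CGTA"], k=3)
```

Each column of the matrix is a boolean feature ("this k-mer is present"), which is the kind of input from which a rule-based learner can build a short, readable model.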
This dissertation focuses on associations between functional elements in the human genome and their nucleotide structure. Asymmetry in nucleotide content (skew, bias) was chosen as the main feature of nucleotide structure. A significant difference in nucleotide content asymmetry was found between human exons and introns. Specifically, exon sequences display a bias for purines (i.e., an excess of A and G over C and T), while introns exhibit keto-amino skew (i.e., an excess of G and T over A and C). The extent of these biases depends on gene expression patterns. The highest intronic keto-amino skew is found in the introns of housekeeping genes. In introns, whose sequences are under a weak repair system, the AT->GC and CG->TA substitutions are preferentially accumulated. A comparative analysis of the gene sequences encoding cytochrome P450 2E1 in Homo sapiens and representative mammals was performed. A cladistic tree was constructed on the basis of coding sequence similarity of the Cyp2e1 gene. New programming tools for mining and analyzing NCBI database sequences were developed, resulting in the construction of a custom database.
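The two asymmetry measures described above can be illustrated with a short sketch. The normalization used here (difference of counts divided by the total number of A, C, G, and T) is an assumption; the abstract does not give the exact formula, and the function name `nucleotide_skews` is hypothetical.

```python
def nucleotide_skews(seq):
    """Return purine and keto-amino skews for a DNA sequence.

    purine skew     = (A + G - C - T) / (A + C + G + T)
    keto-amino skew = (G + T - A - C) / (A + C + G + T)
    Normalizing by the total count is an illustrative assumption.
    Characters other than A, C, G, T (e.g. N) are ignored.
    """
    seq = seq.upper()
    a, c, g, t = (seq.count(base) for base in "ACGT")
    total = a + c + g + t
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return {
        "purine": (a + g - c - t) / total,
        "keto_amino": (g + t - a - c) / total,
    }
```

Positive values indicate an excess of purines (A, G) or of keto bases (G, T) respectively, so exon and intron sequences can be compared on the same scale.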
Data on the number of Open Reading Frames (ORFs) coded by genomes from the 3 domains of Life show some notable general features including essential differences between the Prokaryotes and Eukaryotes, with the number of ORFs growing linearly with total genome size for the former, but only logarithmically for the latter. Assuming that the (protein) coding and non-coding fractions of the genome must have different dynamics and that the non-coding fraction must be controlled by a variety of (unspecified) probability distribution functions, we are able to predict that the number of ORFs for Eukaryotes follows a Benford distribution and has a specific logarithmic form. Using the data for 1000+ genomes available to us in early 2010, we find excellent fits to the data over several orders of magnitude, in the linear regime for the Prokaryote data, and the full non-linear form for the Eukaryote data. In their region of overlap the salient features are statistically congruent, which allows us to: interpret the difference between Prokaryotes and Eukaryotes as the manifestation of the increased demand in the biological functions required for the larger Eukaryotes, estimate some minimal genome sizes, and predict a maximal Prokaryote genome size on the order of 8-12 megabasepairs. These results naturally allow a mathematical interpretation in terms of maximal entropy and, therefore, most efficient information transmission.
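As background for the Benford distribution mentioned above, the sketch below computes the standard first-digit Benford probabilities, P(d) = log10(1 + 1/d), and the observed first-digit frequencies of a list of positive integer counts. It illustrates the general law only; the specific logarithmic form derived in the paper is not given in the abstract and is not reproduced here. The function names are hypothetical.

```python
from collections import Counter
from math import log10


def benford_probabilities():
    """Standard Benford first-digit probabilities for d = 1..9."""
    return {d: log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_frequencies(counts):
    """Observed frequency of each leading digit among positive integers."""
    digits = [int(str(n)[0]) for n in counts if n > 0]
    if not digits:
        raise ValueError("need at least one positive count")
    tally = Counter(digits)
    return {d: tally[d] / len(digits) for d in range(1, 10)}
```

Comparing the two dictionaries digit by digit, for example on ORF counts per genome, gives a quick visual check of how closely a data set follows the first-digit law.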
Understanding the relationship between genomic variation and variation in phenotypes for quantitative traits such as physiology, yield, fitness, or behavior will provide important insights both for predicting adaptive evolution and for breeding schemes. A particular question is whether the genetic variation that influences quantitative phenotypes is typically the result of one or two mutations of large effect, or multiple mutations of small effect. In this paper we explore this issue using the wild model legume Medicago truncatula. We show that phenotypes, such as quantitative disease resistance, can be well predicted using genome-wide patterns of admixture, from which it follows that there must be many mutations of small effect. Our findings demonstrate the potential of our novel whole-genome modeling (WhoGEM) method and experimentally validate, for the first time, the infinitesimal model as a mechanism for adaptation of quantitative phenotypes in plants. This insight can accelerate breeding and biomedical research programs.