Engineering the entire genome of an organism enables large-scale changes in organization, function, and external interactions, with significant implications for industry, medicine, and the environment. Improvements to DNA synthesis and organism engineering are already enabling substantial changes to organisms with megabase genomes, such as Escherichia coli and Saccharomyces cerevisiae. Simultaneously, recent advances in genome-scale modeling are increasingly informing the design of metabolic networks. However, major challenges remain for integrating these and other relevant technologies into workflows that can scale to the engineering of gigabase genomes. In particular, we find that a major under-recognized challenge is coordinating the flow of models, designs, constructs, and measurements across the large teams and complex technological systems that will likely be required for gigabase genome engineering. We recommend that the community address these challenges by 1) adopting and extending existing standards and technologies for representing and exchanging information at the gigabase genomic scale, 2) developing new technologies to address major open questions around data curation and quality control, 3) conducting fundamental research on the integration of modeling and design at the genomic scale, and 4) developing new legal and contractual infrastructure to better enable collaboration across multiple institutions.
Data on the number of Open Reading Frames (ORFs) coded by genomes from the three domains of life show some notable general features, including essential differences between the Prokaryotes and Eukaryotes: the number of ORFs grows linearly with total genome size for the former, but only logarithmically for the latter. Assuming that the (protein) coding and non-coding fractions of the genome must have different dynamics and that the non-coding fraction must be controlled by a variety of (unspecified) probability distribution functions, we are able to predict that the number of ORFs for Eukaryotes follows a Benford distribution and has a specific logarithmic form. Using the data for more than 1000 genomes available to us in early 2010, we find excellent fits to the data over several orders of magnitude, in the linear regime for the Prokaryote data, and the full non-linear form for the Eukaryote data. In their region of overlap the salient features are statistically congruent, which allows us to interpret the difference between Prokaryotes and Eukaryotes as the manifestation of the increased demand in the biological functions required for the larger Eukaryotes, estimate some minimal genome sizes, and predict a maximal Prokaryote genome size on the order of 8-12 megabasepairs. These results naturally allow a mathematical interpretation in terms of maximal entropy and, therefore, most efficient information transmission.
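As background for the Benford distribution mentioned above, the following minimal Python sketch computes the classic first-digit Benford probabilities, P(d) = log10(1 + 1/d), and compares them with the leading-digit frequencies of a list of positive integers. The abstract does not state the specific logarithmic form used for the ORF counts, so this illustrates only the standard first-digit law. The function names are hypothetical, and the example input (powers of 2) is a purely numerical stand-in, not genome data.

```python
import math
from collections import Counter


def benford_first_digit_probs():
    """Return {d: P(d)} for leading digits 1..9 under the classic Benford law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def observed_first_digit_freqs(values):
    """Return the relative frequency of each leading digit 1..9 in `values`.

    Only positive integers are counted; anything else is skipped.
    """
    counts = Counter(int(str(v)[0]) for v in values if isinstance(v, int) and v > 0)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no positive integers supplied")
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


if __name__ == "__main__":
    expected = benford_first_digit_probs()
    # Illustrative input only (assumption): powers of 2, not ORF counts.
    observed = observed_first_digit_freqs([2 ** n for n in range(1, 201)])
    for d in range(1, 10):
        print(f"digit {d}: Benford {expected[d]:.3f}  observed {observed[d]:.3f}")
```

To apply the same comparison to real data, one would pass a list of per-genome ORF counts in place of the powers of 2.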
Being able to store and transmit human genome sequences is an important part of genomic research and industrial applications. The complete haploid human genome has 3.1 billion base pairs, and storing it naively takes about 3 GB, which is impractical at large scale. However, human genomes are highly redundant: any given individual's genome differs from another individual's genome by less than 1%. Tools such as DNAZip express a given genome sequence by recording only the differences between it and a reference genome sequence. This allows lossless compression of the genome to ~4 MB. In this work, we demonstrate improvements on top of the DNAZip library that yield an additional ~11% compression beyond DNAZip's already impressive results. This would allow further savings in disk space and network costs for transmitting human genome sequences.
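The idea of storing only the differences from a reference can be sketched in a few lines of Python. This is a deliberately simplified illustration, not DNAZip's actual format or API: it assumes the sample and the reference have the same length and differ only by single-base substitutions, and it ignores insertions, deletions, and any entropy coding of the difference list. The function names `encode_diff` and `decode_diff` are hypothetical.

```python
def encode_diff(reference, sample):
    """Return a list of (position, base) pairs where `sample` differs from `reference`.

    Simplifying assumption: both sequences have equal length, so only
    substitutions can occur.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def decode_diff(reference, diffs):
    """Rebuild the sample sequence from the reference and the difference list."""
    bases = list(reference)
    for pos, base in diffs:
        bases[pos] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTAC"   # toy sequences, not real genome data
    sample = "ACGTTCGTAA"
    diffs = encode_diff(reference, sample)
    print(diffs)
    # The round trip is lossless: the sample is recovered exactly.
    assert decode_diff(reference, diffs) == sample
```

Because two human genomes differ at only a small fraction of positions, the difference list is far shorter than the sequence itself, which is where the compression comes from.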
The problem of the directionality of genome evolution is studied from an information-theoretic view. We propose that the function-coding information quantity of a genome always grows in the course of evolution through sequence duplication, expansion of code, and gene transfer between genomes. The function-coding information quantity of a genome consists of two parts: the p-coding information quantity, which encodes functional proteins, and the n-coding information quantity, which encodes functional elements other than amino acid sequences. The relation of the proposed law to the thermodynamic laws is indicated. The evolutionary trends of DNA sequences revealed by bioinformatics are investigated, and they afford further evidence for the evolutionary law. It is argued that the directionality of genome evolution comes from species competition adaptive to the environment. An expression for the evolutionary rate of the genome is proposed, in which the rate is a function of the Darwin temperature (describing species competition) and the fitness slope (describing the adaptive landscape). Finally, the problem of a direct experimental test of the evolutionary directionality is discussed briefly.
We calculate the mutual information function for each of the 24 chromosomes in the human genome. The same correlation pattern is observed regardless of the individual functional features of each chromosome. Moreover, correlations at different length scales are detected, depicting a multifractal scenario. This fact suggests a unique mechanism of structural evolution. We propose that such a mechanism could be an expansion-modification dynamical system.
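For readers unfamiliar with the mutual information function, the following Python sketch computes one common definition for a symbol sequence: I(k) = sum over symbol pairs (a, b) of p_ab(k) log2[p_ab(k) / (p_a p_b)], where p_ab(k) is the frequency with which a is followed by b at distance k. The abstract does not spell out the estimator used, so this plug-in estimate and the function name `mutual_information` are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(seq, k):
    """Plug-in estimate of the mutual information (in bits) between symbols
    separated by distance k in `seq`."""
    n = len(seq) - k  # number of symbol pairs at distance k
    if k < 1 or n < 1:
        raise ValueError("k must satisfy 1 <= k < len(seq)")
    pairs = Counter(zip(seq, seq[k:]))   # joint counts of (seq[i], seq[i+k])
    left = Counter(seq[:n])              # marginal counts of the first symbol
    right = Counter(seq[k:])             # marginal counts of the second symbol
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi


if __name__ == "__main__":
    toy = "ACGT" * 250  # toy periodic sequence, not chromosome data
    for k in (1, 2, 3, 4, 5):
        print(k, round(mutual_information(toy, k), 4))
```

Evaluating I(k) over a range of distances k gives the correlation profile of a sequence; for real chromosomes, finite-sample bias of the plug-in estimate would need to be considered at large k.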
Motivation: The rapid growth in genome-wide association studies (GWAS) in plants and animals has brought about the need for a central resource that facilitates i) performing GWAS, ii) accessing data and results of other GWAS, and iii) enabling all users regardless of their background to exploit the latest statistical techniques without having to manage complex software and computing resources. Results: We present easyGWAS, a web platform that provides methods, tools and dynamic visualizations to perform and analyze GWAS. In addition, easyGWAS makes it simple to reproduce results of others, validate findings, and access larger sample sizes through merging of public datasets. Availability: Detailed method and data descriptions as well as tutorials are available in the supplementary materials. easyGWAS is available at http://easygwas.tuebingen.mpg.de/.