Rich information on prebiotic evolution is still stored in contemporary genomic data. Statistical mechanisms at the sequence level may have played a significant role in prebiotic evolution. Statistical analysis of genome sequences reveals a close relationship between the evolution of the genetic code and the organisation of genomes. A biodiversity space for species is constructed by comparing the distributions of codons in the genomes of different species according to the recruitment order of codons in prebiotic evolution, which confirms a close relationship between the evolution of the genetic code and the tree of life. On the one hand, the tree of life can be reconstructed from the distance matrix of species in this biodiversity space, and the result supports the three-domain tree rather than the eocyte tree. On the other hand, an evolutionary tree of codons can be obtained by comparing the distributions of the 64 codons in genomes, and it agrees with the recruitment order of codons on the roadmap. This provides a simple phylogenomic method for studying the origin of metazoans, the evolution of primates, etc. This study should be regarded as an exploratory attempt to explain the diversification of the three domains of life by a statistical mechanism in prebiotic sequence evolution. The results suggest that the number of bases in triplet codons might be explained statistically by the number of strands in triplex DNA. The adaptation of life to a changing environment might be due to the assembly of redundant genomes at the sequence level.
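The idea of placing species in a space defined by genomic codon distributions and then deriving a distance matrix can be illustrated with a minimal sketch. The abstract does not specify how the codon distributions are computed or compared, so everything here is an assumption for illustration: relative codon frequencies in reading frame 0, Euclidean distance, and the hypothetical function names codon_frequencies and distance_matrix.

```python
from collections import Counter
from itertools import product
import math

# All 64 codons in a fixed order, so every genome maps to a 64-dimensional vector.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(seq):
    """Relative frequency of each of the 64 codons, read in frame 0.

    Assumption: a plain frequency vector stands in for the 'codon
    distribution' of the abstract, whose exact definition is not given.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    total = sum(counts[c] for c in CODONS)  # triplets with ambiguous bases are ignored
    if total == 0:
        raise ValueError("sequence contains no unambiguous codons")
    return [counts[c] / total for c in CODONS]


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_matrix(genomes):
    """Pairwise distances between species, given a dict of name -> sequence."""
    names = sorted(genomes)
    freqs = {name: codon_frequencies(genomes[name]) for name in names}
    matrix = [[euclidean(freqs[a], freqs[b]) for b in names] for a in names]
    return names, matrix
```

A matrix of this kind could then be passed to any standard distance-based tree-building method; which method the study used is not stated in the abstract.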
The post-genomic era has brought opportunities to bridge the traditionally separate fields that study the early history of life, and has brought new insight into the origin and evolution of biodiversity. Based on the distributions of codons in genome sequences, I found a relationship between the genetic code and the tree of life. This remote and profound relationship involves the origin and evolution of the genetic code and the diversification and expansion of genomes. Here, a prebiotic picture of triplex nucleic acid evolution is proposed to explain the origin of the genetic code, in which the transition from disorder to order in the origin of life might be due to the increasing stability of triplex base pairs. The codon degeneracy can be derived in detail from the coevolution of the genetic code with amino acids or, equivalently, the coevolution of tRNAs with aaRSs. This theory is based on experimental data such as the stability of triplex base pairs and the statistical features of genomic codon distributions. Several experimentally testable proposals have been developed. This study should be regarded as an exploratory attempt to reveal the early evolution of life from sequence information in a statistical manner.
Based on statistical analysis of complete genome sequences, a remote relationship has been observed between the evolution of the genetic code and the three-domain tree of life. The existence of such a remote relationship needs to be explained. The unity of the living system throughout the history of life relies on the common features of life: homochirality, the genetic code and the universal genome format. The universal genome format has been observed in the genomic codon distributions as a common feature of life at the sequence level. A main aim of this article is to reconstruct and explain the Phanerozoic biodiversity curve. It has been observed that the exponential growth rate of the Phanerozoic biodiversity curve is about equal to the exponential growth rate of genome size evolution. This strongly suggests that the expansion of genomes causes the exponential trend of the Phanerozoic biodiversity curve, where the conservative property during the evolution of life is guaranteed by the universal genome format at the sequence level. In addition, a consensus curve based on climatic and eustatic data is obtained to explain the fluctuations of the Phanerozoic biodiversity curve. Thus, the reconstructed biodiversity curve based on genomic, climatic and eustatic data agrees with Sepkoski's curve based on fossil data. The five mass extinctions can be discerned in this reconstructed biodiversity curve, which indicates a tectonic cause of the mass extinctions. Moreover, the declining origination and extinction rates throughout the Phanerozoic eon might be due to the growth trend in genome size evolution.
The worldwide COVID-19 pandemic has strongly intensified studies of the molecular mechanisms related to coronaviruses. The origin of coronaviruses and the risks of human-to-human, animal-to-human, and human-to-animal transmission of coronaviral infections can be understood only on a broader evolutionary level through detailed comparative studies. In this paper, we studied ribonucleocapsid assembly-packaging signals (RNAPS) in the genomes of all seven known pathogenic human coronaviruses, SARS-CoV, SARS-CoV-2, MERS-CoV, HCoV-OC43, HCoV-HKU1, HCoV-229E, and HCoV-NL63, and compared them with RNAPS in the genomes of related animal coronaviruses including SARS-Bat-CoV, MERS-Camel-CoV, MHV, Bat-CoV MOP1, TGEV, and one of the camel alphacoronaviruses. RNAPS in the genomes of coronaviruses evolved as a result of weakly specific interactions between genomic RNA and N proteins in helical nucleocapsids. Combining transitional genome mapping and Jaccard correlation coefficients allows us to perform the analysis directly in terms of the underlying motifs distributed over the genome. In all coronaviruses, RNAPS were distributed quasi-periodically over the genome with a period of about 54 nt, shifted toward 57 nt and 51 nt for genomes longer and shorter than that of SARS-CoV, respectively. The comparison with the experimentally verified packaging signals for MERS-CoV, MHV, and TGEV showed that the distribution of particular motifs is strongly correlated with the packaging signals. We also found that many motifs were highly conserved in both character and position on the genomes throughout the lineages, which makes them promising therapeutic targets. The mechanisms of encapsidation can affect recombination and co-infection as well.
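As a rough illustration of comparing motif repertoires between genomes, the sketch below computes the standard Jaccard index between two sets of motifs. The abstract does not define its "Jaccard correlation coefficients" or how motifs are extracted, so the set-based index, the use of plain k-mers as motifs, and the function names kmer_set and jaccard are all assumptions made for this example.

```python
def kmer_set(seq, k):
    """Set of all overlapping k-mers in a sequence (hypothetical motif extraction)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Standard Jaccard index |A ∩ B| / |A ∪ B| for two collections of motifs.

    Returns 0.0 when both collections are empty (a convention chosen here).
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)
```

For example, jaccard(kmer_set(genome_1, 6), kmer_set(genome_2, 6)) would give a value between 0 (no shared motifs) and 1 (identical repertoires) for two sequences genome_1 and genome_2.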
The genomic ssRNA of coronaviruses is packaged within a helical nucleocapsid. Due to transitional symmetry of a helix, weakly specific cooperative interaction between ssRNA and nucleocapsid proteins leads to the natural selection of specific quasi-periodic assembly/packaging signals in the related genomic sequence. Such signals, coordinated with the helical structure of the nucleocapsid, were detected and reconstructed in the genomes of the coronaviruses SARS-CoV and SARS-CoV-2. The main period of the signals for both viruses was about 54 nt, which implies 6.75 nt per N protein. Complete coverage of an ssRNA genome of about 30,000 nt by the nucleocapsid would need about 4,400 N proteins, which makes them the most abundant of the structural proteins. The repertoires of motifs for SARS-CoV and SARS-CoV-2 were divergent, but nearly coincided for different isolates of SARS-CoV-2. We obtained the distributions of assembly/packaging signals over the genomes using non-overlapping windows of width 432 nt. Finally, using the spectral entropy, we compared the load from point mutations and indels accumulated over the age of each virus for SARS-CoV and SARS-CoV-2. We found a higher mutational load on SARS-CoV. In this sense, SARS-CoV-2 can be treated as a newborn virus. These observations may be helpful in practical medical applications and are of basic interest.
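The numbers quoted in this abstract are linked by simple arithmetic, which the short sketch below makes explicit. Only the four input values come from the text; the variable names are hypothetical, and the derived counts (N proteins per period, periods per window) are inferences from those values rather than statements made by the authors.

```python
# Values quoted in the abstract.
PERIOD_NT = 54          # main period of the packaging signals, in nucleotides
NT_PER_N_PROTEIN = 6.75 # nucleotides covered per N protein
GENOME_NT = 30_000      # approximate genome length, in nucleotides
WINDOW_NT = 432         # width of the non-overlapping analysis windows

# Derived quantities (inferred here, not stated in the abstract).
n_proteins_per_period = PERIOD_NT / NT_PER_N_PROTEIN  # N proteins per 54-nt period
n_proteins_total = GENOME_NT / NT_PER_N_PROTEIN       # N proteins to cover the genome
periods_per_window = WINDOW_NT / PERIOD_NT            # signal periods per window

print(n_proteins_per_period)    # 8.0
print(round(n_proteins_total))  # 4444
print(periods_per_window)       # 8.0
```

Under these figures, the quoted 4,400 N proteins is a rounded value of about 4,444, and the 432-nt window spans exactly eight 54-nt periods.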
Technical progress over the last decades has made the accumulation of genome sequence data increasingly fast and cheap. The huge amount of molecular data now available can help address new and essential questions in evolution. However, reconstructing the evolution of DNA sequences requires models, algorithms, and statistical and computational methods of ever-increasing complexity. Since the most dramatic genomic changes are caused by genome rearrangements (gene duplications, gain/loss events), it becomes crucial to understand their mechanisms and to reconstruct the ancestors of the given genomes. This problem is known to be NP-complete even in the simplest case of three genomes. Heuristic algorithms are usually executed to provide approximations of the exact solution. We argue that, even if the ancestral reconstruction problem is NP-hard in theory, its exact resolution is feasible in various situations, encompassing organelles and some bacteria. Such an accurate reconstruction, which also identifies some highly homoplasic mutations whose ancestral status is undecidable, is initiated in this work in progress in order to reconstruct the ancestral genomes of two pathogenic Mycobacterium bacteria. By mixing automatic reconstruction of obvious situations with human intervention on the cases flagged as problematic, we indicate that it should be possible to achieve a concrete, complete, and accurate reconstruction of the lineages of the Mycobacterium tuberculosis complex. It thus becomes possible to investigate how these genomes have evolved from their last common ancestors.