No Arabic abstract
Genome-wide epistasis analysis is a powerful tool to infer gene interactions, which can guide drug and vaccine development and lead to a deeper understanding of microbial pathogenesis. We have considered all complete SARS-CoV-2 genomes deposited in the GISAID repository until textbf{four} different cut-off dates, and used Direct Coupling Analysis together with an assumption of Quasi-Linkage Equilibrium to infer epistatic contributions to fitness from polymorphic loci. We find textbf{eight} interactions, of which three between pairs where one locus lies in gene ORF3a, both loci holding non-synonymous mutations. We also find interactions between two loci in gene nsp13, both holding non-synonymous mutations, and four interactions involving one locus holding a synonymous mutation. Altogether we infer interactions between loci in viral genes ORF3a and nsp2, nsp12 and nsp6, between ORF8 and nsp4, and between loci in genes nsp2, nsp13 and nsp14. The paper opens the prospect to use prominent epistatically linked pairs as a starting point to search for combinatorial weaknesses of recombinant viral pathogens.
The novel coronavirus SARS-CoV-2, which emerged in late 2019, has since spread around the world infecting tens of millions of people with coronavirus disease 2019 (COVID-19). While this viral species was unknown prior to January 2020, its similarity to other coronaviruses that infect humans has allowed for rapid insight into the mechanisms that it uses to infect human hosts, as well as the ways in which the human immune system can respond. Here, we contextualize SARS-CoV-2 among other coronaviruses and identify what is known and what can be inferred about its behavior once inside a human host. Because the genomic content of coronaviruses, which specifies the viruss structure, is highly conserved, early genomic analysis provided a significant head start in predicting viral pathogenesis. The pathogenesis of the virus offers insights into symptomatology, transmission, and individual susceptibility. Additionally, prior research into interactions between the human immune system and coronaviruses has identified how these viruses can evade the immune systems protective mechanisms. We also explore systems-level research into the regulatory and proteomic effects of SARS-CoV-2 infection and the immune response. Understanding the structure and behavior of the virus serves to contextualize the many facets of the COVID-19 pandemic and can influence efforts to control the virus and treat the disease.
The genomic ssRNA of coronaviruses is packaged within a helical nucleocapsid. Due to transitional symmetry of a helix, weakly specific cooperative interaction between ssRNA and nucleocapsid proteins leads to the natural selection of specific quasi-periodic assembly/packaging signals in the related genomic sequence. Such signals coordinated with the nucleocapsid helical structure were detected and reconstructed in the genomes of the coronaviruses SARS-CoV and SARS-CoV-2. The main period of the signals for both viruses was about 54 nt, that implies 6.75 nt per N protein. The complete coverage of ssRNA genome of length about 30,000 nt by the nucleocapsid would need 4,400 N proteins, that makes them the most abundant among the structural proteins. The repertoires of motifs for SARS-CoV and SARS-CoV-2 were divergent but nearly coincided for different isolates of SARS-CoV-2. We obtained the distributions of assembly/packaging signals over the genomes with non-overlapping windows of width 432 nt. Finally, using the spectral entropy, we compared the load from point mutations and indels during virus age for SARS-CoV and SARS-CoV-2. We found the higher mutational load on SARS-CoV. In this sense, SARS-CoV-2 can be treated as a newborn virus. These observations may be helpful in practical medical applications and are of basic interest.
CovID-19 genetics analysis is critical to determine virus type,virus variant and evaluate vaccines. In this paper, SARS-Cov-2 RNA sequence analysis relative to region or territory is investigated. A uniform framework of sequence SVM model with various genetics length from short to long and mixed-bases is developed by projecting SARS-Cov-2 RNA sequence to different dimensional space, then scoring it according to the output probability of pre-trained SVM models to explore the territory or origin information of SARS-Cov-2. Different sample size ratio of training set and test set is also discussed in the data analysis. Two SARS-Cov-2 RNA classification tasks are constructed based on GISAID database, one is for mainland, Hongkong and Taiwan of China, and the other is a 6-class classification task (Africa, Asia, Europe, North American, South American& Central American, Ocean) of 7 continents. For 3-class classification of China, the Top-1 accuracy rate can reach 82.45% (train 60%, test=40%); For 2-class classification of China, the Top-1 accuracy rate can reach 97.35% (train 80%, test 20%); For 6-class classification task of world, when the ratio of training set and test set is 20% : 80% , the Top-1 accuracy rate can achieve 30.30%. And, some Top-N results are also given.
With the rapid spread of the novel coronavirus (COVID-19) across the globe and its continuous mutation, it is of pivotal importance to design a system to identify different known (and unknown) variants of SARS-CoV-2. Identifying particular variants helps to understand and model their spread patterns, design effective mitigation strategies, and prevent future outbreaks. It also plays a crucial role in studying the efficacy of known vaccines against each variant and modeling the likelihood of breakthrough infections. It is well known that the spike protein contains most of the information/variation pertaining to coronavirus variants. In this paper, we use spike sequences to classify different variants of the coronavirus in humans. We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance. We also show that we can train our model to outperform the baseline algorithms using only a small number of training samples ($1%$ of the data). Finally, we show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USAs Centers for Disease Control and Prevention (CDC).
By performing a comprehensive study on 1832 segments of 1212 complete genomes of viruses, we show that in viral genomes the hairpin structures of thermodynamically predicted RNA secondary structures are more abundant than expected under a simple random null hypothesis. The detected hairpin structures of RNA secondary structures are present both in coding and in noncoding regions for the four groups of viruses categorized as dsDNA, dsRNA, ssDNA and ssRNA. For all groups hairpin structures of RNA secondary structures are detected more frequently than expected for a random null hypothesis in noncoding rather than in coding regions. However, potential RNA secondary structures are also present in coding regions of dsDNA group. In fact we detect evolutionary conserved RNA secondary structures in conserved coding and noncoding regions of a large set of complete genomes of dsDNA herpesviruses.