By performing a comprehensive study of 1832 segments from 1212 complete viral genomes, we show that hairpin structures in thermodynamically predicted RNA secondary structures are more abundant in viral genomes than expected under a simple random null hypothesis. The detected hairpin structures are present in both coding and noncoding regions for the four groups of viruses categorized as dsDNA, dsRNA, ssDNA and ssRNA. For all groups, hairpin structures are detected more frequently than expected under a random null hypothesis in noncoding regions rather than in coding regions. However, potential RNA secondary structures are also present in coding regions of the dsDNA group. In fact, we detect evolutionarily conserved RNA secondary structures in conserved coding and noncoding regions of a large set of complete genomes of dsDNA herpesviruses.
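The comparison against a random null can be sketched as follows. This is a minimal illustration, not the study's pipeline. It assumes structures are given in dot-bracket notation, that the null model is a mononucleotide shuffle of the sequence (the abstract does not specify the null), and that `fold` is a hypothetical callable supplied by the reader that maps a sequence to a predicted dot-bracket string.

```python
import math
import random
import re


def count_hairpins(dot_bracket):
    """Count hairpin loops: '(' followed by one or more '.' and then ')'."""
    return len(re.findall(r"\(\.+\)", dot_bracket))


def shuffled(seq, rng):
    """Return a copy of seq with its letters randomly permuted."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def hairpin_z_score(seq, fold, n_shuffles=100, seed=0):
    """Z-score of the hairpin count of seq against shuffled copies.

    fold is a hypothetical user-supplied function: sequence -> dot-bracket.
    The shuffle null is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    observed = count_hairpins(fold(seq))
    null = [count_hairpins(fold(shuffled(seq, rng))) for _ in range(n_shuffles)]
    mean = sum(null) / len(null)
    var = sum((x - mean) ** 2 for x in null) / (len(null) - 1)
    if var == 0:
        # Degenerate null distribution: report 0 or a signed infinity.
        if observed == mean:
            return 0.0
        return math.copysign(math.inf, observed - mean)
    return (observed - mean) / math.sqrt(var)
```

A positive z-score would indicate more hairpins than the shuffled copies produce on average. Any real use would need an actual folding routine in place of `fold`.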
Our work is concerned with the generation and targeted design of RNA, a type of genetic macromolecule that can adopt complex structures which influence its cellular activities and functions. The design of large-scale and complex biological structures calls for dedicated graph-based deep generative modeling techniques, which represent a key but underappreciated aspect of computational drug discovery. In this work, we investigate the principles behind representing and generating different RNA structural modalities, and propose a flexible framework to jointly embed and generate these molecular structures along with their sequence in a meaningful latent space. Equipped with a deep understanding of RNA molecular structures, our most sophisticated encoding and decoding methods operate on the molecular graph as well as the junction tree hierarchy, integrating strong inductive biases about RNA structural regularity and folding mechanisms such that high structural validity, stability and diversity of generated RNAs are achieved. We also seek to organize the latent space of RNA molecular embeddings with regard to interactions with proteins, and use targeted optimization to navigate this latent space in search of desired novel RNA molecules.
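As an illustration of the molecular-graph view mentioned above, the sketch below turns a sequence and its secondary structure into a labeled edge list. It assumes dot-bracket input without pseudoknots. The function name and the edge labels `"backbone"` and `"pair"` are hypothetical choices for this example and are not taken from the paper. The junction tree hierarchy is not covered here.

```python
def structure_graph(sequence, dot_bracket):
    """Build (nodes, edges) for an RNA secondary structure.

    Nodes are nucleotides indexed by position. Edges are tuples
    (i, j, kind), where kind is "backbone" for consecutive positions
    or "pair" for a base pair. Labels are illustrative assumptions.
    """
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")
    nodes = list(sequence)
    # Backbone edges link each nucleotide to its successor.
    edges = [(i, i + 1, "backbone") for i in range(len(sequence) - 1)]
    stack = []
    for j, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            edges.append((stack.pop(), j, "pair"))
        elif symbol != ".":
            raise ValueError("unexpected symbol: " + symbol)
    if stack:
        raise ValueError("unbalanced structure")
    return nodes, edges
```

For example, `structure_graph("GGGAAACCC", "(((...)))")` yields eight backbone edges and three pair edges, which a graph neural network encoder could then consume as typed edges.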
Genome-wide epistasis analysis is a powerful tool to infer gene interactions, which can guide drug and vaccine development and lead to a deeper understanding of microbial pathogenesis. We have considered all complete SARS-CoV-2 genomes deposited in the GISAID repository up to \textbf{four} different cut-off dates, and used Direct Coupling Analysis together with an assumption of Quasi-Linkage Equilibrium to infer epistatic contributions to fitness from polymorphic loci. We find \textbf{eight} interactions, of which three are between pairs where one locus lies in gene ORF3a, with both loci holding non-synonymous mutations. We also find interactions between two loci in gene nsp13, both holding non-synonymous mutations, and four interactions involving one locus holding a synonymous mutation. Altogether we infer interactions between loci in viral genes ORF3a and nsp2, nsp12 and nsp6, between ORF8 and nsp4, and between loci in genes nsp2, nsp13 and nsp14. The paper opens the prospect of using prominent epistatically linked pairs as a starting point in the search for combinatorial weaknesses of recombinant viral pathogens.
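To make the coupling-inference step concrete, here is a minimal sketch of one common DCA variant, the naive mean-field approximation, in which couplings are read off as the negated off-diagonal entries of the inverse covariance matrix. This is an assumption for illustration: the abstract does not say which DCA inference scheme was used. Each genome is simplified to a list of 0/1 values per locus (0 = major allele, 1 = minor allele), and the pseudocount value is arbitrary. The rescaling from couplings to epistatic fitness under Quasi-Linkage Equilibrium depends on recombination parameters not given in the abstract and is omitted.

```python
def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting for a small matrix."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mean_field_couplings(genomes, pseudocount=0.1):
    """Naive mean-field DCA for binary loci: J_ij = -(C^-1)_ij for i != j."""
    m, n = len(genomes), len(genomes[0])
    lam = pseudocount
    # Pseudocount-regularised single-site frequencies of the minor allele.
    freq = [(1 - lam) * sum(g[i] for g in genomes) / m + lam / 2
            for i in range(n)]
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                pair = freq[i]
            else:
                joint = sum(g[i] * g[j] for g in genomes) / m
                pair = (1 - lam) * joint + lam / 4
            cov[i][j] = pair - freq[i] * freq[j]
    inverse = invert(cov)
    return [[0.0 if i == j else -inverse[i][j] for j in range(n)]
            for i in range(n)]
```

Ranking locus pairs by the absolute value of the returned couplings gives a list of candidate interactions. On real data, the inversion would normally be done with a numerical library rather than this hand-written routine.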
Cell type (e.g. pluripotent cell, fibroblast) is the end result of many complex processes that unfold due to evolutionary, developmental, and transformational stimuli. A cell's phenotype and the discrete, a priori states that define various cell subtypes (e.g. skin fibroblast, embryonic stem cell) are ultimately part of a continuum that may predict changes and systematic variation in cell subtypes. These features can be both observable in existing cellular states and hypothetical (e.g. unobserved). In this paper, a series of approaches will be used to approximate the continuous diversity of gene expression across a series of pluripotent, totipotent, and fibroblast cellular subtypes. We will use a series of previously collected datasets and analyze them using three complementary approaches: the computation of distances based on the subsampling of diversity, an assessment of the separability of individual genes for a specific cell line both within and between cell types, and a hierarchical soft classification technique that will assign a membership value for specific genes in specific cell types given a number of different criteria. These approaches will allow us to assess the observed gene-expression diversity in these datasets, as well as how well a priori cell types characterize their constituent populations. In conclusion, the application of these findings to a broader biological context will be discussed.
This paper develops a formulation of the quasispecies equations appropriate for polysomic, semiconservatively replicating genomes. This paper is an extension of previous work on the subject, which considered the case of haploid genomes. Here, we develop a more general formulation of the quasispecies equations that is applicable to diploid and even polyploid genomes. Interestingly, with an appropriate classification of population fractions, we obtain a system of equations that is formally identical to the haploid case. As with the work for haploid genomes, we consider both random and immortal DNA strand chromosome segregation mechanisms. However, in contrast to the haploid case, we have found that an analytical solution for the mean fitness is considerably more difficult to obtain for the polyploid case. Accordingly, whereas for the haploid case we obtained expressions for the mean fitness for the case of an analogue of the single-fitness-peak landscape for arbitrary lesion repair probabilities (thereby allowing for non-complementary genomes), here we solve for the mean fitness for the restricted case of perfect lesion repair.
Given a random RNA secondary structure, $S$, we study RNA sequences with fixed ratios of nucleotides that are compatible with $S$. We perform this analysis for RNA secondary structures subject to various base pairing rules and minimum arc- and stack-length restrictions. Our main result reads as follows: in the simplex of the nucleotide ratios there exists a convex region in which, in the limit of long sequences, a random structure asymptotically almost surely (a.a.s.) has a compatible sequence with these ratios, and outside of which a random structure a.a.s.~has no such compatible sequence. We localize this region for RNA secondary structures subject to various base pairing rules and minimum arc- and stack-length restrictions. In particular, for {\bf GC}-sequences having a ratio of {\bf G} nucleotides smaller than $1/3$, a random RNA secondary structure without any minimum arc- and stack-length restrictions has a.a.s.~no such compatible sequence. For sequences having a ratio of {\bf G} nucleotides larger than $1/3$, a random RNA secondary structure a.a.s.~has such compatible sequences. We discuss our results in the context of various families of RNA structures.
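The notion of compatibility can be illustrated for a single finite structure. The sketch below assumes a pairing rule in which only G-C pairs are allowed and a structure given in dot-bracket notation. Both are assumptions of this example, as is the function name. Under that rule, a structure with $k$ arcs admits a compatible sequence with a prescribed number of G nucleotides exactly when both the G count and the C count are at least $k$. The code checks this and builds one such sequence. It does not reproduce the asymptotic result stated above.

```python
def compatible_gc_sequence(dot_bracket, n_g):
    """Return a GC sequence with exactly n_g G's compatible with the
    structure, or None if no such sequence exists.

    Assumes only G-C base pairs are allowed (illustrative rule).
    """
    n = len(dot_bracket)
    if not 0 <= n_g <= n:
        raise ValueError("n_g must lie between 0 and the structure length")
    depth = 0
    for symbol in dot_bracket:
        if symbol == "(":
            depth += 1
        elif symbol == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced structure")
        elif symbol != ".":
            raise ValueError("unexpected symbol: " + symbol)
    if depth != 0:
        raise ValueError("unbalanced structure")
    arcs = dot_bracket.count("(")
    n_c = n - n_g
    # Every arc consumes one G and one C.
    if arcs > n_g or arcs > n_c:
        return None
    spare_g = n_g - arcs
    seq = []
    for symbol in dot_bracket:
        if symbol == "(":
            seq.append("G")  # arc start
        elif symbol == ")":
            seq.append("C")  # arc end, pairs with the G at the arc start
        elif spare_g > 0:
            seq.append("G")  # unpaired position, use up remaining G's
            spare_g -= 1
        else:
            seq.append("C")
    return "".join(seq)
```

Here `compatible_gc_sequence("((..))", 2)` returns a valid sequence, while `compatible_gc_sequence("((..))", 1)` returns `None` because two arcs cannot be formed with a single G.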