Identifying the secondary structure of an RNA is crucial for understanding its diverse regulatory functions. This paper focuses on how to enhance target identification in a Boltzmann ensemble of structures using chemical probing data. We take an information-theoretic approach to the problem by considering a variant of the Rényi-Ulam game. Our framework is centered around the ensemble tree, a hierarchical bi-partition of the input ensemble that is constructed by recursively querying whether or not a base pair of maximum information entropy is contained in the target. These queries are answered by relating local to global probing data, exploiting the modularity of RNA secondary structures. We show that the leaves of the tree consist of sub-samples exhibiting a distinguished structure with high probability. In particular, for a Boltzmann ensemble incorporating probing data, which is well established in the literature, the probability that our framework correctly identifies the target in the leaf is greater than $90\%$.
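The recursive querying described above can be illustrated with a minimal sketch. It assumes a sample of structures given as sets of base pairs and a hypothetical `oracle` callable that answers whether a base pair belongs to the target; the paper answers such queries from probing data, which this sketch does not model. The function names and the data representation are assumptions made for illustration only.

```python
import math
from collections import Counter


def binary_entropy(p):
    """Shannon entropy (in bits) of a yes/no event with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def max_entropy_pair(sample):
    """Return the base pair whose presence is most uncertain in the sample.

    `sample` is a list of structures, each a frozenset of (i, j) tuples.
    Returns (None, 0.0) if every base pair is present in all structures.
    """
    counts = Counter(bp for structure in sample for bp in structure)
    n = len(sample)
    best, best_h = None, 0.0
    for bp in sorted(counts):  # sorted for a deterministic tie-break
        h = binary_entropy(counts[bp] / n)
        if h > best_h:
            best, best_h = bp, h
    return best, best_h


def descend_to_leaf(sample, oracle):
    """Follow one root-to-leaf path of the bipartition tree.

    `oracle(bp)` is a hypothetical stand-in for the probing-based answer:
    it returns True if base pair `bp` is in the target structure.
    """
    while True:
        bp, h = max_entropy_pair(sample)
        if bp is None:
            return sample  # no informative query remains
        answer = oracle(bp)
        # Keep only structures consistent with the answer.
        sample = [s for s in sample if (bp in s) == answer]


sample = [
    frozenset({(1, 10), (2, 9)}),
    frozenset({(1, 10), (3, 8)}),
    frozenset({(2, 9)}),
    frozenset({(1, 10), (2, 9)}),
]
target = frozenset({(1, 10), (3, 8)})
leaf = descend_to_leaf(sample, lambda bp: bp in target)
```

With an error-free oracle and a target that occurs in the sample, the leaf contains only copies of the target. With noisy answers, as in real probing data, the leaf is only correct with some probability.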
Ribonucleic acid (RNA) is involved in many regulatory and catalytic processes in the cell. The function of any RNA molecule is intimately related to its structure. In-line probing experiments provide valuable structural datasets for a variety of RNAs and are used to characterize conformational changes in riboswitches. However, the structural determinants that lead to differential reactivities in unpaired nucleotides have not yet been investigated. In this work we used a combination of theoretical approaches, i.e., classical molecular dynamics simulations, multiscale quantum mechanical/molecular mechanical calculations, and enhanced sampling techniques, to compute and interpret the differential reactivity of individual residues in several RNA motifs, including members of the most important GNRA and UNCG tetraloop families. Simulations on the multi-nanosecond timescale are required to converge the related free-energy landscapes. The results for uGAAAg and cUUCGg tetraloops and double helices are compared with available data from in-line probing experiments and show that the introduced technique is able to distinguish between nucleotides of the uGAAAg tetraloop based on their structural predispositions towards phosphodiester backbone cleavage. For the cUUCGg tetraloop, more advanced ab initio calculations would be required. This study is the first attempt to computationally classify chemical probing experiments and paves the way for an identification of tertiary structures based on the measured reactivity of non-reactive nucleotides.
In this paper we enumerate $k$-noncrossing RNA pseudoknot structures with given minimum stack-length. We show that the numbers of $k$-noncrossing structures without isolated base pairs are significantly smaller than the number of all $k$-noncrossing structures. In particular we prove that the number of 3- and 4-noncrossing RNA structures with stack-length $\ge 2$ is for large $n$ given by $311.2470\,\frac{4!}{n(n-1)\cdots(n-4)}\,2.5881^n$ and $1.217\cdot 10^{7}\, n^{-21/2}\, 3.0382^n$, respectively. We furthermore show that for $k$-noncrossing RNA structures the drop in exponential growth rates between the number of all structures and the number of all structures with stack-size $\ge 2$ increases significantly. Our results are of importance for prediction algorithms for pseudoknot-RNA and provide evidence that there exist neutral networks of RNA pseudoknot structures.
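To get a feel for the two asymptotic expressions quoted above, the following sketch evaluates them on a log scale, which avoids floating-point overflow for long sequences. The function names are illustrative assumptions. The constants are taken directly from the formulas in the abstract, which hold only for large $n$.

```python
import math


def log10_count_3_noncrossing(n):
    """log10 of 311.2470 * 4! / (n(n-1)...(n-4)) * 2.5881**n (stack-length >= 2)."""
    if n < 5:
        raise ValueError("the falling factorial n(n-1)...(n-4) needs n >= 5")
    log_falling = sum(math.log10(n - i) for i in range(5))
    return (math.log10(311.2470) + math.log10(math.factorial(4))
            - log_falling + n * math.log10(2.5881))


def log10_count_4_noncrossing(n):
    """log10 of 1.217e7 * n**(-21/2) * 3.0382**n (stack-length >= 2)."""
    if n < 1:
        raise ValueError("n must be positive")
    return math.log10(1.217e7) - 10.5 * math.log10(n) + n * math.log10(3.0382)


def growth_ratio(log10_count, n):
    """Ratio of the estimates at n + 1 and n; tends to the exponential growth rate."""
    return 10 ** (log10_count(n + 1) - log10_count(n))


for n in (50, 100, 1000):
    print(n, growth_ratio(log10_count_3_noncrossing, n),
          growth_ratio(log10_count_4_noncrossing, n))
```

The successive ratios approach 2.5881 and 3.0382 from below as $n$ grows, because the polynomial factors $n^{-5}$ and $n^{-21/2}$ slow the growth at small $n$.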
The information content of symbolic sequences (such as nucleic- or amino acid sequences, but also neuronal firings or strings of letters) can be calculated from an ensemble of such sequences, but because information cannot be assigned to single sequences, we cannot correlate information to other observables attached to the sequence. Here we show that an information score obtained from multivariate (multiple-variable) correlations within sequences of a training ensemble can be used to predict observables of out-of-sample sequences with an accuracy that scales with the complexity of correlations, showing that functional information emerges from a hierarchy of multi-variable correlations.
Consistently predicting biopolymer structure at atomic resolution from sequence alone remains a difficult problem, even for small sub-segments of large proteins. Such loop prediction challenges, which arise frequently in comparative modeling and protein design, can become intractable as loop lengths exceed 10 residues and if surrounding side-chain conformations are erased. This article introduces a modeling strategy based on a stepwise ansatz, recently developed for RNA modeling, which posits that any realistic all-atom molecular conformation can be built up by residue-by-residue stepwise enumeration. When harnessed to a dynamic-programming-like recursion in the Rosetta framework, the resulting stepwise assembly (SWA) protocol enables enumerative sampling of a 12 residue loop at a significant but achievable cost of thousands of CPU-hours. In a previously established benchmark, SWA recovers crystallographic conformations with sub-Angstrom accuracy for 19 of 20 loops, compared to 14 of 20 by KIC modeling with a comparable expenditure of computational power. Furthermore, SWA gives high accuracy results on an additional set of 15 loops highlighted in the biological literature for their irregularity or unusual length. Successes include cis-Pro touch turns, loops that pass through tunnels of other side-chains, and loops of lengths up to 24 residues. Remaining problem cases are traced to inaccuracies in the Rosetta all-atom energy function. In five additional blind tests, SWA achieves sub-Angstrom accuracy models, including the first such success in a protein/RNA binding interface, the YbxF/kink-turn interaction in the fourth RNA-puzzle competition. These results establish all-atom enumeration as a systematic approach to protein structure that can leverage high performance computing and physically realistic energy functions to more consistently achieve atomic resolution.
We propose a new topological characterization of RNA secondary structures with pseudoknots based on two topological invariants. Starting from the classic arc-representation of RNA secondary structures, we consider a model that couples both I) the topological genus of the graph and II) the number of crossing arcs of the corresponding primitive graph. We add a term proportional to these topological invariants to the standard free energy of the RNA molecule, thus obtaining a novel free energy parametrization which takes into account the abundance of topologies of RNA pseudoknots observed in RNA databases.
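As an illustration of the first invariant, the sketch below computes the topological genus of an arc diagram. It uses the standard fatgraph construction, which is an assumption here since the abstract does not spell out its procedure: the backbone is collapsed to a single vertex, boundary components are counted by tracing the permutation "go to the partner, then step to the next endpoint", and the genus follows from Euler's relation $2 - 2g = 1 - n + r$ for $n$ arcs and $r$ boundary components. The function name is illustrative, and the second invariant (crossings in the primitive graph) is not covered.

```python
def genus(base_pairs):
    """Topological genus of an arc diagram given as a list of (i, j) base pairs.

    Unpaired positions do not affect the genus, so only arc endpoints are used.
    """
    endpoints = sorted(pos for pair in base_pairs for pos in pair)
    if len(set(endpoints)) != len(endpoints):
        raise ValueError("each nucleotide may take part in at most one base pair")
    n = len(base_pairs)
    if n == 0:
        return 0

    # Relabel endpoints 0 .. 2n-1 in backbone order and record partners.
    rank = {pos: k for k, pos in enumerate(endpoints)}
    partner = {}
    for i, j in base_pairs:
        partner[rank[i]] = rank[j]
        partner[rank[j]] = rank[i]

    # Count cycles of k -> partner[k] + 1 (mod 2n); each cycle is one boundary.
    m = 2 * n
    seen = [False] * m
    boundaries = 0
    for start in range(m):
        if seen[start]:
            continue
        boundaries += 1
        k = start
        while not seen[k]:
            seen[k] = True
            k = (partner[k] + 1) % m

    # Euler's relation 2 - 2g = 1 - n + r, solved for g.
    return (n + 1 - boundaries) // 2


nested = [(1, 10), (2, 9)]            # pseudoknot-free stack
h_type = [(1, 5), (3, 8)]             # two crossing arcs (H-type pseudoknot)
kissing = [(1, 4), (3, 7), (6, 9)]    # kissing-hairpin pattern
```

Under this construction the nested diagram has genus 0, while the H-type and kissing-hairpin diagrams both have genus 1. The genus alone therefore does not separate the latter two.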