Consistently predicting biopolymer structure at atomic resolution from sequence alone remains a difficult problem, even for small sub-segments of large proteins. Such loop prediction challenges, which arise frequently in comparative modeling and protein design, can become intractable when loop lengths exceed 10 residues and when surrounding side-chain conformations are erased. This article introduces a modeling strategy based on a stepwise ansatz, recently developed for RNA modeling, which posits that any realistic all-atom molecular conformation can be built up by residue-by-residue stepwise enumeration. When harnessed to a dynamic-programming-like recursion in the Rosetta framework, the resulting stepwise assembly (SWA) protocol enables enumerative sampling of a 12-residue loop at a significant but achievable cost of thousands of CPU-hours. In a previously established benchmark, SWA recovers crystallographic conformations with sub-Angstrom accuracy for 19 of 20 loops, compared to 14 of 20 by KIC modeling with a comparable expenditure of computational power. Furthermore, SWA gives high-accuracy results on an additional set of 15 loops highlighted in the biological literature for their irregularity or unusual length. Successes include cis-Pro touch turns, loops that pass through tunnels of other side-chains, and loops of lengths up to 24 residues. Remaining problem cases are traced to inaccuracies in the Rosetta all-atom energy function. In five additional blind tests, SWA achieves models with sub-Angstrom accuracy, including the first such success in a protein/RNA binding interface, the YbxF/kink-turn interaction in the fourth RNA-puzzle competition. These results establish all-atom enumeration as a systematic approach to protein structure that can leverage high-performance computing and physically realistic energy functions to more consistently achieve atomic resolution.
Atomic-accuracy structure prediction of macromolecules is a long-sought goal of computational biophysics. Accurate modeling should be achievable by optimizing a physically realistic energy function but is presently precluded by incomplete sampling of a biopolymer's many degrees of freedom. We present herein a working hypothesis, called the stepwise ansatz, for recursively constructing well-packed atomic-detail models in small steps, enumerating several million conformations for each monomer and covering all build-up paths. By implementing the strategy in Rosetta and making use of high-performance computing, we provide first tests of this hypothesis on a benchmark of fifteen RNA loop modeling problems drawn from riboswitches, ribozymes, and the ribosome, including ten cases that were not solvable by prior knowledge-based modeling approaches. For each loop problem, this deterministic stepwise assembly (SWA) method either reaches atomic accuracy or exposes flaws in Rosetta's all-atom energy function, indicating the resolution of the conformational sampling bottleneck. To our knowledge, SWA is the first enumerative, ab initio build-up method to systematically outperform existing Monte Carlo and knowledge-based methods for 3D structure prediction. As a rigorous experimental test, we have applied SWA to a small RNA motif of previously unknown structure, the C7.2 tetraloop/tetraloop-receptor, and stringently tested this blind prediction with nucleotide-resolution structure mapping data.
Protein-RNA interactions are vital to a variety of cellular activities. Both experimental and computational techniques have been developed to study them. Because earlier databases were limited, especially in protein structure data, most existing computational methods rely heavily on sequence data, and only a small portion use structural information. Recently, AlphaFold has revolutionized the protein and biology fields. Protein-RNA interaction prediction can likewise be expected to advance significantly in the coming years. In this work, we give a thorough review of this field, surveying both the binding site and binding preference prediction problems and covering the commonly used datasets, features, and models. We also point out the potential challenges and opportunities in this field. This survey summarizes the past development of the RBP-RNA interaction field and anticipates its future development in the post-AlphaFold era.
Three-dimensional RNA models fitted into crystallographic density maps exhibit pervasive conformational ambiguities, geometric errors and steric clashes. To address these problems, we present enumerative real-space refinement assisted by electron density under Rosetta (ERRASER), coupled to Python-based hierarchical environment for integrated xtallography (PHENIX) diffraction-based refinement. On 24 data sets, ERRASER automatically corrects the majority of MolProbity-assessed errors, improves the average Rfree factor, resolves functionally important discrepancies in noncanonical structure and refines low-resolution models to better match higher-resolution models.
We present a novel topological classification of RNA secondary structures with pseudoknots. It is based on the topological genus of the circular diagram associated with the RNA base-pair structure. The genus is a non-negative integer whose value quantifies the topological complexity of the folded RNA structure. In such a representation, planar diagrams correspond to pure RNA secondary structures and have zero genus, whereas nonplanar diagrams correspond to pseudoknotted structures and have higher genus. We analyze real RNA structures from the databases wwPDB and Pseudobase, and classify them according to their topological genus. We compare the results of our statistical survey with existing theoretical and numerical models. We also discuss possible applications of this classification and show how it can be used for identifying new RNA structural motifs.
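As an illustration of the genus described above, the following minimal sketch computes the genus of a circular diagram from a list of base pairs. It is not taken from the paper; it assumes the standard Euler-characteristic construction for chord diagrams, in which a diagram with n chords and b boundary components has genus g = (n + 1 - b)/2, and it assumes that unpaired positions can be ignored because they do not change the topology. The function name `topological_genus` and the input convention (pairs of integer sequence positions) are hypothetical choices made for this example.

```python
from typing import Iterable, Tuple


def topological_genus(pairs: Iterable[Tuple[int, int]]) -> int:
    """Genus of the circular (chord) diagram defined by base pairs.

    Assumption: each pair (i, j) holds two distinct sequence positions,
    and every position takes part in at most one pair.
    """
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        return 0  # no chords: the diagram is trivially planar

    # Keep only paired positions and relabel them 0..2n-1 in backbone order.
    positions = sorted(p for pair in pairs for p in pair)
    if len(set(positions)) != 2 * n:
        raise ValueError("each position may appear in at most one base pair")
    rank = {pos: k for k, pos in enumerate(positions)}

    # partner[k] is the chord endpoint paired with endpoint k.
    partner = {}
    for i, j in pairs:
        partner[rank[i]] = rank[j]
        partner[rank[j]] = rank[i]

    # Boundary components are the cycles of the map k -> partner[k] + 1
    # (cross the chord, then step to the next endpoint on the circle).
    seen = set()
    boundaries = 0
    for start in range(2 * n):
        if start in seen:
            continue
        boundaries += 1
        k = start
        while k not in seen:
            seen.add(k)
            k = (partner[k] + 1) % (2 * n)

    # Euler characteristic of the closed surface: 1 - n + b = 2 - 2g.
    return (n + 1 - boundaries) // 2


# A simple hairpin (nested pairs) is planar.
print(topological_genus([(1, 8), (2, 7)]))  # 0
# An H-type pseudoknot (two crossing helices) has genus 1.
print(topological_genus([(1, 10), (2, 9), (5, 15), (6, 14)]))  # 1
```

Stacked pairs within one helix are nested, so they add chords and boundary components in equal numbers and leave the genus unchanged; only crossings between helices raise it.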
Background: Proteins typically perform key biological functions by interacting with each other. As a consequence, predicting which protein pairs interact is a fundamental problem. Experimental methods are slow, expensive, and can be error-prone. Many computational methods have been proposed to identify candidate interacting pairs. When accurate, they can serve as an inexpensive preliminary filtering stage, to be followed by downstream experimental validation. Among such methods, sequence-based ones are very promising. Results: We present MPS(T&B) (Maximum Protein Similarity Topological and Biological), a new algorithm that leverages both topological and biological information to predict protein-protein interactions. We comprehensively compare MPS(T&B) and its topological-only counterpart MPS(T) with state-of-the-art approaches on reliable PPI datasets, showing that they have competitive or higher accuracy on biologically validated test sets. Conclusion: MPS(T) and MPS(T&B) are, respectively, topological-only and topological-plus-sequence-based computational methods that can effectively predict the entire human interactome.