A method for partitioning the information contained in a protein sequence between its structure and function

Added by Guido Tiana

Publication date 2018

fields Biology

Language English

Authors A. Possenti - M. Vendruscolo - C. Camilloni

Biomolecules

Proteins employ the information stored in the genetic code and translated into their sequences to carry out well-defined functions in the cellular environment. The possibility of encoding such functions is controlled by the balance between the amount of information supplied by the sequence and that left after the protein has folded into its structure. We developed a computational algorithm to evaluate the amount of information necessary to specify the protein structure, taking into account the thermodynamic properties of protein folding. We thus show that the information remaining in the protein sequence after encoding its structure (the information gap) is very close to what is needed to encode its function and interactions. Then, by predicting the information gap directly from the protein sequence, we show that it may be possible to use these insights from information theory to discriminate between ordered and disordered proteins, to identify unknown functions, and to optimize designed protein sequences.
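The bookkeeping behind the information gap can be illustrated with a minimal sketch. It is not the authors' algorithm: it only assumes a 20-letter amino-acid alphabet, uses the uniform-choice upper bound of log2(20) bits per residue, and treats the structural information as a number supplied by the caller. All function names are hypothetical.

```python
import math
from collections import Counter

# Assumption: the standard 20-letter amino-acid alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def max_sequence_information(length: int) -> float:
    """Upper bound in bits, if every position were chosen uniformly among 20 residues."""
    return length * math.log2(len(AMINO_ACIDS))


def composition_entropy(sequence: str) -> float:
    """Shannon entropy (bits per residue) of the amino-acid composition of one sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gap(sequence_bits: float, structure_bits: float) -> float:
    """Information left in the sequence after accounting for the structure.

    structure_bits is a caller-supplied placeholder; estimating it is the
    hard part and is not attempted here.
    """
    return sequence_bits - structure_bits
```

For example, `max_sequence_information(100)` gives the upper bound for a 100-residue chain, and `information_gap` simply subtracts whatever structural estimate is passed in.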

Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information

201 - Seonwoo Min , Seunghyun Park , Siwon Kim 2019

Motivation: To bridge the exponentially growing gap between the number of unlabeled and labeled proteins, several works have adopted semi-supervised learning for protein sequence modeling. They pre-train a model with a substantial amount of unlabeled data and transfer the learned representations to various downstream tasks. Nonetheless, the current pre-training methods mostly rely on a language modeling task and often show limited performance. Therefore, a complementary protein-specific task for pre-training is necessary to better capture the information contained within unlabeled protein sequences. Results: In this paper, we introduce a novel pre-training scheme called PLUS, which stands for Protein sequence representations Learned Using Structural information. PLUS consists of masked language modeling and a complementary protein-specific pre-training task, namely same family prediction. PLUS can be used to pre-train various model architectures. In this work, we mainly use PLUS to pre-train a recurrent neural network (RNN) and refer to the resulting model as PLUS-RNN. It advances state-of-the-art pre-training methods on six out of seven tasks, i.e., (1) three protein(-pair)-level classification, (2) two protein-level regression, and (3) two amino-acid-level classification tasks. Furthermore, we present results from our ablation studies and interpretation analyses to better understand the strengths of PLUS-RNN. Availability: The code and pre-trained models are available at https://github.com/mswzeus/PLUS/
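As a rough illustration of the masked language modeling task mentioned above, the sketch below hides a fraction of residues and records the originals as prediction targets. It does not use the PLUS code base; the mask symbol, the 15% masking fraction, and the function name are assumptions made for this example.

```python
import random

MASK = "#"  # hypothetical mask symbol, chosen because it is not an amino-acid letter


def mask_sequence(sequence: str, mask_fraction: float = 0.15, seed: int = 0):
    """Return (masked_sequence, targets), where targets maps position -> original residue.

    mask_fraction = 0.15 is an assumed default, not a value taken from the abstract.
    """
    if not sequence:
        return "", {}
    rng = random.Random(seed)
    # Mask at least one residue so that short sequences still yield a target.
    n_mask = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    masked = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = sequence[pos]
        masked[pos] = MASK
    return "".join(masked), targets
```

A model trained on this task receives the masked string and is scored on how well it recovers the residues stored in `targets`.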

Biomolecules Machine Learning Genomics

Network and Sequence-Based Prediction of Protein-Protein Interactions

379 - Luca Becchetti , Adriano Fazzone , Leonardo Martini 2021

Background: Typically, proteins perform key biological functions by interacting with each other. As a consequence, predicting which protein pairs interact is a fundamental problem. Experimental methods are slow, expensive, and may be error-prone. Many computational methods have been proposed to identify candidate interacting pairs. When accurate, they can serve as an inexpensive, preliminary filtering stage, to be followed by downstream experimental validation. Among such methods, sequence-based ones are very promising. Results: We present MPS(T&B) (Maximum Protein Similarity Topological and Biological), a new algorithm that leverages both topological and biological information to predict protein-protein interactions. We comprehensively compare MPS(T) and MPS(T&B) with state-of-the-art approaches on reliable PPI datasets, showing that they have competitive or higher accuracy on biologically validated test sets. Conclusion: MPS(T) and MPS(T&B) are, respectively, topological-only and topological-plus-sequence-based computational methods that can effectively predict the entire human interactome.

Biomolecules Other Computer Science

A Conformational Search Method for Protein Systems Using Genetic Crossover and Metropolis Criterion

920 - Yoshitake Sakae 2015

Many proteins carry out their biological functions by forming characteristic tertiary structures. Therefore, the search for the stable states of proteins by molecular simulations is important for understanding their functions and stabilities. However, finding the stable state by conformational search is difficult, because the energy landscape of the system is characterized by many local minima separated by high energy barriers. To overcome this difficulty, various sampling and optimization methods for protein conformations have been proposed. In this study, we propose a new conformational search method for proteins that uses genetic crossover and the Metropolis criterion. We applied this method to an $\alpha$-helical protein. The conformations obtained from the simulations are in good agreement with the experimental results.
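The combination of genetic crossover and the Metropolis criterion can be sketched on a toy problem. This is not the author's method or force field: the energy function, the one-point crossover, the representation of a conformation as a list of angles, and every parameter value are assumptions chosen only to keep the example small.

```python
import math
import random


def toy_energy(angles):
    """Hypothetical rugged energy with many local minima; not a real force field."""
    return sum(math.cos(3 * a) + 0.5 * math.cos(a) for a in angles)


def crossover(parent_a, parent_b, rng):
    """One-point crossover: prefix from parent_a, suffix from parent_b (needs length >= 2)."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


def metropolis_accept(delta_e, temperature, rng):
    """Accept downhill moves always, uphill moves with probability exp(-dE / T)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def search(n_angles=8, pop_size=6, steps=200, temperature=0.5, seed=0):
    """Evolve a small population of angle vectors and return the lowest-energy one."""
    rng = random.Random(seed)
    population = [
        [rng.uniform(-math.pi, math.pi) for _ in range(n_angles)]
        for _ in range(pop_size)
    ]
    for _ in range(steps):
        i, j = rng.sample(range(pop_size), 2)
        child = crossover(population[i], population[j], rng)
        delta = toy_energy(child) - toy_energy(population[i])
        # The child replaces its first parent only if the Metropolis test passes.
        if metropolis_accept(delta, temperature, rng):
            population[i] = child
    return min(population, key=toy_energy)
```

Because this sketch uses crossover alone, the population loses diversity over time; a practical search would interleave it with local moves such as molecular dynamics or Monte Carlo steps.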

Biomolecules Statistical Mechanics Biological Physics

PDBench: Evaluating Computational Methods for Protein Sequence Design

560 - Leonardo V. Castorina , Rokas Petrenas , Katric Subr 2021

Proteins perform critical processes in all living systems: converting solar energy into chemical energy, replicating DNA, forming the basis of high-performance materials, sensing, and much more. While an incredible range of functionality has been sampled in nature, it accounts for a tiny fraction of the possible protein universe. If we could tap into this pool of unexplored protein structures, we could search for novel proteins with useful properties that we could apply to tackle the environmental and medical challenges facing humanity. This is the purpose of protein design. Sequence design is an important aspect of protein design, and many successful methods to do this have been developed. Recently, deep-learning methods that frame it as a classification problem have emerged as a powerful approach. Beyond their reported improvement in performance, their primary advantage over physics-based methods is that the computational burden is shifted from the user to the developers, thereby increasing accessibility to the design method. Despite this trend, the tools for assessment and comparison of such models remain quite generic. The goal of this paper is both to address the timely problem of evaluation and to shine a spotlight, within the Machine Learning community, on specific assessment criteria that will accelerate impact. We present a carefully curated benchmark set of proteins and propose a number of standard tests to assess the performance of deep-learning-based methods. Our robust benchmark provides biological insight into the behaviour of design methods, which is essential for evaluating their performance and utility. We compare five existing models with two novel models for sequence prediction. Finally, we test the designs produced by these models with AlphaFold2, a state-of-the-art structure-prediction algorithm, to determine if they are likely to fold into the intended 3D shapes.

Biomolecules Machine Learning

PersGNN: Applying Topological Data Analysis and Geometric Deep Learning to Structure-Based Protein Function Prediction

280 - Nicolas Swenson , Aditi S. Krishnapriyan , Aydin Buluc 2020

Understanding protein structure-function relationships is a key challenge in computational biology, with applications across the biotechnology and pharmaceutical industries. While it is known that protein structure directly impacts protein function, many functional prediction tasks use only protein sequence. In this work, we isolate protein structure to make functional annotations for proteins in the Protein Data Bank in order to study the expressiveness of different structure-based prediction schemes. We present PersGNN - an end-to-end trainable deep learning model that combines graph representation learning with topological data analysis to capture a complex set of both local and global structural features. While variations of these techniques have been successfully applied to proteins before, we demonstrate that our hybridized approach, PersGNN, outperforms either method on its own as well as a baseline neural network that learns from the same information. PersGNN achieves a 9.3% boost in area under the precision recall curve (AUPR) compared to the best individual model, as well as high F1 scores across different gene ontology categories, indicating the transferability of this approach.
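For readers unfamiliar with the AUPR metric quoted above, the sketch below computes average precision, one common estimator of the area under the precision-recall curve. Whether the paper uses this estimator or another is not stated in the abstract, so treat the choice, and the function name, as assumptions.

```python
def average_precision(labels, scores):
    """Average precision for binary labels (1 = positive) ranked by descending score.

    Tied scores are kept in input order; no special tie handling is applied.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            # Precision at this rank, counted once per recovered positive.
            total += hits / rank
    return total / n_pos
```

A perfect ranking, with every positive scored above every negative, yields 1.0; pushing positives lower in the ranking lowers the value.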

Biomolecules Machine Learning Algebraic Topology