In the present work, we review the fundamental methods developed in the last few years for classifying the distribution of amino acids in protein databases into families and clans. The classification is done through functions of random variables, namely entropy measures of the probabilities of occurrence of the amino acids. We present an intensive study of the Pfam database, restricted to families that can be represented by rectangular arrays of amino acids with m rows (protein domains) and n columns (amino acids). This work is also an invitation to research groups worldwide to undertake the statistical analysis with different numbers of rows and columns, since we believe that the mathematical characterization of the distribution of amino acids provides fundamental insight into the determination of protein structure and evolution.
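As a minimal sketch of the kind of quantity involved, the Shannon entropy of each column of an m x n block can be computed from the observed amino acid frequencies. The function name `column_entropies` and the toy block are illustrative assumptions, not taken from the reviewed work, which may use other entropy measures.

```python
import math
from collections import Counter


def column_entropies(block):
    """Shannon entropy (in bits) of each column of an m x n block.

    `block` is a list of m equal-length strings, one per protein domain;
    each string holds n one-letter amino acid codes.
    """
    if not block:
        return []
    n = len(block[0])
    if any(len(row) != n for row in block):
        raise ValueError("all rows must have the same length")
    m = len(block)
    entropies = []
    for j in range(n):
        # Empirical probability of occurrence of each residue in column j.
        counts = Counter(row[j] for row in block)
        h = sum(-(c / m) * math.log2(c / m) for c in counts.values())
        entropies.append(h)
    return entropies


# Hypothetical toy block: m = 4 domains, n = 4 columns.
toy_block = ["ACDA", "ACEA", "GCDA", "GCEA"]
print(column_entropies(toy_block))
```

A fully conserved column has zero entropy, while a column split evenly between two residues has one bit.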
Proteins are essential components of living systems, capable of performing a huge variety of tasks at the molecular level, such as recognition, signalling, copying, and transport. The protein sequences realizing a given function may vary widely across organisms, giving rise to a protein family. Here, we estimate the entropy of those families based on different approaches, including Hidden Markov Models used for protein databases and inferred statistical models reproducing the low-order (1- and 2-point) statistics of multi-sequence alignments. We also compute the entropic cost, that is, the loss in entropy resulting from a constraint acting on the protein, such as the fixation of one particular amino acid on a specific site, and relate this notion to the escape probability of HIV. The case of lattice proteins, for which the entropy can be computed exactly, allows us to provide another illustration of the concept of cost, due to the competition of different folds. The relevance of the entropy in relation to directed evolution experiments is stressed.
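The notion of entropic cost can be illustrated on a toy example small enough to enumerate. The sketch below assumes a hypothetical, explicitly given probability distribution over two-residue sequences; it is not the inferred model of the work above. It computes the entropy lost when one site is fixed to a given amino acid.

```python
import math


def entropy(dist):
    """Entropy (in nats) of a distribution given as {sequence: probability}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def entropic_cost(dist, site, residue):
    """Entropy lost when `site` is constrained to carry `residue`.

    The constrained distribution is the original one conditioned on the
    constraint, i.e. restricted to compatible sequences and renormalized.
    """
    kept = {seq: p for seq, p in dist.items() if seq[site] == residue}
    z = sum(kept.values())
    if z == 0:
        raise ValueError("no sequence satisfies the constraint")
    conditioned = {seq: p / z for seq, p in kept.items()}
    return entropy(dist) - entropy(conditioned)


# Hypothetical toy family: two sites, two residues, uniform probabilities.
toy = {"AA": 0.25, "AG": 0.25, "GA": 0.25, "GG": 0.25}
cost = entropic_cost(toy, site=0, residue="A")
print(cost)
```

For this uniform toy distribution the cost equals ln 2, the entropy of the site that was fixed; with correlated sites the cost would also reflect the reduced freedom at the other positions.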
The biological function of a protein stems from its 3-dimensional structure, which is thermodynamically determined by the energetics of interatomic forces between its amino acid building blocks (the order of amino acids, known as the sequence, defines a protein). Given the costs (time, money, human resources) of determining protein structures via experimental means such as X-ray crystallography, can we better describe and compare protein 3D structures in a robust and efficient manner, so as to gain meaningful biological insights? We begin by considering a relatively simple problem, limiting ourselves to just protein secondary structural elements. Historically, many computational methods have been devised to classify amino acid residues in a protein chain into one of several discrete secondary structures, of which the most well-characterized are the geometrically regular $\alpha$-helix and $\beta$-sheet; irregular structural patterns, such as turns and loops, are less understood. Here, we present a study of Deep Learning techniques to classify the loop-like end cap structures which delimit $\alpha$-helices. Previous work used highly empirical and heuristic methods to manually classify helix capping motifs. Instead, we use structural data directly, including (i) backbone torsion angles computed from 3D structures, (ii) macromolecular feature sets (e.g., physicochemical properties), and (iii) helix cap classification data (from CAPS-DB), as the ground truth to train a bidirectional long short-term memory (BiLSTM) model to classify helix cap residues. We tried different network architectures and scanned hyperparameters in order to train and assess several models; we also trained a Support Vector Classifier (SVC) to use as a baseline. Ultimately, we achieved 85% class-balanced accuracy with a deep BiLSTM model.
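The backbone torsion angles used as input features are dihedral angles defined by four consecutive backbone atoms. A minimal sketch of that geometric calculation follows, using the standard atan2 formula; the function name `dihedral_degrees` and the test coordinates are illustrative assumptions, and the study itself may have used an existing structural biology library.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four 3D points.

    For a backbone phi angle the points would be C(i-1), N(i), CA(i), C(i);
    for psi they would be N(i), CA(i), C(i), N(i+1).
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal to the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal to the plane through p1, p2, p3
    b2_len = math.sqrt(_dot(b2, b2))
    y = b2_len * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates: planar cis arrangement, then a 90-degree twist.
print(dihedral_degrees((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0)))
print(dihedral_degrees((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1)))
```

Because torsion angles are periodic, a model would typically receive them as sine and cosine pairs rather than raw degrees; that encoding choice is an assumption here, not something the abstract states.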
Background: Typically, proteins perform key biological functions by interacting with each other. As a consequence, predicting which protein pairs interact is a fundamental problem. Experimental methods are slow, expensive, and may be error-prone. Many computational methods have been proposed to identify candidate interacting pairs. When accurate, they can serve as an inexpensive, preliminary filtering stage, to be followed by downstream experimental validation. Among such methods, sequence-based ones are very promising. Results: We present MPS(T&B) (Maximum Protein Similarity Topological and Biological), a new algorithm that leverages both topological and biological information to predict protein-protein interactions (PPIs). We comprehensively compare MPS(T) and MPS(T&B) with state-of-the-art approaches on reliable PPI datasets, showing that they have competitive or higher accuracy on biologically validated test sets. Conclusion: MPS(T) and MPS(T&B) are, respectively, a topology-only and a topology-plus-sequence-based computational method that can effectively predict the entire human interactome.
The intricate three-dimensional geometries of protein tertiary structures underlie protein function and emerge through a folding process from one-dimensional chains of amino acids. The exact spatial sequence and configuration of amino acids, the biochemical environment and the temporal sequence of distinct interactions yield a complex folding process that cannot yet be easily tracked for all proteins. To gain qualitative insights into the fundamental mechanisms behind the folding dynamics and generic features of the folded structure, we propose a simple model of structure formation that takes into account only fundamental geometric constraints and otherwise assumes randomly paired connections. We find that despite its simplicity, the model results in a network ensemble consistent with key overall features of the ensemble of Protein Residue Networks we obtained from more than 1000 biological protein geometries available through the Protein Data Bank. Specifically, the distribution of the number of interaction neighbors of a unit (amino acid), the scaling of the structure's spatial extent with chain length, the eigenvalue spectrum and the scaling of the smallest relaxation time with chain length are all consistent between model and real proteins. These results indicate that geometric constraints alone may already account for a number of generic features of protein tertiary structures.
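One of the compared quantities, the distribution of the number of interaction neighbors per residue, can be sketched as follows. The sketch assumes that a residue network is built by linking residues whose representative atoms lie within a distance cutoff; the 8.0 Å default, the function names and the toy coordinates are illustrative assumptions rather than the construction used in the work above.

```python
import math
from collections import Counter


def residue_network(coords, cutoff=8.0):
    """Build a contact network from one 3D coordinate per residue.

    Two residues are linked when their distance is at most `cutoff`
    (assumed threshold, in the same units as the coordinates).
    Returns a dict mapping residue index to the set of its neighbors.
    """
    n = len(coords)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


def degree_distribution(neighbors):
    """Count how many residues have each number of interaction neighbors."""
    return Counter(len(adjacent) for adjacent in neighbors.values())


# Hypothetical toy chain: four residues on a straight line, 3.8 apart.
toy_coords = [(3.8 * i, 0.0, 0.0) for i in range(4)]
network = residue_network(toy_coords)
print(degree_distribution(network))
```

The same degree histogram could be computed for networks generated by a random model and compared with the histogram obtained from experimental structures.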
We consider multi-chain protein native structures and propose a criterion that determines whether two chains in the system are entangled or not. The criterion is based on the behavior observed when pulling at both termini of each chain, simultaneously in the two chains. We have identified about 900 entangled systems in the Protein Data Bank and provide a more detailed analysis for several of them. We argue that entanglement enhances the thermodynamic stability of the system, but it may also have other functions: burying the hydrophobic residues at the interface and increasing the DNA or RNA binding area. We also study the folding and stretching properties of the knotted dimeric proteins MJ0366, YibK and bacteriophytochrome. These proteins have been studied theoretically in their monomer