Proteins are essential components of living systems, capable of performing a huge variety of tasks at the molecular level, such as recognition, signalling, copying, and transport. The protein sequences realizing a given function may vary widely across organisms, giving rise to a protein family. Here, we estimate the entropy of those families based on different approaches, including Hidden Markov Models used for protein databases and inferred statistical models reproducing the low-order (1- and 2-point) statistics of multi-sequence alignments. We also compute the entropic cost, that is, the loss in entropy resulting from a constraint acting on the protein, such as the fixation of one particular amino acid on a specific site, and relate this notion to the escape probability of HIV. The case of lattice proteins, for which the entropy can be computed exactly, allows us to provide another illustration of the concept of cost, due to the competition of different folds. The relevance of the entropy in relation to directed evolution experiments is stressed.
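As a rough illustration of the entropic cost described above, the sketch below assumes the simplest possible model, in which sites are independent, so that the family entropy is the sum of per-column Shannon entropies and fixing one site costs exactly that site's entropy. This is a deliberate simplification: the Hidden Markov and pairwise models mentioned in the abstract include correlations, under which the cost would also depend on which amino acid is fixed. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def site_entropies(alignment):
    """Shannon entropy (in nats) of each column of a toy alignment."""
    n_seq = len(alignment)
    length = len(alignment[0])
    entropies = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        s = -sum((c / n_seq) * math.log(c / n_seq) for c in counts.values())
        entropies.append(s)
    return entropies


def entropic_cost_of_fixing(alignment, site):
    """Entropy lost when `site` is frozen to a single amino acid.

    Assumption: independent sites, so S = sum_i S_i and the cost is S_site.
    """
    return site_entropies(alignment)[site]


# Hypothetical four-sequence alignment, for illustration only.
toy_alignment = ["ACD", "ACE", "GCD", "GCE"]
entropies = site_entropies(toy_alignment)
total_entropy = sum(entropies)
cost_site0 = entropic_cost_of_fixing(toy_alignment, 0)
```

In this toy case the fully conserved middle column contributes no entropy, so constraining it is free, while constraining either variable column removes half of the total.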
Boltzmann machines (BM) are widely used as generative models. For example, pairwise Potts models (PM), which are instances of the BM class, provide accurate statistical models of families of evolutionarily related protein sequences. Their parameters are the local fields, which describe site-specific patterns of amino-acid conservation, and the two-site couplings, which mirror the coevolution between pairs of sites. This coevolution reflects structural and functional constraints acting on protein sequences during evolution. The most conservative choice to describe the coevolution signal is to include all possible two-site couplings into the PM. This choice, typical of what is known as Direct Coupling Analysis, has been successful for predicting residue contacts in the three-dimensional structure, mutational effects, and in generating new functional sequences. However, the resulting PM suffers from important over-fitting effects: many couplings are small, noisy and hardly interpretable; the PM is close to a critical point, meaning that it is highly sensitive to small parameter perturbations. In this work, we introduce a general parameter-reduction procedure for BMs, via a controlled iterative decimation of the less statistically significant couplings, identified by an information-based criterion that selects either weak or statistically unsupported couplings. For several protein families, our procedure allows one to remove more than $90\%$ of the PM couplings, while preserving the predictive and generative properties of the original dense PM, and the resulting model is far away from criticality, hence more robust to noise.
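To make the objects in this abstract concrete, the following minimal sketch evaluates a Potts energy of the usual form $E(a_1,\dots,a_L) = -\sum_i h_i(a_i) - \sum_{i<j} J_{ij}(a_i,a_j)$ and performs a single decimation step. Note the assumption: couplings are ranked here by absolute value only, as a stand-in for the information-based criterion of the abstract, which is not specified in enough detail to reproduce. The data layout, function names, and parameter values are all hypothetical.

```python
def potts_energy(seq, fields, couplings):
    """Potts energy: minus the sum of matching fields and couplings.

    fields[i] maps amino acid -> h_i(a);
    couplings maps (i, j, a, b) -> J_ij(a, b), with i < j.
    """
    energy = -sum(fields[i].get(a, 0.0) for i, a in enumerate(seq))
    for (i, j, a, b), value in couplings.items():
        if seq[i] == a and seq[j] == b:
            energy -= value
    return energy


def decimate_weakest(couplings, fraction):
    """Drop the given fraction of coupling entries with the smallest |J|.

    Assumption: magnitude ranking replaces the paper's information-based
    criterion, which this sketch does not implement.
    """
    n_remove = int(fraction * len(couplings))
    ranked = sorted(couplings, key=lambda k: abs(couplings[k]))
    to_remove = set(ranked[:n_remove])
    return {k: v for k, v in couplings.items() if k not in to_remove}


# Hypothetical parameters for a three-site, two-letter toy model.
fields = [{"A": 0.5, "C": -0.5}, {"A": 0.0, "C": 0.2}, {"A": 0.1, "C": 0.0}]
couplings = {
    (0, 1, "A", "C"): 1.2,
    (0, 2, "A", "A"): 0.01,
    (1, 2, "C", "A"): -0.8,
    (0, 1, "C", "C"): 0.02,
}
sparse = decimate_weakest(couplings, 0.5)
e_dense = potts_energy("ACA", fields, couplings)
e_sparse = potts_energy("ACA", fields, sparse)
```

In an iterative scheme of the kind the abstract describes, one would presumably re-fit the remaining parameters after each decimation step before removing more couplings; that re-fitting is omitted here.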
In the present work, we review the fundamental methods developed in the last few years for classifying the distribution of amino acids in protein databases into families and clans. This is done through functions of random variables, the Entropy Measures of the probabilities of occurrence of the amino acids. An intensive study of the Pfam databases is presented, restricted to families that can be represented by rectangular arrays of amino acids with m rows (protein domains) and n columns (amino acids). This work is also an invitation to research groups worldwide to undertake the statistical analysis with different numbers of rows and columns, since we believe that the mathematical characterization of the distribution of amino acids offers fundamental insight into the determination of protein structure and evolution.
Models of protein energetics which neglect interactions between amino acids that are not adjacent in the native state, such as the Go model, encode or underlie many influential ideas on protein folding. Implicit in this simplification is a crucial assumption that has never been critically evaluated in a broad context: Detailed mechanisms of protein folding are not biased by non-native contacts, typically imagined as a consequence of sequence design and/or topology. Here we present, using computer simulations of a well-studied lattice heteropolymer model, the first systematic test of this oft-assumed correspondence over the statistically significant range of hundreds of thousands of amino acid sequences, and a concomitantly diverse set of folding pathways. Enabled by a novel means of fingerprinting folding trajectories, our study reveals a profound insensitivity of the order in which native contacts accumulate to the omission of non-native interactions. Contrary to conventional thinking, this robustness does not arise from topological restrictions and does not depend on folding rate. We find instead that the crucial factor in discriminating among topological pathways is the heterogeneity of native contact energies. Our results challenge conventional thinking on the relationship between sequence design and free energy landscapes for protein folding, and help justify the widespread use of Go-like models to scrutinize detailed folding mechanisms of real proteins.
We combined genetic crossover, one of the operations of the genetic algorithm, with the replica-exchange method in parallel molecular dynamics simulations. Together, genetic crossover and the replica-exchange method can search the global conformational space by exchanging the corresponding parts between a pair of conformations of a protein. In this study, we applied this method to an $\alpha$-helical protein, the Trp-cage mini-protein, which has 20 amino-acid residues. The conformations obtained from the simulations are in good agreement with the experimental results.
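The two ingredients named above can be sketched in a few lines. The first function is the standard Metropolis criterion for swapping two replicas at inverse temperatures $\beta_i$ and $\beta_j$, with acceptance probability $\min\{1, \exp[(\beta_i - \beta_j)(E_i - E_j)]\}$. The second is a plain segment crossover on a list of per-residue values. Representing a conformation as a list of dihedral angles, and the function names themselves, are assumptions for illustration; the abstract does not say how the exchanged parts are encoded or relaxed afterwards.

```python
import math


def swap_acceptance(beta_i, beta_j, energy_i, energy_j):
    """Metropolis probability of exchanging replicas i and j."""
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0 else math.exp(delta)


def crossover(parent_a, parent_b, start, end):
    """Swap the segment [start, end) between two conformations.

    Each parent is assumed to be a list of per-residue values
    (for example dihedral angles) of equal length.
    """
    child_a = parent_a[:start] + parent_b[start:end] + parent_a[end:]
    child_b = parent_b[:start] + parent_a[start:end] + parent_b[end:]
    return child_a, child_b


# The colder replica (beta = 1.0) holds the higher energy: always swap.
p_favourable = swap_acceptance(1.0, 0.5, -10.0, -12.0)
# The colder replica already holds the lower energy: swap with exp(-1).
p_unfavourable = swap_acceptance(1.0, 0.5, -12.0, -10.0)

children = crossover([1, 2, 3, 4], [5, 6, 7, 8], 1, 3)
```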
Determining which proteins interact together is crucial to a systems-level understanding of the cell. Recently, algorithms based on Direct Coupling Analysis (DCA) pairwise maximum-entropy models have made it possible to identify interaction partners among paralogous proteins from sequence data. This success of DCA at predicting protein-protein interactions could be mainly based on its known ability to identify pairs of residues that are in contact in the three-dimensional structure of protein complexes and that coevolve to remain physicochemically complementary. However, interacting proteins possess similar evolutionary histories. What is the role of purely phylogenetic correlations in the performance of DCA-based methods to infer interaction partners? To address this question, we employ controlled synthetic data that only involve phylogeny and no interactions or contacts. We find that DCA accurately identifies the pairs of synthetic sequences that share evolutionary history. While phylogenetic correlations confound the identification of contacting residues by DCA, they are thus useful to predict interacting partners among paralogs. We find that DCA performs as well as phylogenetic methods to this end, and slightly better than them with large and accurate training sets. Employing DCA or phylogenetic methods within an Iterative Pairing Algorithm (IPA) makes it possible to predict pairs of evolutionary partners without a training set. We demonstrate the ability of these various methods to correctly predict pairings among real paralogous proteins with genome proximity but no known physical interaction, illustrating the importance of phylogenetic correlations in natural data. However, for physically interacting and strongly coevolving proteins, DCA and mutual information outperform phylogenetic methods. We discuss how to distinguish physically interacting proteins from those only sharing evolutionary history.
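The mutual information mentioned above can be illustrated for a single pair of alignment columns, one taken from each protein family, using empirical frequencies: $I = \sum_{x,y} p(x,y) \ln \frac{p(x,y)}{p(x)\,p(y)}$. This is only a minimal sketch: it uses raw counts without the pseudocounts, sequence reweighting, or summation over column pairs that a practical pairing score would need, and the function name and toy columns are hypothetical.

```python
import math
from collections import Counter


def mutual_information(column_x, column_y):
    """Empirical mutual information (in nats) between two columns.

    column_x[k] and column_y[k] are the residues observed at the two
    sites in the k-th paired sequence.
    """
    n = len(column_x)
    count_x = Counter(column_x)
    count_y = Counter(column_y)
    count_xy = Counter(zip(column_x, column_y))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi


# Perfectly covarying toy columns versus statistically independent ones.
mi_coupled = mutual_information("AAGG", "CCTT")
mi_independent = mutual_information("AAGG", "CTCT")
```

Note that a high value by itself does not distinguish physical coevolution from shared phylogeny, which is precisely the confounding effect the abstract examines.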