No Arabic abstract
Functional protein-protein interactions are crucial in most cellular processes. They enable multi-protein complexes to assemble and to remain stable, and they allow signal transduction in various pathways. Functional interactions between proteins result in coevolution between the interacting partners, and thus in correlations between their sequences. Pairwise maximum-entropy based models have enabled successful inference of pairs of amino-acid residues that are in contact in the three-dimensional structure of multi-protein complexes, starting from the correlations in the sequence data of known interaction partners. Recently, algorithms inspired by these methods have been developed to identify which proteins are functional interaction partners among the paralogous proteins of two families, starting from sequence data alone. Here, we demonstrate that a slightly higher performance for partner identification can be reached by an approximate maximization of the mutual information between the sequence alignments of the two protein families. Our mutual information-based method also provides signatures of the existence of interactions between protein families. These results stand in contrast with structure prediction of proteins and of multi-protein complexes from sequence data, where pairwise maximum-entropy based global statistical models substantially improve performance compared to mutual information. Our findings entail that the statistical dependences allowing interaction partner prediction from sequence data are not restricted to the residue pairs that are in direct contact at the interface between the partner proteins.
Specific protein-protein interactions are crucial in the cell, both to ensure the formation and stability of multi-protein complexes, and to enable signal transduction in various pathways. Functional interactions between proteins result in coevolution between the interaction partners, causing their sequences to be correlated. Here we exploit these correlations to accurately identify which proteins are specific interaction partners from sequence data alone. Our general approach, which employs a pairwise maximum entropy model to infer couplings between residues, has been successfully used to predict the three-dimensional structures of proteins from sequences. Thus inspired, we introduce an iterative algorithm to predict specific interaction partners from two protein families whose members are known to interact. We first assess the algorithms performance on histidine kinases and response regulators from bacterial two-component signaling systems. We obtain a striking 0.93 true positive fraction on our complete dataset without any a priori knowledge of interaction partners, and we uncover the origin of this success. We then apply the algorithm to proteins from ATP-binding cassette (ABC) transporter complexes, and obtain accurate predictions in these systems as well. Finally, we present two metrics that accurately distinguish interacting protein families from non-interacting ones, using only sequence data.
Determining which proteins interact together is crucial to a systems-level understanding of the cell. Recently, algorithms based on Direct Coupling Analysis (DCA) pairwise maximum-entropy models have allowed to identify interaction partners among paralogous proteins from sequence data. This success of DCA at predicting protein-protein interactions could be mainly based on its known ability to identify pairs of residues that are in contact in the three-dimensional structure of protein complexes and that coevolve to remain physicochemically complementary. However, interacting proteins possess similar evolutionary histories. What is the role of purely phylogenetic correlations in the performance of DCA-based methods to infer interaction partners? To address this question, we employ controlled synthetic data that only involve phylogeny and no interactions or contacts. We find that DCA accurately identifies the pairs of synthetic sequences that share evolutionary history. While phylogenetic correlations confound the identification of contacting residues by DCA, they are thus useful to predict interacting partners among paralogs. We find that DCA performs as well as phylogenetic methods to this end, and slightly better than them with large and accurate training sets. Employing DCA or phylogenetic methods within an Iterative Pairing Algorithm (IPA) allows to predict pairs of evolutionary partners without a training set. We demonstrate the ability of these various methods to correctly predict pairings among real paralogous proteins with genome proximity but no known physical interaction, illustrating the importance of phylogenetic correlations in natural data. However, for physically interacting and strongly coevolving proteins, DCA and mutual information outperform phylogenetic methods. We discuss how to distinguish physically interacting proteins from those only sharing evolutionary history.
The complementary strands of DNA molecules can be separated when stretched apart by a force; the unzipping signal is correlated to the base content of the sequence but is affected by thermal and instrumental noise. We consider here the ideal case where opening events are known to a very good time resolution (very large bandwidth), and study how the sequence can be reconstructed from the unzipping data. Our approach relies on the use of statistical Bayesian inference and of Viterbi decoding algorithm. Performances are studied numerically on Monte Carlo generated data, and analytically. We show how multiple unzippings of the same molecule may be exploited to improve the quality of the prediction, and calculate analytically the number of required unzippings as a function of the bandwidth, the sequence content, the elasticity parameters of the unzipped strands.
Energy landscape theory describes how a full-length protein can attain its native fold after sampling only a tiny fraction of all possible structures. Although protein folding is now understood to be concomitant with synthesis on the ribosome there have been few attempts to modify energy landscape theory by accounting for cotranslational folding. This paper introduces a model for cotranslational folding that leads to a natural definition of a nested energy landscape. By applying concepts drawn from submanifold differential geometry the dynamics of protein folding on the ribosome can be explored in a quantitative manner and conditions on the nested potential energy landscapes for a good cotranslational folder are obtained. A generalisation of diffusion rate theory using van Kampens technique of composite stochastic processes is then used to account for entropic contributions and the effects of variable translation rates on cotranslational folding. This stochastic approach agrees well with experimental results and Hamiltionian formalism in the deterministic limit.
Network theory-based approaches provide valuable insights into the variations in global structural connectivity between differing dynamical states of proteins. Our objective is to review network-based analyses to elucidate such variations, especially in the context of subtle conformational changes. We present technical details of the construction and analyses of protein structure networks, encompassing both the non-covalent connectivity and dynamics. We examine the selection of optimal criteria for connectivity based on the physical concept of percolation. We highlight the advantages of using side-chain based network metrics in contrast to backbone measurements. As an illustrative example, we apply the described network approach to investigate the global conformational change between the closed and partially open states of the SARS-CoV-2 spike protein. This conformational change in the spike protein is crucial for coronavirus entry and fusion into human cells. Our analysis reveals global structural reorientations between the two states of the spike protein despite small changes between the two states at the backbone level. We also observe some differences at strategic locations in the structures, correlating with their functions, asserting the advantages of the side-chain network analysis. Finally we present a view of allostery as a subtle synergistic-global change between the ligand and the receptor, the incorporation of which would enhance the drug design strategies.