No Arabic abstract
Specific protein-protein interactions are crucial in the cell, both to ensure the formation and stability of multi-protein complexes, and to enable signal transduction in various pathways. Functional interactions between proteins result in coevolution between the interaction partners, causing their sequences to be correlated. Here we exploit these correlations to accurately identify which proteins are specific interaction partners from sequence data alone. Our general approach, which employs a pairwise maximum entropy model to infer couplings between residues, has been successfully used to predict the three-dimensional structures of proteins from sequences. Thus inspired, we introduce an iterative algorithm to predict specific interaction partners from two protein families whose members are known to interact. We first assess the algorithms performance on histidine kinases and response regulators from bacterial two-component signaling systems. We obtain a striking 0.93 true positive fraction on our complete dataset without any a priori knowledge of interaction partners, and we uncover the origin of this success. We then apply the algorithm to proteins from ATP-binding cassette (ABC) transporter complexes, and obtain accurate predictions in these systems as well. Finally, we present two metrics that accurately distinguish interacting protein families from non-interacting ones, using only sequence data.
Functional protein-protein interactions are crucial in most cellular processes. They enable multi-protein complexes to assemble and to remain stable, and they allow signal transduction in various pathways. Functional interactions between proteins result in coevolution between the interacting partners, and thus in correlations between their sequences. Pairwise maximum-entropy based models have enabled successful inference of pairs of amino-acid residues that are in contact in the three-dimensional structure of multi-protein complexes, starting from the correlations in the sequence data of known interaction partners. Recently, algorithms inspired by these methods have been developed to identify which proteins are functional interaction partners among the paralogous proteins of two families, starting from sequence data alone. Here, we demonstrate that a slightly higher performance for partner identification can be reached by an approximate maximization of the mutual information between the sequence alignments of the two protein families. Our mutual information-based method also provides signatures of the existence of interactions between protein families. These results stand in contrast with structure prediction of proteins and of multi-protein complexes from sequence data, where pairwise maximum-entropy based global statistical models substantially improve performance compared to mutual information. Our findings entail that the statistical dependences allowing interaction partner prediction from sequence data are not restricted to the residue pairs that are in direct contact at the interface between the partner proteins.
In spite of decades of research, much remains to be discovered about folding: the detailed structure of the initial (unfolded) state, vestigial folding instructions remaining only in the unfolded state, the interaction of the molecule with the solvent, instantaneous power at each point within the molecule during folding, the fact that the process is stable in spite of myriad possible disturbances, potential stabilization of trajectory by chaos, and, of course, the exact physical mechanism (code or instructions) by which the folding process is specified in the amino acid sequence. Simulations based upon microscopic physics have had some spectacular successes and continue to improve, particularly as super-computer capabilities increase. The simulations, exciting as they are, are still too slow and expensive to deal with the enormous number of molecules of interest. In this paper, we introduce an approximate model based upon physics, empirics, and information science which is proposed for use in machine learning applications in which very large numbers of sub-simulations must be made. In particular, we focus upon machine learning applications in the learning phase and argue that our model is sufficiently close to the physics that, in spite of its approximate nature, can facilitate stepping through machine learning solutions to explore the mechanics of folding mentioned above. We particularly emphasize the exploration of energy flow (power) within the molecule during folding, the possibility of energy scale invariance (above a threshold), vestigial information in the unfolded state as attractive targets for such machine language analysis, and statistical analysis of an ensemble of folding micro-steps.
A model to describe the mechanism of conformational dynamics in secondary protein based on matter interactions is proposed. The approach deploys the lagrangian method by imposing certain symmetry breaking. The protein backbone is initially assumed to be nonlinear and represented by the Sine-Gordon equation, while the nonlinear external bosonic sources is represented by $phi^4$ interaction. It is argued that the nonlinear source induces the folding pathway in a different way than the previous work with initially linear backbone. Also, the nonlinearity of protein backbone decreases the folding speed.
Determining which proteins interact together is crucial to a systems-level understanding of the cell. Recently, algorithms based on Direct Coupling Analysis (DCA) pairwise maximum-entropy models have allowed to identify interaction partners among paralogous proteins from sequence data. This success of DCA at predicting protein-protein interactions could be mainly based on its known ability to identify pairs of residues that are in contact in the three-dimensional structure of protein complexes and that coevolve to remain physicochemically complementary. However, interacting proteins possess similar evolutionary histories. What is the role of purely phylogenetic correlations in the performance of DCA-based methods to infer interaction partners? To address this question, we employ controlled synthetic data that only involve phylogeny and no interactions or contacts. We find that DCA accurately identifies the pairs of synthetic sequences that share evolutionary history. While phylogenetic correlations confound the identification of contacting residues by DCA, they are thus useful to predict interacting partners among paralogs. We find that DCA performs as well as phylogenetic methods to this end, and slightly better than them with large and accurate training sets. Employing DCA or phylogenetic methods within an Iterative Pairing Algorithm (IPA) allows to predict pairs of evolutionary partners without a training set. We demonstrate the ability of these various methods to correctly predict pairings among real paralogous proteins with genome proximity but no known physical interaction, illustrating the importance of phylogenetic correlations in natural data. However, for physically interacting and strongly coevolving proteins, DCA and mutual information outperform phylogenetic methods. We discuss how to distinguish physically interacting proteins from those only sharing evolutionary history.
Processes that proceed reliably from a variety of initial conditions to a unique final form, regardless of moderately changing conditions, are of obvious importance in biophysics. Protein folding is a case in point. We show that the action principle can be applied directly to study the stability of biological processes. The action principle in classical physics starts with the first variation of the action and leads immediately to the equations of motion. The second variation of the action leads in a natural way to powerful theorems that provide quantitative treatment of stability and focusing and also explain how some very complex processes can behave as though some seemingly important forces drop out. We first apply these ideas to the non-equilibrium states involved in two-state folding. We treat torsional waves and use the action principle to talk about critical points in the dynamics. For some proteins the theory resembles TST. We reach several quantitative and qualitative conclusions. Besides giving an explanation of why TST often works in folding, we find that the apparent smoothness of the energy funnel is a natural consequence of the putative critical points in the dynamics. These ideas also explain why biological proteins fold to unique states and random polymers do not. The insensitivity to perturbations which follows from the presence of critical points explains how folding to a unique shape occurs in the presence of dilute denaturing agents in spite of the fact that those agents disrupt the folded structure of the native state. This paper contributes to the theoretical armamentarium by directing attention to the logical progression from first physical principles to the stability theorems related to catastrophe theory as applied to folding. This can potentially have the same success in biophysics as it has enjoyed in optics.