Identifying protein-protein interactions is crucial for a systems-level understanding of the cell. Recently, algorithms based on inverse statistical physics, e.g. Direct Coupling Analysis (DCA), have made it possible to use evolutionarily related sequences to address two conceptually related inference tasks: finding pairs of interacting proteins, and identifying pairs of residues that form contacts between interacting proteins. Here we address two underlying questions: How are the performances of both inference tasks related? How does performance depend on dataset size and quality? To this end, we formalize both tasks using Ising models defined over stochastic block models, with individual blocks representing single proteins and inter-block couplings representing protein-protein interactions; controlled synthetic sequence data are generated by Monte Carlo simulations. We show that DCA is able to address both inference tasks accurately when sufficiently large training sets are available, and that an iterative pairing algorithm (IPA) makes predictions possible even without a training set. Noise in the training data degrades performance. In both tasks we find a quadratic scaling relating dataset quality and size that is consistent with noise adding in square-root fashion and signal adding linearly as the dataset grows. This implies that it is generally beneficial to incorporate more data even if its quality is imperfect, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.
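The data-generation setup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two-block layout, the edge probabilities `p_intra` and `p_inter`, the uniform coupling strength, and the unit temperature are all assumptions made for illustration. It draws a symmetric coupling matrix from a two-block stochastic block model and samples spin configurations with single-spin-flip Metropolis Monte Carlo.

```python
import math
import random


def make_block_couplings(n_per_block=5, p_intra=0.8, p_inter=0.2,
                         coupling=0.5, seed=0):
    """Draw a symmetric coupling matrix from a two-block stochastic block model.

    Hypothetical parameters: an edge is present with probability p_intra
    inside a block and p_inter between blocks; all edges share one strength.
    """
    rng = random.Random(seed)
    n = 2 * n_per_block
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same_block = (i < n_per_block) == (j < n_per_block)
            p_edge = p_intra if same_block else p_inter
            if rng.random() < p_edge:
                J[i][j] = J[j][i] = coupling
    return J


def metropolis_sample(J, n_samples=50, n_sweeps=10, seed=1):
    """Sample +/-1 spin configurations for E = -sum_{i<j} J_ij s_i s_j at T = 1."""
    rng = random.Random(seed)
    n = len(J)
    spins = [rng.choice((-1, 1)) for _ in range(n)]
    samples = []
    for _ in range(n_samples):
        # Run n_sweeps sweeps (n attempted flips each) between recorded samples.
        for _ in range(n_sweeps * n):
            i = rng.randrange(n)
            local_field = sum(J[i][j] * spins[j] for j in range(n))
            delta_e = 2.0 * spins[i] * local_field  # energy change if spin i flips
            if delta_e <= 0 or rng.random() < math.exp(-delta_e):
                spins[i] = -spins[i]
        samples.append(list(spins))
    return samples
```

Each sampled configuration plays the role of one synthetic "sequence", with the first `n_per_block` spins standing for one protein and the remaining spins for its partner.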
Determining which proteins interact is crucial to a systems-level understanding of the cell. Recently, algorithms based on Direct Coupling Analysis (DCA) pairwise maximum-entropy models have made it possible to identify interaction partners among paralogous proteins from sequence data. This success of DCA at predicting protein-protein interactions could be mainly based on its known ability to identify pairs of residues that are in contact in the three-dimensional structure of protein complexes and that coevolve to remain physicochemically complementary. However, interacting proteins also possess similar evolutionary histories. What is the role of purely phylogenetic correlations in the performance of DCA-based methods to infer interaction partners? To address this question, we employ controlled synthetic data that only involve phylogeny and no interactions or contacts. We find that DCA accurately identifies the pairs of synthetic sequences that share evolutionary history. While phylogenetic correlations confound the identification of contacting residues by DCA, they are thus useful to predict interacting partners among paralogs. We find that DCA performs as well as phylogenetic methods to this end, and slightly better than them with large and accurate training sets. Employing DCA or phylogenetic methods within an Iterative Pairing Algorithm (IPA) makes it possible to predict pairs of evolutionary partners without a training set. We demonstrate the ability of these various methods to correctly predict pairings among real paralogous proteins with genome proximity but no known physical interaction, illustrating the importance of phylogenetic correlations in natural data. However, for physically interacting and strongly coevolving proteins, DCA and mutual information outperform phylogenetic methods. We discuss how to distinguish physically interacting proteins from those only sharing evolutionary history.
Inverse statistical approaches to determine protein structure and function from Multiple Sequence Alignments (MSAs) are emerging as powerful tools in computational biology. However, the underlying assumptions about the relationship between the inferred effective Potts Hamiltonian and real protein structure and energetics have so far remained untested. Here we use a lattice protein (LP) model to benchmark those inverse statistical approaches. We build MSAs of highly stable sequences in target LP structures, and infer the effective pairwise Potts Hamiltonians from those MSAs. We find that inferred Potts Hamiltonians reproduce many important aspects of true LP structures and energetics. Careful analysis reveals that effective pairwise couplings in inferred Potts Hamiltonians depend not only on the energetics of the native structure but also on competing folds; in particular, the coupling values reflect both positive design (stabilization of the native conformation) and negative design (destabilization of competing folds). In addition to providing detailed structural information, the inferred Potts models, when used as protein Hamiltonians for the design of new sequences, are able to generate with high probability completely new sequences with the desired folds, which is not possible using independent-site models. These results are remarkable because the effective LP Hamiltonians used to generate the MSAs are not simple pairwise models, owing to the competition between folds. Our findings elucidate the reasons for the success of inverse approaches to the modelling of proteins from sequence data, and their limitations.
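For readers unfamiliar with the pairwise Potts form referred to above, the following minimal sketch evaluates the energy of one sequence under such a model. It is illustrative only: the function name, the nested-list layout of `fields` and `couplings`, and the sign convention (lower energy means more favourable) are assumptions, not details taken from the abstract.

```python
def potts_energy(seq, fields, couplings):
    """Energy of a sequence under a pairwise Potts model.

    Assumed convention: E = -sum_i h_i(a_i) - sum_{i<j} J_ij(a_i, a_j).
    seq       : list of integer-encoded residues, one per site
    fields    : fields[i][a] is the local field h_i(a)
    couplings : couplings[i][j][a][b] is J_ij(a, b), used only for i < j
    """
    length = len(seq)
    energy = 0.0
    for i in range(length):
        energy -= fields[i][seq[i]]
        for j in range(i + 1, length):
            energy -= couplings[i][j][seq[i]][seq[j]]
    return energy
```

An independent-site model corresponds to setting every coupling to zero, so that only the `fields` terms contribute.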
Problems of search and recognition appear over different scales in biological systems. In this review we focus on the challenges posed by interactions between proteins, in particular transcription factors, and DNA and possible mechanisms which allow for a fast and selective target location. Initially we argue that DNA-binding proteins can be classified, broadly, into three distinct classes which we illustrate using experimental data. Each class calls for a different search process and we discuss the possible application of different search mechanisms proposed over the years to each class. The main thrust of this review is a new mechanism which is based on barrier discrimination. We introduce the model and analyze in detail its consequences. It is shown that this mechanism applies to all classes of transcription factors and can lead to a fast and specific search. Moreover, it is shown that the mechanism has interesting transient features which allow for stability at the target despite rapid binding and unbinding of the transcription factor from the target.
Intrinsically disordered proteins (IDPs) do not possess well-defined three-dimensional structures in solution under physiological conditions. We develop all-atom, united-atom, and coarse-grained Langevin dynamics simulations for the IDP alpha-synuclein that include geometric, attractive hydrophobic, and screened electrostatic interactions and are calibrated to the inter-residue separations measured in recent smFRET experiments. We find that alpha-synuclein is disordered with conformational statistics that are intermediate between random walk and collapsed globule behavior. An advantage of calibrated molecular simulations over constraint methods is that physical forces act on all residues, not only on residue pairs that are monitored experimentally, and these simulations can be used to study oligomerization and aggregation of multiple alpha-synuclein proteins that may precede amyloid formation.
Experiments indicate that unbinding rates of proteins from DNA can depend on the concentration of proteins in nearby solution. Here we present a theory of multi-step replacement of DNA-bound proteins by solution-phase proteins. For four different kinetic scenarios we calculate the dependence of protein unbinding and replacement rates on solution protein concentration. We find (1) strong effects of progressive rezipping of the solution-phase protein onto DNA sites liberated by unzipping of the originally bound protein; (2) that a model in which solution-phase proteins bind non-specifically to DNA can describe experiments on exchanges between the non-specific DNA-binding proteins Fis-Fis and Fis-HU; (3) that a specific-binding model describes experiments on the exchange of CueR proteins on specific binding sites.