The amino acid sequences of proteins provide rich information for inferring distant phylogenetic relationships and for predicting protein functions. Estimating the rate matrix of residue substitutions from amino acid sequences is also important because the rate matrix can be used to develop scoring matrices for sequence alignment. Here we use a continuous-time Markov process to model the substitution rates of residues and develop a Bayesian Markov chain Monte Carlo method for rate estimation. We validate our method using simulated artificial protein sequences. Because different local regions such as binding surfaces and the protein interior core experience different selection pressures due to functional or stability constraints, we use our method to estimate the substitution rates of local regions. Our results show that the substitution rates are very different for residues in the buried core and residues on the solvent-exposed surfaces. In addition, residues on the binding surfaces have very different substitution rates from residues in the rest of the protein. Based on these findings, we further develop a method for protein function prediction by surface matching, using scoring matrices derived from the estimated substitution rates of residues located on the binding surfaces. We show with examples that our method is effective in identifying functionally related proteins that have overall low sequence identity, a task known to be very challenging.
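As a rough illustration of the continuous-time Markov model mentioned above, the sketch below turns a rate matrix Q into substitution probabilities P(t) = exp(Qt). It is not the Bayesian estimation method of the abstract. The 3-letter alphabet, the rate values in `Q`, and the helper names `mat_mul` and `mat_exp` are assumptions made only for this example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, squarings=10, terms=12):
    """Approximate exp(q * t) by scaling and squaring a truncated Taylor series."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term now holds a^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical rate matrix for a toy 3-letter alphabet.
# Off-diagonal entries are substitution rates; each row sums to zero.
Q = [[-0.3, 0.2, 0.1],
     [0.2, -0.5, 0.3],
     [0.1, 0.3, -0.4]]

P1 = mat_exp(Q, 1.0)  # substitution probabilities after one time unit
P2 = mat_exp(Q, 2.0)  # substitution probabilities after two time units
```

Each row of `P1` is a probability distribution over the residue that replaces the starting residue after time t = 1. Log-odds scoring matrices are typically built from such probabilities, although the construction used in the work above is not specified here.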
In this work, we developed an efficient approach to compute ensemble averages in systems with pairwise-additive energetic interactions between the entities. Methods involving full enumeration of the configuration space result in exponential complexity. Sampling methods such as Markov Chain Monte Carlo (MCMC) algorithms have been proposed to tackle the exponential complexity of these problems; however, in certain scenarios where significant energetic coupling exists between the entities, the efficiency of such algorithms can be diminished. We used a strategy to improve the efficiency of MCMC by taking advantage of the cluster structure in the interaction energy matrix to bias the sampling. We pursued two different schemes for the biased MCMC runs and show that they are valid MCMC schemes. We used both synthesized and real-world systems to show the improved performance of our biased MCMC methods compared to the regular MCMC method. In particular, we applied these algorithms to the problem of estimating protonation ensemble averages and titration curves of residues in a protein.
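To make the contrast between full enumeration and sampling concrete, the sketch below estimates mean site occupancies for a small system with pairwise-additive energies, once by enumerating all 2^n states and once with plain single-flip Metropolis MCMC. This corresponds to the regular MCMC baseline only, not to the cluster-biased schemes described above. The binary-state energy form, the values in `h` and `w`, and all function names are assumptions chosen for illustration.

```python
import itertools
import math
import random


def energy(x, h, w):
    """Pairwise-additive energy: site terms plus coupling terms."""
    e = sum(h[i] * x[i] for i in range(len(x)))
    e += sum(wij * x[i] * x[j] for (i, j), wij in w.items())
    return e


def exact_occupancy(h, w, beta=1.0):
    """Boltzmann-weighted mean occupancy per site by full enumeration."""
    n = len(h)
    z = 0.0
    occ = [0.0] * n
    for x in itertools.product((0, 1), repeat=n):
        weight = math.exp(-beta * energy(x, h, w))
        z += weight
        for i in range(n):
            occ[i] += weight * x[i]
    return [o / z for o in occ]


def metropolis_occupancy(h, w, beta=1.0, steps=60000, burn_in=5000, seed=0):
    """Estimate the same averages with single-site-flip Metropolis sampling."""
    rng = random.Random(seed)
    n = len(h)
    x = [0] * n
    e = energy(x, h, w)
    totals = [0] * n
    for step in range(steps):
        i = rng.randrange(n)
        x[i] ^= 1  # propose flipping site i
        e_new = energy(x, h, w)
        if e_new <= e or rng.random() < math.exp(-beta * (e_new - e)):
            e = e_new  # accept the move
        else:
            x[i] ^= 1  # reject: undo the flip
        if step >= burn_in:
            for j in range(n):
                totals[j] += x[j]
    kept = steps - burn_in
    return [t / kept for t in totals]


# Hypothetical 4-site system (energies in units of kT).
h = [-1.0, 0.5, 0.2, -0.3]
w = {(0, 1): 1.5, (1, 2): 0.3, (2, 3): -1.0}

exact = exact_occupancy(h, w)
sampled = metropolis_occupancy(h, w)
```

Enumeration visits 2^n states and becomes infeasible as n grows, while the sampler's cost is set by the number of steps. With strong coupling between sites, single-flip moves mix slowly, which is the situation the biased schemes are meant to address.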
The twenty protein-coding amino acids are found in proteomes with different relative abundances. The most abundant amino acid, leucine, is nearly an order of magnitude more prevalent than the least abundant amino acid, cysteine. Amino acid metabolic costs differ similarly, constraining their incorporation into proteins. On the other hand, sequence diversity is necessary for protein folding, function and evolution. Here we present a simple model for a cost-diversity trade-off postulating that natural proteomes minimize amino acid metabolic flux while maximizing sequence entropy. The model explains the relative abundances of amino acids across a diverse set of proteomes. We found that the data are remarkably well explained when the cost function accounts for amino acid chemical decay. More than one hundred proteomes reach comparable solutions to the trade-off by different combinations of cost and diversity. Quantifying the interplay between proteome size and entropy shows that proteomes can get optimally large and diverse.
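One common way to formalize such a cost-diversity trade-off is to maximize the Shannon entropy of the abundances minus a multiple of the mean cost, which gives abundances proportional to exp(-lam * cost). The sketch below assumes that functional form; the abstract does not state the exact objective. The cost values, the weight `lam`, and the function names are invented for illustration and are not measured data.

```python
import math


def optimal_abundances(costs, lam):
    """Maximizer of entropy - lam * mean_cost: p_i proportional to exp(-lam * c_i)."""
    weights = {aa: math.exp(-lam * c) for aa, c in costs.items()}
    z = sum(weights.values())
    return {aa: wgt / z for aa, wgt in weights.items()}


def objective(p, costs, lam):
    """Shannon entropy of p minus lam times the mean cost under p."""
    entropy = -sum(q * math.log(q) for q in p.values() if q > 0)
    mean_cost = sum(p[aa] * costs[aa] for aa in p)
    return entropy - lam * mean_cost


# Made-up costs for three placeholder amino acid classes (arbitrary units).
costs = {"cheap": 1.0, "medium": 2.0, "costly": 4.0}
lam = 0.8  # hypothetical weight of cost relative to entropy

p_opt = optimal_abundances(costs, lam)

# Moving a little abundance from the cheap to the costly class lowers the objective.
p_shifted = dict(p_opt)
p_shifted["cheap"] -= 0.01
p_shifted["costly"] += 0.01
```

Under this assumed form, cheaper amino acids are more abundant, and `lam` sets how strongly cost is weighted against diversity: `lam = 0` gives uniform abundances, and large `lam` concentrates abundance on the cheapest class.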
The correlations between primary and secondary structure were analyzed using proteins of known structure from the Protein Data Bank. The correlation values between amino acid type and the eight secondary structure types at distant positions were calculated for distances between -25 and 25. The shapes of the diagrams indicate that amino acid polarity and capability for hydrogen bonding influence the secondary structure at some distances. A clear preference of most amino acids for a certain secondary structure type classifies the amino acids into four groups: alpha-helix admirers, strand admirers, turn and bend admirers, and the others. Group four consists of His and Cys, the amino acids that do not show a clear preference for any secondary structure. Amino acids within a group have similar physicochemical properties and the same structural characteristics. The results suggest that amino acid preference for secondary structure type is based on the structural characteristics at the Cb and Cg atoms of the amino acid. Alpha-helix admirers do not have polar heteroatoms on the Cb and Cg atoms, nor branching or an aromatic group on the Cb atom. Amino acids that have aromatic groups or branching on the Cb atom are strand admirers. Turn and bend admirers have a polar heteroatom on the Cb or Cg atom or do not have a Cb atom at all. Our results indicate that polarity and capability for hydrogen bonding influence the secondary structure at some distance, and that amino acid preference for secondary structure is caused by structural properties at the Cb or Cg atoms.
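The position-dependent correlation described above can be sketched as follows. The abstract does not specify the correlation measure, so this example assumes the phi coefficient (the Pearson correlation of two 0/1 indicators): "residue at position i is amino acid `aa`" versus "secondary structure at position i + d is `ss_type`". The toy sequence, the toy secondary structure string, and the function name are invented for illustration.

```python
import math


def offset_correlation(seq, ss, aa, ss_type, d):
    """Phi coefficient between (seq[i] == aa) and (ss[i + d] == ss_type)."""
    pairs = [(seq[i] == aa, ss[i + d] == ss_type)
             for i in range(len(seq)) if 0 <= i + d < len(ss)]
    n = len(pairs)
    sx = sum(1 for x, _ in pairs if x)
    sy = sum(1 for _, y in pairs if y)
    sxy = sum(1 for x, y in pairs if x and y)
    denom = math.sqrt((n * sx - sx * sx) * (n * sy - sy * sy))
    if denom == 0:
        return 0.0  # one indicator is constant, so correlation is undefined
    return (n * sxy - sx * sy) / denom


# Toy data: one-letter residues and one-letter secondary structure codes.
seq = "AAAAGGGG"
ss = "HHHHCCCC"

# Correlation profile of alanine with helix for offsets -3..3.
profile = {d: offset_correlation(seq, ss, "A", "H", d) for d in range(-3, 4)}
```

In the study the offsets run from -25 to 25 and cover all amino acid and secondary structure type pairs. In practice the counts would be pooled over many proteins rather than taken from a single chain.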
Surface-enhanced Raman spectroscopy (SERS) is a sensitive label-free optical method that can provide fingerprint Raman spectra of biomolecules such as DNA, amino acids and proteins. While SERS of a single DNA molecule has recently been demonstrated, Raman analysis of a single protein sequence has not been possible because the SERS spectra of proteins are usually dominated by signals of aromatic amino acid residues. Here, we used an electroplasmonic approach to trap a single gold nanoparticle in a nanohole, generating a plasmonic nanocavity between the trapped nanoparticle and the nanopore wall. The giant field generated in the nanocavity was so sensitive and localized that it enabled SERS discrimination of 10 distinct amino acids at the single-molecule level. The obtained spectra were used to analyze the spectra of 2 biomarkers (Vasopressin and Oxytocin), each made of a short sequence of 9 amino acids. Significantly, we demonstrated identification of single non-aromatic amino acid residues in a single short peptide chain, as well as discrimination between two peptides whose sequences differ in 2 specific amino acids. Our results demonstrate the high sensitivity of our method in identifying a single amino acid residue in a protein chain and its potential for further applications in proteomics and single-protein sequencing.
A deep neural network based architecture was constructed to predict amino acid side chain conformation with unprecedented accuracy. Amino acid side chain conformation prediction is essential for protein homology modeling and protein design. Current widely adopted methods use physics-based energy functions to evaluate side chain conformation. Here, using a deep neural network architecture without physics-based assumptions, we have demonstrated that side chain conformation prediction accuracy can be improved by more than 25% compared with current standard methods, especially for aromatic residues. More strikingly, the prediction method presented here is robust enough to identify individual conformational outliers in high-resolution structures from the Protein Data Bank without being given their structure factors. We envisage that our amino acid side chain predictor could be used as a quality check step in future protein structure model validation and in many other potential applications, such as side chain assignment in cryo-electron microscopy, crystallography model auto-building, protein folding and small molecule ligand docking.