No Arabic abstract
Empirical substitution matrices represent the average tendencies of substitutions over various protein families by sacrificing gene-level resolution. We develop a codon-based model, in which mutational tendencies of codon, a genetic code, and the strength of selective constraints against amino acid replacements can be tailored to a given gene. First, selective constraints averaged over proteins are estimated by maximizing the likelihood of each 1-PAM matrix of empirical amino acid (JTT, WAG, and LG) and codon (KHG) substitution matrices. Then, selective constraints specific to given proteins are approximated as a linear function of those estimated from the empirical substitution matrices. Akaike information criterion (AIC) values indicate that a model allowing multiple nucleotide changes fits the empirical substitution matrices significantly better. Also, the ML estimates of transition-transversion bias obtained from these empirical matrices are not so large as previously estimated. The selective constraints are characteristic of proteins rather than species. However, their relative strengths among amino acid pairs can be approximated not to depend very much on protein families but amino acid pairs, because the present model, in which selective constraints are approximated to be a linear function of those estimated from the JTT/WAG/LG/KHG matrices, can provide a good fit to other empirical substitution matrices including cpREV for chloroplast proteins and mtREV for vertebrate mitochondrial proteins. The present codon-based model with the ML estimates of selective constraints and with adjustable mutation rates of nucleotide would be useful as a simple substitution model in ML and Bayesian inferences of molecular phylogenetic trees, and enables us to obtain biologically meaningful information at both nucleotide and amino acid levels from codon and protein sequences.
The correlations of primary and secondary structures were analyzed using proteins with known structure from Protein Data Bank. The correlation values of amino acid type and the eight secondary structure types at distant position were calculated for distances between -25 and 25. Shapes of the diagrams indicate that amino acids polarity and capability for hydrogen bonding have influence on the secondary structure at some distances. Clear preference of most of the amino acids towards certain secondary structure type classifies amino acids into four groups: alpha-helix admirers, strand admirers, turn and bend admirers and the others. Group four consists of His and Cis, the amino acids that do not show clear preference for any secondary structure. Amino acids from a group have similar physicochemical properties, and the same structural characteristics. The results suggest that amino acid preference for secondary structure type is based on the structural characteristics at Cb and Cg atoms of amino acid. alpha-helix admirers do not have polar heteroatoms on Cb and Cg atoms, nor branching or aromatic group on Cb atom. Amino acids that have aromatic groups or branching on Cb atom are strand admirers. Turn and bend admirers have polar heteroatom on Cb or Cg atoms or do not have Cb atom at all. Our results indicate that polarity and capability for hydrogen bonding have influence on the secondary structure at some distance, and that amino acid preference for secondary structure is caused by structural properties at Cb or Cg atoms.
In this work it is shown that 20 canonical amino acids (AAs) within genetic code appear to be a whole system with strict AAs positions; more exactly, with AAs ordinal number in three variants; first variant 00-19, second 00-21 and third 00-20. The ordinal number follows from the positions of belonging codons, i.e. their digrams (or doublets). The reading itself is a reading in quaternary numbering system if four bases possess the values within a specific logical square: A = 0, C = 1, G = 2, U = 3. By this, all splittings, distinctions and classifications of AAs appear to be in accordance to atom and nucleon number balance as well as to the other physico-chemical properties, such as hydrophobicity and polarity.
The underlying structure of the canonical amino acid substitution matrix (aaSM) is examined by considering stepwise improvements in the differential recognition of amino acids according to their chemical properties during the branching history of the two aminoacyl-tRNA synthetase (aaRS) superfamilies. The evolutionary expansion of the genetic code is described by a simple parameterization of the aaSM, in which (i) the number of distinguishable amino acid types, (ii) the matrix dimension, and (iii) the number of parameters, each increases by one for each bifurcation in an aaRS phylogeny. Parameterized matrices corresponding to trees in which the size of an amino acid sidechain is the only discernible property behind its categorization as a substrate, exclusively for a Class I or II aaRS, provide a significantly better fit to empirically determined aaSM than trees with random bifurcation patterns. A second split between polar and nonpolar amino acids in each Class effects a vastly greater further improvement. The earliest Class-separated epochs in the phylogenies of the aaRS reflect these enzymes capability to distinguish tRNAs through the recognition of acceptor stem identity elements via the minor (Class I) and major (Class II) helical grooves, which is how the ancient Operational Code functioned. The advent of tRNA recognition using the anticodon loop supports the evolution of the optimal map of amino acid chemistry found in the later Genetic Code, an essentially digital categorization, in which polarity is the major functional property, compensating for the unrefined, haphazard differentiation of amino acids achieved by the Operational Code.
This paper presents, for the first time, four diversity types of protein amino acids. The first type includes two amino acids (G, P), both without standard hydrocarbon side chains; the second one four amino acids, as two pairs [(A, L), (V, I)], all with standard hydrocarbon side chains; the third type comprises the six amino acids, as three pairs [(F, Y), (H, W), (C, M)], two aromatic, two hetero aromatic and two hetero non-aromatic); finally, the fourth type consists of eight amino acids, as four pairs [(S, T), (D, E), (N, Q), (K, R)], all with a functional group which also exists in amino acid functional group (wholly presented: H2N-.CH-COOH; separately: OH, COOH, CONH2, NH2). The insight into existence of four types of diversity was possible only after an insight into the existence of some very new arithmetical regularities, which were so far unknown. Also, as for showing these four types was necessary to reveal the relationships between several key harmonic structures of the genetic code (which we presented in our previous works), this paper is also a review article of the authors researches of the genetic code. By this, the review itself shows that the said harmonic structures are connected through the same (or near the same) chemically determined amino acid pairs, 10 pairs out of the 190 possible.
The twenty protein coding amino acids are found in proteomes with different relative abundances. The most abundant amino acid, leucine, is nearly an order of magnitude more prevalent than the least abundant amino acid, cysteine. Amino acid metabolic costs differ similarly, constraining their incorporation into proteins. On the other hand, sequence diversity is necessary for protein folding, function and evolution. Here we present a simple model for a cost-diversity trade-off postulating that natural proteomes minimize amino acid metabolic flux while maximizing sequence entropy. The model explains the relative abundances of amino acids across a diverse set of proteomes. We found that the data is remarkably well explained when the cost function accounts for amino acid chemical decay. More than one hundred proteomes reach comparable solutions to the trade-off by different combinations of cost and diversity. Quantifying the interplay between proteome size and entropy shows that proteomes can get optimally large and diverse.