The probability distribution of sequences with maximum entropy that satisfies a given amino acid composition at each site and a given pairwise amino acid frequency at each site pair is a Boltzmann distribution with $\exp(-\psi_N)$, where the total interaction $\psi_N$ is represented as the sum of one-body and pairwise interactions. A protein folding theory based on the random energy model (REM) indicates that the equilibrium ensemble of natural protein sequences is a canonical ensemble characterized by $\exp(-\Delta G_{ND}/k_B T_s)$ or by $\exp(-G_N/k_B T_s)$ if an amino acid composition is kept constant, meaning $\psi_N = \Delta G_{ND}/k_B T_s +$ constant, where $\Delta G_{ND} \equiv G_N - G_D$, $G_N$ and $G_D$ are the native and denatured free energies, and $T_s$ is the effective temperature of natural selection. Here, we examine interaction changes ($\Delta \psi_N$) due to single nucleotide nonsynonymous mutations, and have found that the variance of their $\Delta \psi_N$ over all sites hardly depends on the $\psi_N$ of each homologous sequence, indicating that the variance of $\Delta G_N (= k_B T_s \Delta \psi_N)$ is nearly constant irrespective of protein families. As a result, $T_s$ is estimated from the ratio of the variance of $\Delta \psi_N$ to that of a reference protein, which is determined by a direct comparison between $\Delta\Delta \psi_{ND} (\simeq \Delta \psi_N)$ and experimental $\Delta\Delta G_{ND}$. Based on the REM, glass transition temperature $T_g$ and $\Delta G_{ND}$ are estimated from $T_s$ and experimental melting temperatures ($T_m$) for 14 protein domains. The estimates of $\Delta G_{ND}$ agree well with their experimental values for 5 proteins, and those of $T_s$ and $T_g$ are all within a reasonable range. This method is coarse-grained but much simpler in estimating $T_s$, $T_g$ and $\Delta\Delta G_{ND}$ than previous methods.
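The variance-ratio estimate of $T_s$ described above can be sketched in a few lines. If the variance of $\Delta G_N = k_B T_s \Delta \psi_N$ is taken to be the same for every protein family, then $(k_B T_s)^2 \mathrm{Var}(\Delta \psi_N)$ equals the corresponding quantity for the reference protein, so $T_s$ scales with the square root of the variance ratio. The function name `estimate_ts`, its parameters, and the numbers in the usage lines are hypothetical illustrations and do not come from the abstract.

```python
import math


def estimate_ts(var_dpsi, var_dpsi_ref, ts_ref):
    """Estimate the selective temperature T_s of a protein family.

    Assumes Var(Delta G_N) = (k_B T_s)^2 * Var(Delta psi_N) is identical
    for the target family and the reference protein, which gives
        T_s = T_s_ref * sqrt(Var_ref(Delta psi_N) / Var(Delta psi_N)).

    var_dpsi     -- variance of Delta psi_N over all sites (target family)
    var_dpsi_ref -- the same variance for the reference protein
    ts_ref       -- T_s of the reference protein (any temperature unit)
    """
    if var_dpsi <= 0 or var_dpsi_ref <= 0:
        raise ValueError("variances must be positive")
    return ts_ref * math.sqrt(var_dpsi_ref / var_dpsi)


# Hypothetical inputs, chosen only to show the scaling behaviour.
same_variance = estimate_ts(1.0, 1.0, 100.0)     # identical variance: T_s equals the reference
quarter_variance = estimate_ts(0.25, 1.0, 100.0)  # 4x smaller variance: T_s doubles
```

A family whose $\Delta \psi_N$ values are less spread out than the reference therefore receives a higher estimated $T_s$, which is the direction implied by a constant variance of $\Delta G_N$.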
The common understanding of protein evolution has been that neutral or slightly deleterious mutations are fixed by random drift, and evolutionary rate is determined primarily by the proportion of neutral mutations. However, recent studies have revealed that highly expressed genes evolve slowly because of fitness costs due to misfolded proteins. Here we study selection maintaining protein stability. Protein fitness is taken to be $s = \kappa \exp(\beta\Delta G) (1 - \exp(\beta\Delta\Delta G))$, where $s$ and $\Delta\Delta G$ are selective advantage and stability change of a mutant protein, $\Delta G$ is the folding free energy of the wild-type protein, and $\kappa$ represents protein abundance and indispensability. The distribution of $\Delta\Delta G$ is approximated to be a bi-Gaussian function, which represents structurally slightly- or highly-constrained sites. Also, the mean of the distribution is negatively proportional to $\Delta G$. The evolution of this gene has an equilibrium ($\Delta G_e$) of protein stability, the range of which is consistent with experimental values. The probability distribution of $K_a/K_s$, the ratio of nonsynonymous to synonymous substitution rate per site, over fixed mutants in the vicinity of the equilibrium shows that nearly neutral selection is predominant only in low-abundant, non-essential proteins of $\Delta G_e > -2.5$ kcal/mol. In the other proteins, positive selection on stabilizing mutations is significant to maintain protein stability at equilibrium as well as random drift on slightly negative mutations, although the average $\langle K_a/K_s \rangle$ is less than 1. Slow evolutionary rates can be caused by high protein abundance/indispensability, which produces positive shifts of $\Delta\Delta G$ through decreasing $\Delta G_e$, and by strong structural constraints, which directly make $\Delta\Delta G$ more positive.
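The fitness expression above is easy to evaluate numerically, which makes its sign behaviour concrete. The sketch below is a direct transcription of that formula. The abstract does not define $\beta$, so the code assumes $\beta = 1/(k_B T)$ with $k_B T \approx 0.593$ kcal/mol (roughly 298 K); the function name and the input values are likewise illustrative assumptions.

```python
import math

# Assumption: beta = 1 / (k_B * T), with k_B * T ~ 0.593 kcal/mol near 298 K.
# The abstract itself does not state the value or definition of beta.
BETA_PER_KCAL_MOL = 1.0 / 0.593


def selective_advantage(ddG, dG, kappa, beta=BETA_PER_KCAL_MOL):
    """Selective advantage s = kappa * exp(beta*dG) * (1 - exp(beta*ddG)).

    ddG   -- stability change of the mutant (kcal/mol); positive is destabilizing
    dG    -- folding free energy of the wild type (kcal/mol); negative is stable
    kappa -- protein abundance/indispensability factor
    """
    return kappa * math.exp(beta * dG) * (1.0 - math.exp(beta * ddG))


# Hypothetical values: a wild type with dG = -5 kcal/mol and kappa = 1.
s_neutral = selective_advantage(0.0, -5.0, 1.0)         # ddG = 0 gives s = 0
s_destabilizing = selective_advantage(1.0, -5.0, 1.0)   # ddG > 0 gives s < 0
s_stabilizing = selective_advantage(-1.0, -5.0, 1.0)    # ddG < 0 gives s > 0
```

Because of the factor $\exp(\beta\Delta G)$, the same $\Delta\Delta G$ has a smaller effect on fitness in a more stable wild type (more negative $\Delta G$), while a larger $\kappa$ scales the effect up.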
The twenty protein-coding amino acids are found in proteomes with different relative abundances. The most abundant amino acid, leucine, is nearly an order of magnitude more prevalent than the least abundant amino acid, cysteine. Amino acid metabolic costs differ similarly, constraining their incorporation into proteins. On the other hand, sequence diversity is necessary for protein folding, function and evolution. Here we present a simple model for a cost-diversity trade-off postulating that natural proteomes minimize amino acid metabolic flux while maximizing sequence entropy. The model explains the relative abundances of amino acids across a diverse set of proteomes. We found that the data are remarkably well explained when the cost function accounts for amino acid chemical decay. More than one hundred proteomes reach comparable solutions to the trade-off by different combinations of cost and diversity. Quantifying the interplay between proteome size and entropy shows that proteomes can get optimally large and diverse.
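One way to picture a cost-diversity trade-off of this kind is as entropy maximization with a linear cost penalty. The abstract does not give the functional form of the model, so the sketch below is an assumption: it maximizes $H(p) - \lambda \sum_i p_i c_i$, whose solution is the Boltzmann-like distribution $p_i \propto \exp(-\lambda c_i)$. The residue labels and cost values are made up for illustration and are not real metabolic costs.

```python
import math


def tradeoff_abundances(costs, lam):
    """Abundances maximizing H(p) - lam * sum(p_i * c_i).

    Assumed model, not taken from the abstract: the maximizer of this
    objective over probability vectors is p_i = exp(-lam * c_i) / Z.

    costs -- dict mapping residue label to a per-residue cost
    lam   -- non-negative weight of cost relative to entropy
    """
    weights = {aa: math.exp(-lam * c) for aa, c in costs.items()}
    z = sum(weights.values())
    return {aa: w / z for aa, w in weights.items()}


def entropy(p):
    """Shannon entropy (in nats) of a probability dict."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)


# Made-up costs for three placeholder residues (arbitrary units).
toy_costs = {"cheap": 1.0, "medium": 2.0, "expensive": 4.0}
abundances = tradeoff_abundances(toy_costs, lam=1.0)
uniform = tradeoff_abundances(toy_costs, lam=0.0)
```

With `lam=0` the cost term vanishes and the distribution is uniform, the maximum-entropy extreme; increasing `lam` shifts abundance toward cheaper residues at the expense of entropy.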
We study a continuous-time dynamical system that models the evolving distribution of genotypes in an infinite population where genomes may have infinitely many or even a continuum of loci, mutations accumulate along lineages without back-mutation, added mutations reduce fitness, and recombination occurs on a faster time scale than mutation and selection. Some features of the model, such as existence and uniqueness of solutions and convergence to the dynamical system of an approximating sequence of discrete-time models, were presented in earlier work by Evans, Steinsaltz, and Wachter for quite general selective costs. Here we study a special case where the selective cost of a genotype with a given accumulation of ancestral mutations from a wild-type ancestor is a sum of costs attributable to each individual mutation plus successive interaction contributions from each $k$-tuple of mutations for $k$ up to some finite ``degree''. Using ideas from complex chemical reaction networks and a novel Lyapunov function, we establish that the phenomenon of mutation-selection balance occurs for such selection costs under mild conditions. That is, we show that the dynamical system has a unique equilibrium and that it converges to this equilibrium from all initial conditions.
The role of positive selection in human evolution remains controversial. On the one hand, scans for positive selection have identified hundreds of candidate loci and the genome-wide patterns of polymorphism show signatures consistent with frequent positive selection. On the other hand, recent studies have argued that many of the candidate loci are false positives and that most apparent genome-wide signatures of adaptation are in fact due to reduction of neutral diversity by linked recurrent deleterious mutations, known as background selection. Here we analyze human polymorphism data from the 1000 Genomes Project (Abecasis et al. 2012) and detect signatures of pervasive positive selection once we correct for the effects of background selection. We show that levels of neutral polymorphism are lower near amino acid substitutions, with the strongest reduction observed specifically near functionally consequential amino acid substitutions. Furthermore, amino acid substitutions are associated with signatures of recent adaptation that should not be generated by background selection, such as the presence of unusually long and frequent haplotypes and specific distortions in the site frequency spectrum. We use forward simulations to show that the observed signatures require a high rate of strongly adaptive substitutions in the vicinity of the amino acid changes. We further demonstrate that the observed signatures of positive selection correlate more strongly with the presence of regulatory sequences, as predicted by ENCODE (Gerstein et al. 2012), than with the positions of amino acid substitutions. Our results establish that adaptation was frequent in human evolution and provide support for the hypothesis of King and Wilson (1975) that adaptive divergence is primarily driven by regulatory changes.