Position-specific scoring matrices (PSSMs) are useful for detecting weak homology in protein sequence analysis, and they are thought to contain some essential signatures of the protein families. To elucidate which ingredients constitute such family-specific signatures, we apply singular value decomposition to a set of PSSMs and examine the properties of the dominant right and left singular vectors. The first right singular vectors were correlated with various amino acid indices, including relative mutability, amino acid composition in the protein interior, hydropathy, or turn propensity, depending on the protein. A significant correlation between the first left singular vector and a measure of site conservation was observed. It is shown that the contribution of the first singular component to the PSSMs acts to disfavor residues at conserved sites that are potentially, but falsely, functionally important. The second right singular vectors were highly correlated with hydrophobicity scales, and the corresponding left singular vectors with contact numbers of protein structures. It is suggested that sequence alignment with a PSSM is essentially equivalent to threading supplemented with functional information. The presented method may be used to separate functionally important sites from structurally important ones, and thus it may be a useful tool for predicting protein functions.
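As a rough illustration of the decomposition described above, the following sketch applies a singular value decomposition to a single PSSM stored as a sites-by-20 array. The function name `pssm_singular_components`, the random placeholder matrix, and the placeholder amino acid index are assumptions made for this example; they are not the data or code used in the study.

```python
import numpy as np


def pssm_singular_components(pssm, k=2):
    """Return the top-k left vectors, singular values, and right vectors.

    pssm: array of shape (n_sites, 20); rows are alignment sites,
    columns are amino acid types (hypothetical layout for this sketch).
    """
    pssm = np.asarray(pssm, dtype=float)
    # Thin SVD: pssm = u @ diag(s) @ vt, singular values sorted descending.
    u, s, vt = np.linalg.svd(pssm, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]


rng = np.random.default_rng(0)
toy_pssm = rng.normal(size=(50, 20))   # placeholder scores, not real data
aa_index = rng.normal(size=20)         # placeholder amino acid index

u, s, vt = pssm_singular_components(toy_pssm, k=2)

# Contribution of the first singular component to the PSSM (rank-1 matrix).
rank1 = s[0] * np.outer(u[:, 0], vt[0])

# Singular vectors are defined only up to sign, so compare |correlation|.
corr = abs(np.corrcoef(vt[0], aa_index)[0, 1])
```

With a real PSSM, `vt[0]` would be compared against published amino acid indices and `u[:, 0]` against a per-site conservation measure; the random inputs here only exercise the mechanics.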
The flexibility in gap cost enjoyed by Hidden Markov Models (HMMs) is expected to afford them better retrieval accuracy than position-specific scoring matrices (PSSMs). We attempt to quantify the effect of more general gap parameters by separately examining the influence of position- and composition-specific gap scores, as well as by comparing the retrieval accuracy of the PSSMs constructed using an iterative procedure to that of the HMMs provided by Pfam and SUPERFAMILY, curated ensembles of multiple alignments. We found that position-specific gap penalties have an advantage over uniform gap costs. We did not explore optimizing distinct uniform gap costs for each query. For Pfam, PSSMs iteratively constructed from seeds based on HMM consensus sequences perform equivalently to HMMs that were adjusted to have constant gap transition probabilities, albeit with much greater variance. We observed no effect of composition-specific gap costs on retrieval performance.
In the present work, we review the fundamental methods developed in the last few years for classifying the distribution of amino acids in protein databases into families and clans. This is done through functions of random variables, namely the entropy measures of the probabilities of occurrence of the amino acids. An intensive study of the Pfam databases is presented, restricted to families that can be represented by rectangular arrays of amino acids with m rows (protein domains) and n columns (amino acids). This work is also an invitation to scientific research groups worldwide to undertake the statistical analysis with different numbers of rows and columns, since we believe that the mathematical characterization of the distribution of amino acids provides fundamental insight into the determination of protein structure and evolution.
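To make the idea of an entropy measure on an m-by-n array concrete, the sketch below computes the Shannon entropy of the amino acid frequencies in each column of a small rectangular block. Shannon entropy is used here only as one familiar example of an entropy measure; the function name `column_entropies` and the toy block are assumptions for illustration.

```python
import math
from collections import Counter


def column_entropies(rows):
    """Shannon entropy (in bits) of each column of an m x n block.

    rows: list of m equal-length strings, one per protein domain.
    """
    if not rows:
        raise ValueError("the block must contain at least one row")
    n = len(rows[0])
    if any(len(r) != n for r in rows):
        raise ValueError("all rows must have the same length")
    m = len(rows)
    entropies = []
    for j in range(n):
        # Empirical probability of each amino acid in column j.
        counts = Counter(r[j] for r in rows)
        h = sum(-(c / m) * math.log2(c / m) for c in counts.values())
        entropies.append(h)
    return entropies


toy_block = ["ACDA", "ACEA", "AGDC", "AGEC"]  # hypothetical 4 x 4 block
entropies = column_entropies(toy_block)
```

A fully conserved column gives zero entropy, while a column split evenly between two residues gives one bit.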
We present the analytical singular value decomposition of the stoichiometry matrix for a spatially discrete reaction-diffusion system on a one-dimensional domain. The domain has two subregions which share a single common boundary. Each of the subregions is further partitioned into a finite number of compartments. Chemical reactions can occur within a compartment, whereas diffusion is represented as movement between adjacent compartments. Inspired by biology, we study both 1) the case where the reactions on each side of the boundary are different and only certain species diffuse across the boundary and 2) the case with spatially homogeneous reactions and diffusion. We write the stoichiometry matrix for these two classes of systems using a Kronecker product formulation. For the first scenario, we apply linear perturbation theory to derive an approximate singular value decomposition in the limit as diffusion becomes much faster than reactions. For the second scenario, we derive an exact analytical singular value decomposition for all relative diffusion and reaction time scales. By writing the stoichiometry matrix using Kronecker products, we show that the singular vectors and values can also be written concisely using Kronecker products. Ultimately, we find that the singular value decomposition of the reaction-diffusion stoichiometry matrix depends on the singular value decompositions of smaller matrices. These smaller matrices represent modifie
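The general linear-algebra fact behind expressing singular vectors and values with Kronecker products can be checked numerically. The sketch below uses two arbitrary random matrices, not the reaction-diffusion stoichiometry matrix of the abstract, and verifies that a singular value decomposition of a Kronecker product A ⊗ B can be assembled from the decompositions of A and B.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))  # arbitrary stand-in matrices (assumption)
B = rng.normal(size=(2, 5))

# Thin SVDs of the two small factors.
Ua, sa, Vta = np.linalg.svd(A, full_matrices=False)
Ub, sb, Vtb = np.linalg.svd(B, full_matrices=False)

# Assemble the factors of kron(A, B) from Kronecker products of the parts.
U = np.kron(Ua, Ub)
s = np.kron(sa, sb)       # all pairwise products of singular values
Vt = np.kron(Vta, Vtb)

reconstructed = U @ np.diag(s) @ Vt

# Singular values computed directly from the large matrix, for comparison.
direct = np.linalg.svd(np.kron(A, B), compute_uv=False)
```

The products in `s` are not sorted, so they match `direct` only after sorting in descending order.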
Physically, disordered ensembles of non-homopolymeric polypeptides are expected to be heterogeneous; i.e., they should differ from those homogeneous ensembles of homopolymers that harbor an essentially unique relationship between average values of end-to-end distance $R_{\rm EE}$ and radius of gyration $R_{\rm g}$. It was posited recently, however, that small-angle X-ray scattering (SAXS) data on conformational dimensions of disordered proteins can be rationalized almost exclusively by homopolymer ensembles. Assessing this perspective, chain-model simulations are used to evaluate the discriminatory power of SAXS-determined molecular form factors (MFFs) with regard to homogeneous versus heterogeneous ensembles. The general approach adopted here is not bound by any assumption about ensemble encodability, in that the postulated heterogeneous ensembles we evaluated are not restricted to those entailed by simple interaction schemes. Our analysis of MFFs for certain heterogeneous ensembles with more narrowly distributed $R_{\rm EE}$ and $R_{\rm g}$ indicates that while they deviate from MFFs of homogeneous ensembles, the differences can be rather small. Remarkably, some heterogeneous ensembles with asphericity and $R_{\rm EE}$ drastically different from those of homogeneous ensembles can nonetheless exhibit practically identical MFFs, demonstrating that SAXS MFFs do not afford unique characterizations of basic properties of conformational ensembles in general. In other words, the ensemble-to-MFF mapping is practically many-to-one and likely non-smooth. Heteropolymeric variations of the $R_{\rm EE}$--$R_{\rm g}$ relationship were further showcased using an analytical perturbation theory developed here for flexible heteropolymers. Ramifications of our findings for interpretation of experimental data are discussed.
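For readers unfamiliar with the two conformational dimensions discussed above, the sketch below computes the end-to-end distance and the radius of gyration from bead coordinates, and evaluates them for an ensemble of ideal Gaussian chains. The Gaussian random walk is a textbook homopolymer stand-in chosen for this example, not one of the chain models simulated in the study, and all function names are assumptions.

```python
import numpy as np


def end_to_end_distance(coords):
    """Distance between the first and last bead; coords has shape (..., n_beads, 3)."""
    return np.linalg.norm(coords[..., -1, :] - coords[..., 0, :], axis=-1)


def radius_of_gyration(coords):
    """Root-mean-square distance of beads from their centroid (equal masses)."""
    centered = coords - coords.mean(axis=-2, keepdims=True)
    return np.sqrt((centered ** 2).sum(axis=-1).mean(axis=-1))


def gaussian_chains(n_chains, n_bonds, rng):
    """Ideal chains: each bond vector is an independent 3D Gaussian step."""
    steps = rng.normal(size=(n_chains, n_bonds, 3))
    origin = np.zeros((n_chains, 1, 3))
    return np.concatenate([origin, np.cumsum(steps, axis=1)], axis=1)


rng = np.random.default_rng(2)
chains = gaussian_chains(n_chains=2000, n_bonds=200, rng=rng)

ree = end_to_end_distance(chains)
rg = radius_of_gyration(chains)

# For long ideal chains, <R_EE^2> / <R_g^2> approaches 6.
ratio = (ree ** 2).mean() / (rg ** 2).mean()
```

For such ideal chains the ratio of mean squared values is close to 6, which is the kind of fixed relationship a heterogeneous ensemble need not obey.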
Here we present ComPPI, a cellular compartment-specific database of proteins and their interactions enabling an extensive, compartmentalized protein-protein interaction network analysis (http://ComPPI.LinkGroup.hu). ComPPI enables the user to filter out biologically unlikely interactions, where the two interacting proteins have no common subcellular localizations, and to predict novel properties, such as compartment-specific biological functions. ComPPI is an integrated database covering four species (S. cerevisiae, C. elegans, D. melanogaster and H. sapiens). The compilation of nine protein-protein interaction and eight subcellular localization data sets had four curation steps, including a manually built, comprehensive hierarchical structure of more than 1600 subcellular localizations. ComPPI provides confidence scores for protein subcellular localizations and protein-protein interactions. ComPPI has user-friendly search options for individual proteins, giving their subcellular localization, their interactions and the likelihood of their interactions considering the subcellular localization of their interacting partners. Download options for search results, whole proteomes, organelle-specific interactomes and subcellular localization data are available on its website. Due to its novel features, ComPPI is useful for the analysis of experimental results in biochemistry and molecular biology, as well as for proteome-wide studies in bioinformatics and network science, helping cellular biology, medicine and drug design.
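The filtering idea described above, discarding interactions whose partners share no subcellular localization, can be sketched in a few lines. This is not ComPPI's interface or file format; the function `filter_by_shared_compartment`, the data structures, and the toy entries are hypothetical and serve only to illustrate the logic.

```python
def filter_by_shared_compartment(interactions, localizations):
    """Keep interactions whose two proteins share at least one compartment.

    interactions: iterable of (protein_a, protein_b) pairs.
    localizations: dict mapping protein -> set of compartment names.
    Returns a list of (protein_a, protein_b, shared_compartments).
    """
    kept = []
    for a, b in interactions:
        # Proteins with no known localization are treated as having none.
        shared = localizations.get(a, set()) & localizations.get(b, set())
        if shared:
            kept.append((a, b, sorted(shared)))
    return kept


# Hypothetical toy data, not taken from the database.
toy_localizations = {
    "P1": {"nucleus", "cytosol"},
    "P2": {"nucleus"},
    "P3": {"mitochondrion"},
}
toy_interactions = [("P1", "P2"), ("P1", "P3"), ("P2", "P4")]

filtered = filter_by_shared_compartment(toy_interactions, toy_localizations)
```

Here only the P1-P2 pair survives, because it is the only one with a compartment in common; a confidence-weighted version would replace the set intersection with a score combining localization confidences.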