
Pattern recognition on random trees associated to protein functionality families

Published by Georgina Flesia MS
Publication date: 2007
Research field: Mathematical Statistics
Paper language: English





In this paper, we address the problem of identifying protein functionality using the information contained in its amino acid sequence. We propose a method to define sequence similarity relationships that can be used as input for classification and clustering via well-known metric-based statistical methods. In our examples, we specifically address two problems of supervised and unsupervised learning in structural genomics via simple metric-based techniques on the space of trees: 1) unsupervised detection of functionality families via K-means clustering in the space of trees, and 2) classification of new proteins into known families via k-nearest-neighbour trees. We found evidence that the similarity measure induced by our approach concentrates information for discrimination. Classification has the same high performance as other VLMC approaches. Clustering is a harder task, but our approach to clustering is alignment-free and automatic, and may lead to many interesting variations by choosing other clustering or classification procedures that are based on pre-computed similarity information, such as those that perform clustering using flow simulation; see (Yona et al., 2000; Enright et al., 2003).
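The k-nearest-neighbour step described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: each context tree is represented as a set of context strings (its nodes), and the distance between two trees is the size of the symmetric difference of those sets. The function names, the toy trees, and the family labels are all hypothetical, and the paper's actual tree metric may differ.

```python
from collections import Counter

def tree_distance(a: frozenset, b: frozenset) -> int:
    """Assumed metric: number of nodes present in exactly one of the two trees."""
    return len(a ^ b)

def knn_classify(query, labelled_trees, k=3):
    """Return the majority label among the k trees closest to `query`."""
    neighbours = sorted(
        labelled_trees,
        key=lambda item: tree_distance(query, item[0]),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (context tree as a set of contexts, family label).
training = [
    (frozenset({"", "A", "L", "AL"}), "family_1"),
    (frozenset({"", "A", "L", "LA"}), "family_1"),
    (frozenset({"", "A", "AL", "GAL"}), "family_1"),
    (frozenset({"", "K", "R", "KR"}), "family_2"),
    (frozenset({"", "K", "R", "RK"}), "family_2"),
]

# Hypothetical tree estimated from a new, unlabelled protein sequence.
query = frozenset({"", "A", "L", "AL", "GAL"})
print(knn_classify(query, training, k=3))
```

Because the classifier only needs pairwise distances, any other tree metric can be swapped in for `tree_distance` without touching the voting logic.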


Read also

Efficient automatic protein classification is of central importance in genomic annotation. As an independent way to check the reliability of the classification, we propose a statistical approach to test whether two sets of protein domain sequences coming from two families of the Pfam database are significantly different. We model protein sequences as realizations of Variable Length Markov Chains (VLMC) and we use the context trees as a signature of each protein family. Our approach is based on a Kolmogorov-Smirnov-type goodness-of-fit test proposed by Balding et al. [Limit theorems for sequences of random trees (2008), DOI: 10.1007/s11749-008-0092-z]. The test statistic is a supremum over the space of trees of a function of the two samples; its computation grows, in principle, exponentially fast with the maximal number of nodes of the potential trees. We show how to transform this problem into a max-flow problem over a related graph, which can be solved using a Ford-Fulkerson algorithm in time polynomial in that number. We apply the test to 10 randomly chosen protein domain families from the seed of the Pfam-A database (high-quality, manually curated families). The test shows that the distributions of context trees coming from different families are significantly different. We emphasize that this is a novel mathematical approach to validate the automatic clustering of sequences in any context. We also study the performance of the test via simulations on Galton-Watson related processes.
Proteins are essential components of living systems, capable of performing a huge variety of tasks at the molecular level, such as recognition, signalling, copy, and transport. The protein sequences realizing a given function may vary widely across organisms, giving rise to a protein family. Here, we estimate the entropy of those families based on different approaches, including Hidden Markov Models used for protein databases and inferred statistical models reproducing the low-order (1- and 2-point) statistics of multi-sequence alignments. We also compute the entropic cost, that is, the loss in entropy resulting from a constraint acting on the protein, such as the fixation of one particular amino acid on a specific site, and relate this notion to the escape probability of the HIV virus. The case of lattice proteins, for which the entropy can be computed exactly, allows us to provide another illustration of the concept of cost, due to the competition of different folds. The relevance of the entropy in relation to directed evolution experiments is stressed.
Protein pattern formation is essential for the spatial organization of many intracellular processes like cell division, flagellum positioning, and chemotaxis. A prominent example of intracellular patterns is the pole-to-pole oscillation of Min proteins in E. coli, whose biological function is to ensure precise cell division. Cell polarization, a prerequisite for processes such as stem cell differentiation and cell polarity in yeast, is also mediated by a diffusion-reaction process. More generally, these functional modules of cells serve as model systems for self-organization, one of the core principles of life. Under which conditions spatio-temporal patterns emerge, and how these patterns are regulated by biochemical and geometrical factors, are major aspects of current research. Here we review recent theoretical and experimental advances in the field of intracellular pattern formation, focusing on general design principles and fundamental physical mechanisms.
We study density requirements on a given Banach space that guarantee the existence of subsymmetric basic sequences by extending Tsirelson's well-known space to larger index sets. We prove that for every cardinal $\kappa$ smaller than the first Mahlo cardinal there is a reflexive Banach space of density $\kappa$ without subsymmetric basic sequences. As for Tsirelson's space, our construction is based on the existence of a rich collection of homogeneous families on large index sets for which one can estimate the complexity on any given infinite set. This is used to describe in detail the asymptotic structure of the spaces. The collections of families are of independent interest and their existence is proved inductively. The fundamental stepping-up argument is the analysis of such collections of families on trees.
The Maki-Thompson rumor model is defined by assuming that a population represented by a graph is subdivided into three classes of individuals, namely ignorants, spreaders and stiflers. A spreader tells the rumor to any of its nearest ignorant neighbors at rate one. At the same rate, a spreader becomes a stifler after a contact with other nearest-neighbor spreaders or stiflers. In this work we study the model on random trees. As usual, we define a critical parameter of the model as the critical value around which the rumor either becomes extinct almost surely or survives with positive probability. We analyze the existence of a phase transition regarding the survival of the rumor, and we obtain estimates for the mean range of the rumor. The applicability of our results is illustrated with examples on random trees generated from some well-known discrete distributions.
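The dynamics described in the abstract above can be sketched as a small simulation. This is an illustrative sketch under stated assumptions, not the authors' code: the tree is a fixed finite one given as an adjacency dictionary rather than a random tree, and the continuous-time rates are replaced by the embedded jump chain, in which each step picks one (spreader, neighbor) pair uniformly at random. The function name `simulate_maki_thompson` and the example tree are hypothetical.

```python
import random

def simulate_maki_thompson(adjacency, source, rng):
    """Run the rumor process until no spreader can make a contact.

    adjacency: dict mapping each vertex to a list of its neighbors.
    source:    the vertex that initially knows the rumor.
    rng:       a random.Random instance (passed in for reproducibility).
    """
    state = {v: "ignorant" for v in adjacency}
    state[source] = "spreader"
    while True:
        # Every directed (spreader, neighbor) pair is a possible contact.
        contacts = [
            (u, v)
            for u in adjacency
            if state[u] == "spreader"
            for v in adjacency[u]
        ]
        if not contacts:
            break
        u, v = rng.choice(contacts)
        if state[v] == "ignorant":
            state[v] = "spreader"   # the rumor is passed on
        else:
            state[u] = "stifler"    # contact with a spreader or stifler
    return state

# Hypothetical example: a path with five vertices, rumor started at one end.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
final = simulate_maki_thompson(path, source=0, rng=random.Random(0))
informed = sum(1 for s in final.values() if s != "ignorant")
print(f"{informed} of {len(path)} vertices heard the rumor")
```

Averaging `informed` over many runs and many trees gives a Monte Carlo estimate of the range of the rumor, the quantity the abstract refers to.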