Determining the Unithood of Word Sequences using Mutual Information and Independence Measure

Published by Wilson Wong
Publication date: 2008
Research field: Informatics engineering
Language: English




Most work related to unithood has been conducted as part of a larger effort to determine termhood. Consequently, the number of independent studies that examine the notion of unithood and produce dedicated techniques for measuring it is extremely small. We propose a new approach, independent of any influence of termhood, that provides dedicated measures to gather linguistic evidence from parsed text and statistical evidence from the Google search engine for the measurement of unithood. Our evaluations revealed a precision and recall of 98.68% and 91.82% respectively, with an accuracy of 95.42%, in measuring the unithood of 1005 test cases.
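
As a rough illustration of how co-occurrence statistics can feed a mutual-information style score for a word pair, the sketch below computes generic pointwise mutual information from raw counts. It is not the measure proposed in the paper: the function name `pmi`, its parameters, and the idea of passing search-engine page counts as the inputs are assumptions made for illustration only.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (in bits) from raw occurrence counts.

    count_xy: hypothetical count of documents containing both words together
    count_x, count_y: hypothetical counts for each word on its own
    total: hypothetical size of the collection the counts come from
    """
    if min(count_xy, count_x, count_y, total) <= 0:
        raise ValueError("all counts must be positive")
    # p(xy) / (p(x) * p(y)) simplifies to count_xy * total / (count_x * count_y)
    ratio = (count_xy * total) / (count_x * count_y)
    return math.log2(ratio)
```

A score near zero means the two words co-occur about as often as chance would predict; a large positive score suggests the pair behaves like a unit. Any threshold for accepting a pair would have to be chosen separately.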

Read also

Most research related to unithood has been conducted as part of a larger effort to determine termhood. Consequently, novelties are rare in this small sub-field of term extraction. In addition, existing work has mostly been empirically motivated and derived. We propose a new probabilistically-derived measure, independent of any influence of termhood, that provides dedicated measures to gather linguistic evidence from parsed text and statistical evidence from the Google search engine for the measurement of unithood. Our comparative study using 1,825 test cases against an existing empirically-derived function revealed an improvement in terms of precision, recall and accuracy.
Functional protein-protein interactions are crucial in most cellular processes. They enable multi-protein complexes to assemble and to remain stable, and they allow signal transduction in various pathways. Functional interactions between proteins result in coevolution between the interacting partners, and thus in correlations between their sequences. Pairwise maximum-entropy based models have enabled successful inference of pairs of amino-acid residues that are in contact in the three-dimensional structure of multi-protein complexes, starting from the correlations in the sequence data of known interaction partners. Recently, algorithms inspired by these methods have been developed to identify which proteins are functional interaction partners among the paralogous proteins of two families, starting from sequence data alone. Here, we demonstrate that a slightly higher performance for partner identification can be reached by an approximate maximization of the mutual information between the sequence alignments of the two protein families. Our mutual information-based method also provides signatures of the existence of interactions between protein families. These results stand in contrast with structure prediction of proteins and of multi-protein complexes from sequence data, where pairwise maximum-entropy based global statistical models substantially improve performance compared to mutual information. Our findings entail that the statistical dependences allowing interaction partner prediction from sequence data are not restricted to the residue pairs that are in direct contact at the interface between the partner proteins.
The achievable error-exponent pairs for the type I and type II errors are characterized in a hypothesis testing setup where the observation consists of independent and identically distributed samples from either a known joint probability distribution or an unknown product distribution. The empirical mutual information test, the Hoeffding test, and the generalized likelihood-ratio test are all shown to be asymptotically optimal. An expression based on a Renyi measure of dependence is shown to be the Fenchel biconjugate of the error-exponent function obtained by fixing one error exponent and optimizing the other. An example is provided where the error-exponent function is not convex and thus not equal to its Fenchel biconjugate.
We derive independence tests by thresholding dependence measures in a semiparametric context. Precisely, estimates of phi-mutual informations, associated with phi-divergences between a joint distribution and the product distribution of its margins, are derived through the dual representation of phi-divergences. The asymptotic properties of the proposed estimates are established, including consistency, asymptotic distributions and a large deviations principle. The obtained tests of independence are compared via their relative asymptotic Bahadur efficiency and numerical simulations. It follows that the proposed semiparametric Kullback-Leibler mutual information test is the optimal one. On the other hand, the proposed approach provides a new method for estimating the Kullback-Leibler mutual information in a semiparametric setting, as well as a model selection procedure in a large class of dependency models including semiparametric copulas.
The information theoretic quantity known as mutual information finds wide use in classification and community detection analyses to compare two classifications of the same set of objects into groups. In the context of classification algorithms, for instance, it is often used to compare discovered classes to known ground truth and hence to quantify algorithm performance. Here we argue that the standard mutual information, as commonly defined, omits a crucial term which can become large under real-world conditions, producing results that can be substantially in error. We demonstrate how to correct this error and define a mutual information that works in all cases. We discuss practical implementation of the new measure and give some example applications.
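
For orientation, the standard mutual information between two labelings of the same objects can be sketched as below. This is only the commonly used plug-in definition that the abstract takes as its starting point; it does not include the correction term the authors propose, and the function name `mutual_information` and its parameters are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Standard (plug-in) mutual information, in bits, between two labelings.

    labels_a and labels_b are equal-length sequences; position i holds the
    group assigned to object i under each classification.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("labelings must be non-empty and of equal length")
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))  # counts of each (a, b) pair
    count_a = Counter(labels_a)               # marginal counts, first labeling
    count_b = Counter(labels_b)               # marginal counts, second labeling
    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / n) * math.log2((n_ab * n) / (count_a[a] * count_b[b]))
    return mi
```

Two identical labelings with two equally sized groups give one bit, while two labelings that are statistically unrelated give zero.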
