ترغب بنشر مسار تعليمي؟ اضغط هنا

Substructure Discovery Using Minimum Description Length and Background Knowledge

72   0   0.0 ( 0 )
 نشر من قبل ul
 تاريخ النشر 1994
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

The ability to identify interesting and repetitive substructures is an essential component to discovering knowledge in structural data. We describe a new version of our SUBDUE substructure discovery system based on the minimum description length principle. The SUBDUE system discovers substructures that compress the original data and represent structural concepts in the data. By replacing previously-discovered substructures in the data, multiple passes of SUBDUE produce a hierarchical description of the structural regularities in the data. SUBDUE uses a computationally-bounded inexact graph match that identifies similar, but not identical, instances of a substructure and finds an approximate measure of closeness of two substructures when under computational constraints. In addition to the minimum description length principle, other background knowledge can be used by SUBDUE to guide the search towards more appropriate substructures. Experiments in a variety of domains demonstrate SUBDUEs ability to find substructures capable of compressing the original data and to discover structural concepts important to the domain. Description of Online Appendix: This is a compressed tar file containing the SUBDUE discovery system, written in C. The program accepts as input databases represented in graph form, and will output discovered substructures with their corresponding value.



قيم البحث

اقرأ أيضاً

This is an up-to-date introduction to and overview of the Minimum Description Length (MDL) Principle, a theory of inductive inference that can be applied to general problems in statistics, machine learning and pattern recognition. While MDL was origi nally based on data compression ideas, this introduction can be read without any knowledge thereof. It takes into account all major developments since 2007, the last time an extensive overview was written. These include new methods for model selection and averaging and hypothesis testing, as well as the first completely general definition of {em MDL estimators}. Incorporating these developments, MDL can be seen as a powerful extension of both penalized likelihood and Bayesian approaches, in which penalization functions and prior distributions are replaced by more general luckiness functions, average-case methodology is replaced by a more robust worst-case approach, and in which methods classically viewed as highly distinct, such as AIC vs BIC and cross-validation vs Bayes can, to a large extent, be viewed from a unified perspective.
Fast and effective unsupervised anomaly detection algorithms have been proposed for categorical data based on the minimum description length (MDL) principle. However, they can be ineffective when detecting anomalies in heterogeneous datasets represen ting a mixture of different sources, such as security scenarios in which system and user processes have distinct behavior patterns. We propose a meta-algorithm for enhancing any MDL-based anomaly detection model to deal with heterogeneous data by fitting a mixture model to the data, via a variant of k-means clustering. Our experimental results show that using a discrete mixture model provides competitive performance relative to two previous anomaly detection algorithms, while mixtures of more sophisticated models yield further gains, on both synthetic datasets and realistic datasets from a security scenario.
We discuss work extraction from classical information engines (e.g., Szilard) with $N$-particles, $q$ partitions, and initial arbitrary non-equilibrium states. In particular, we focus on their {em optimal} behaviour, which includes the measurement of a set of quantities $Phi$ with a feedback protocol that extracts the maximal average amount of work. We show that the optimal non-equilibrium state to which the engine should be driven before the measurement is given by the normalised maximum-likelihood probability distribution of a statistical model that admits $Phi$ as sufficient statistics. Furthermore, we show that the minimax universal code redundancy $mathcal{R}^*$ associated to this model, provides an upper bound to the work that the demon can extract on average from the cycle, in units of $k_{rm B}T$. We also find that, in the limit of $N$ large, the maximum average extracted work cannot exceed $H[Phi]/2$, i.e. one half times the Shannon entropy of the measurement. Our results establish a connection between optimal work extraction in stochastic thermodynamics and optimal universal data compression, providing design principles for optimal information engines. In particular, they suggest that: (i) optimal coding is thermodynamically efficient, and (ii) it is essential to drive the system into a critical state in order to achieve optimal performance.
Open-domain question answering (QA) is an important problem in AI and NLP that is emerging as a bellwether for progress on the generalizability of AI methods and techniques. Much of the progress in open-domain QA systems has been realized through adv ances in information retrieval methods and corpus construction. In this paper, we focus on the recently introduced ARC Challenge dataset, which contains 2,590 multiple choice questions authored for grade-school science exams. These questions are selected to be the most challenging for current QA systems, and current state of the art performance is only slightly better than random chance. We present a system that rewrites a given question into queries that are used to retrieve supporting text from a large corpus of science-related text. Our rewriter is able to incorporate background knowledge from ConceptNet and -- in tandem with a generic textual entailment system trained on SciTail that identifies support in the retrieved results -- outperforms several strong baselines on the end-to-end QA task despite only being trained to identify essential terms in the original source question. We use a generalizable decision methodology over the retrieved evidence and answer candidates to select the best answer. By combining query rewriting, background knowledge, and textual entailment our system is able to outperform several strong baselines on the ARC dataset.
141 - Fadi Badra 2009
Adaptation has long been considered as the Achilles heel of case-based reasoning since it requires some domain-specific knowledge that is difficult to acquire. In this paper, two strategies are combined in order to reduce the knowledge engineering co st induced by the adaptation knowledge (CA) acquisition task: CA is learned from the case base by the means of knowledge discovery techniques, and the CA acquisition sessions are opportunistically triggered, i.e., at problem-solving time.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا