Minimum Description Length Revisited

Published by: Teemu Roos

Publication date: 2019

Research field: Mathematical statistics, Informatics engineering

Language: English

Authors: Peter Grunwald - Teemu Roos

Methodology: Information theory, Machine learning

Abstract

This is an up-to-date introduction to and overview of the Minimum Description Length (MDL) Principle, a theory of inductive inference that can be applied to general problems in statistics, machine learning and pattern recognition. While MDL was originally based on data compression ideas, this introduction can be read without any knowledge thereof. It takes into account all major developments since 2007, the last time an extensive overview was written. These include new methods for model selection and averaging and hypothesis testing, as well as the first completely general definition of MDL estimators. Incorporating these developments, MDL can be seen as a powerful extension of both penalized likelihood and Bayesian approaches, in which penalization functions and prior distributions are replaced by more general luckiness functions, average-case methodology is replaced by a more robust worst-case approach, and in which methods classically viewed as highly distinct, such as AIC vs BIC and cross-validation vs Bayes, can, to a large extent, be viewed from a unified perspective.
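
As a rough illustration of the kind of MDL model selection the abstract refers to, the following Python sketch compares two descriptions of a binary sequence: a fixed fair coin, and the Bernoulli family encoded with the normalized maximum likelihood (NML) code. The function names and the choice of models are assumptions made for this example; they are not taken from the paper, and the direct summation is only practical for small n.

```python
import math


def bernoulli_nml_complexity(n: int) -> float:
    """Log2 of the Shtarkov sum for the Bernoulli model on sequences of length n >= 1.

    This is the worst-case regret (in bits) of the NML code for this model.
    """
    total = 0.0
    for k in range(n + 1):
        p = k / n
        # Maximized likelihood of one sequence with k ones (Python: 0.0 ** 0 == 1.0).
        max_lik = (p ** k) * ((1.0 - p) ** (n - k))
        # There are C(n, k) sequences sharing this maximized likelihood.
        total += math.comb(n, k) * max_lik
    return math.log2(total)


def nml_code_length(k: int, n: int) -> float:
    """NML code length in bits for a sequence of length n containing k ones."""
    p = k / n
    max_lik = (p ** k) * ((1.0 - p) ** (n - k))
    return -math.log2(max_lik) + bernoulli_nml_complexity(n)


def fair_coin_code_length(n: int) -> float:
    """Code length in bits under a fixed fair coin: one bit per outcome."""
    return float(n)


def select_model(k: int, n: int) -> str:
    """Pick the model that gives the shorter description of the data."""
    if nml_code_length(k, n) < fair_coin_code_length(n):
        return "Bernoulli (NML)"
    return "fair coin"


if __name__ == "__main__":
    for ones in (50, 60, 90):
        print(ones, round(nml_code_length(ones, 100), 2), select_model(ones, 100))
```

The complexity term plays the role of a penalty: the richer Bernoulli model is preferred only when its better fit saves more bits than the penalty costs.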

Leo Touzo, Matteo Marsili, Neri Merhav (2020)

We discuss work extraction from classical information engines (e.g., Szilard) with $N$ particles, $q$ partitions, and arbitrary initial non-equilibrium states. In particular, we focus on their optimal behaviour, which includes the measurement of a set of quantities $\Phi$ with a feedback protocol that extracts the maximal average amount of work. We show that the optimal non-equilibrium state to which the engine should be driven before the measurement is given by the normalised maximum-likelihood probability distribution of a statistical model that admits $\Phi$ as sufficient statistics. Furthermore, we show that the minimax universal code redundancy $\mathcal{R}^*$ associated with this model provides an upper bound to the work that the demon can extract on average from the cycle, in units of $k_{\rm B}T$. We also find that, in the limit of large $N$, the maximum average extracted work cannot exceed $H[\Phi]/2$, i.e. one half times the Shannon entropy of the measurement. Our results establish a connection between optimal work extraction in stochastic thermodynamics and optimal universal data compression, providing design principles for optimal information engines. In particular, they suggest that: (i) optimal coding is thermodynamically efficient, and (ii) it is essential to drive the system into a critical state in order to achieve optimal performance.

Statistical mechanics, Information theory

Thermodynamics of the Minimum Description Length on Community Detection

Juan Ignacio Perotti, Claudio Juan Tessone, Aaron Clauset and Guido Caldarelli (2018)

Modern statistical modeling is an important complement to the more traditional approach of physics where Complex Systems are studied by means of extremely simple idealized models. The Minimum Description Length (MDL) is a principled approach to statistical modeling combining Occam's razor with Information Theory for the selection of models providing the most concise descriptions. In this work, we introduce the Boltzmannian MDL (BMDL), a formalization of the principle of MDL with a parametric complexity conveniently formulated as the free-energy of an artificial thermodynamic system. In this way, we leverage the rich theoretical and technical background of statistical mechanics to show the crucial importance that phase transitions and other thermodynamic concepts have on the problem of statistical modeling from an information theoretic point of view. For example, we provide information theoretic justifications of why a high-temperature series expansion can be used to compute systematic approximations of the BMDL when the formalism is used to model data, and why statistically significant model selections can be identified with ordered phases when the BMDL is used to model models. To test the introduced formalism, we compute approximations of BMDL for the problem of community detection in complex networks, where we obtain a principled MDL derivation of the Girvan-Newman (GN) modularity and the Zhang-Moore (ZM) community detection method. Here, by means of analytical estimations and numerical experiments on synthetic and empirical networks, we find that BMDL-based correction terms of the GN modularity improve the quality of the detected communities, and we also find an information theoretic justification of why the ZM criterion for estimation of the number of network communities is better than alternative approaches such as the bare minimization of a free energy.

Physics and society, Disordered systems and neural networks, Social and information networks

Substructure Discovery Using Minimum Description Length and Background Knowledge

D. J. Cook, L. B. Holder (1994)

The ability to identify interesting and repetitive substructures is an essential component to discovering knowledge in structural data. We describe a new version of our SUBDUE substructure discovery system based on the minimum description length principle. The SUBDUE system discovers substructures that compress the original data and represent structural concepts in the data. By replacing previously-discovered substructures in the data, multiple passes of SUBDUE produce a hierarchical description of the structural regularities in the data. SUBDUE uses a computationally-bounded inexact graph match that identifies similar, but not identical, instances of a substructure and finds an approximate measure of closeness of two substructures when under computational constraints. In addition to the minimum description length principle, other background knowledge can be used by SUBDUE to guide the search towards more appropriate substructures. Experiments in a variety of domains demonstrate SUBDUE's ability to find substructures capable of compressing the original data and to discover structural concepts important to the domain. Description of Online Appendix: This is a compressed tar file containing the SUBDUE discovery system, written in C. The program accepts as input databases represented in graph form, and will output discovered substructures with their corresponding value.

Artificial intelligence

Categorical anomaly detection in heterogeneous data using minimum description length clustering

James Cheney, Xavier Gombau, Ghita Berrada (2020)

Fast and effective unsupervised anomaly detection algorithms have been proposed for categorical data based on the minimum description length (MDL) principle. However, they can be ineffective when detecting anomalies in heterogeneous datasets representing a mixture of different sources, such as security scenarios in which system and user processes have distinct behavior patterns. We propose a meta-algorithm for enhancing any MDL-based anomaly detection model to deal with heterogeneous data by fitting a mixture model to the data, via a variant of k-means clustering. Our experimental results show that using a discrete mixture model provides competitive performance relative to two previous anomaly detection algorithms, while mixtures of more sophisticated models yield further gains, on both synthetic datasets and realistic datasets from a security scenario.

Databases, Artificial intelligence

Minimum-Link Paths Revisited

Joseph S. B. Mitchell, Valentin Polishchuk, Mikko Sysikaski (2013)

A path or a polygonal domain is C-oriented if the orientations of its edges belong to a set of C given orientations; this is a generalization of the notable rectilinear case (C = 2). We study exact and approximation algorithms for minimum-link C-oriented paths and paths with unrestricted orientations, both in C-oriented and in general domains. Our two main algorithms are as follows:
- A subquadratic-time algorithm with a non-trivial approximation guarantee for general (unrestricted-orientation) minimum-link paths in general domains.
- An algorithm to find a minimum-link C-oriented path in a C-oriented domain. Our algorithm is simpler and more time-space efficient than the prior algorithm.
We also obtain several related results:
- 3SUM-hardness of determining the link distance with unrestricted orientations (even in a rectilinear domain).
- An optimal algorithm for finding a minimum-link rectilinear path in a rectilinear domain. The algorithm and its analysis are simpler than the existing ones.
- An extension of our methods to find a C-oriented minimum-link path in a general (not necessarily C-oriented) domain.
- A more efficient algorithm to compute a 2-approximate C-oriented minimum-link path.
- A notion of robust paths. We show how minimum-link C-oriented paths approximate the robust paths with unrestricted orientations to within an additive error of 1.

Computational geometry