Do you want to publish a course? Click here

Evaluating the Impact of Information Distortion on Normalized Compression Distance

222   0   0.0 ( 0 )
 Added by Ana Granados
 Publication date 2008
and research's language is English




Ask ChatGPT about the research

In this paper we apply different techniques of information distortion on a set of classical books written in English. We study the impact that these distortions have upon the Kolmogorov complexity and the clustering by compression technique (the latter based on Normalized Compression Distance, NCD). We show how to decrease the complexity of the considered books introducing several modifications in them. We measure how the information contained in each book is maintained using a clustering error measure. We find experimentally that the best way to keep the clustering error is by means of modifications in the most frequent words. We explain the details of these information distortions and we compare with other kinds of modifications like random word distortions and unfrequent word distortions. Finally, some phenomenological explanations from the different empirical results that have been carried out are presented.



rate research

Read More

While Kolmogorov complexity is the accepted absolute measure of information content in an individual finite object, a similarly absolute notion is needed for the information distance between two individual objects, for example, two pictures. We give several natural definitions of a universal information metric, based on length of shortest programs for either ordinary computations or reversible (dissipationless) computations. It turns out that these definitions are equivalent up to an additive logarithmic term. We show that the information distance is a universal cognitive similarity distance. We investigate the maximal correlation of the shortest programs involved, the maximal uncorrelation of programs (a generalization of the Slepian-Wolf theorem of classical information theory), and the density properties of the discrete metric spaces induced by the information distances. A related distance measures the amount of nonreversibility of a computation. Using the physical theory of reversible computation, we give an appropriate (universal, anti-symmetric, and transitive) measure of the thermodynamic work required to transform one object in another object by the most efficient process. Information distance between individual objects is needed in pattern recognition where one wants to express effective notions of pattern similarity or cognitive similarity between individual objects and in thermodynamics of computation where one wants to analyse the energy dissipation of a computation from a particular input to a particular output.
357 - Paul M.B. Vitanyi 2008
The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, expecially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation.
The objective of this paper is to further investigate various applications of information Nonanticipative Rate Distortion Function (NRDF) by discussing two working examples, the Binary Symmetric Markov Source with parameter $p$ (BSMS($p$)) with Hamming distance distortion, and the multidimensional partially observed Gaussian-Markov source. For the BSMS($p$), we give the solution to the NRDF, and we use it to compute the Rate Loss (RL) of causal codes with respect to noncausal codes. For the multidimensional Gaussian-Markov source, we give the solution to the NRDF, we show its operational meaning via joint source-channel matching over a vector of parallel Gaussian channels, and we compute the RL of causal and zero-delay codes with respect to noncausal codes.
A rate-distortion problem motivated by the consideration of semantic information is formulated and solved. The starting point is to model an information source as a pair consisting of an intrinsic state which is not observable, corresponding to the semantic aspect of the source, and an extrinsic observation which is subject to lossy source coding. The proposed rate-distortion problem seeks a description of the information source, via encoding the extrinsic observation, under two distortion constraints, one for the intrinsic state and the other for the extrinsic observation. The corresponding state-observation rate-distortion function is obtained, and a few case studies of Gaussian intrinsic state estimation and binary intrinsic state classification are studied.
According to Kolmogorov complexity, every finite binary string is compressible to a shortest code -- its information content -- from which it is effectively recoverable. We investigate the extent to which this holds for infinite binary sequences (streams). We devise a new coding method which uniformly codes every stream $X$ into an algorithmically random stream $Y$, in such a way that the first $n$ bits of $X$ are recoverable from the first $I(Xupharpoonright_n)$ bits of $Y$, where $I$ is any partial computable information content measure which is defined on all prefixes of $X$, and where $Xupharpoonright_n$ is the initial segment of $X$ of length $n$. As a consequence, if $g$ is any computable upper bound on the initial segment prefix-free complexity of $X$, then $X$ is computable from an algorithmically random $Y$ with oracle-use at most $g$. Alternatively (making no use of such a computable bound $g$) one can achieve an oracle-use bounded above by $K(Xupharpoonright_n)+log n$. This provides a strong analogue of Shannons source coding theorem for algorithmic information theory.
comments
Fetching comments Fetching comments
Sign in to be able to follow your search criteria
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا