
GAIT: A Geometric Approach to Information Theory

Published by Jose Daniel Gallego Posada
Publication date: 2019
Paper language: English


We advocate the use of a notion of entropy that reflects the relative abundances of the symbols in an alphabet, as well as the similarities between them. This concept was originally introduced in theoretical ecology to study the diversity of ecosystems. Based on this notion of entropy, we introduce geometry-aware counterparts for several concepts and theorems in information theory. Notably, our proposed divergence exhibits performance on par with state-of-the-art methods based on the Wasserstein distance, but enjoys a closed-form expression that can be computed efficiently. We demonstrate the versatility of our method via experiments on a broad range of domains: training generative models, computing image barycenters, approximating empirical measures and counting modes.
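
As a rough illustration of an entropy that accounts for both abundances and similarities, the sketch below computes the order-1 similarity-sensitive entropy known from the ecology literature, $H^{K}(p) = -\sum_i p_i \log (Kp)_i$. Whether this matches the exact definition used in the paper is an assumption; the function name `similarity_entropy` and the matrix name `K` are illustrative choices, and `K` is assumed to be a similarity matrix with ones on the diagonal and entries in $[0, 1]$.

```python
import numpy as np

def similarity_entropy(p, K):
    """Order-1 similarity-sensitive entropy: -sum_i p_i * log((K p)_i).

    p : probability vector over the alphabet (non-negative, sums to 1).
    K : similarity matrix, assumed to have K[i, i] == 1 and entries in [0, 1].
    """
    p = np.asarray(p, dtype=float)
    K = np.asarray(K, dtype=float)
    Kp = K @ p                 # "ordinariness" of each symbol under K
    mask = p > 0               # symbols with zero mass contribute nothing
    return float(-np.sum(p[mask] * np.log(Kp[mask])))

# Usage: a uniform distribution over four symbols.
p = np.full(4, 0.25)
h_distinct = similarity_entropy(p, np.eye(4))        # no symbols are similar
h_identical = similarity_entropy(p, np.ones((4, 4)))  # all symbols are identical
```

With the identity matrix the quantity reduces to Shannon entropy, and with an all-ones matrix (every symbol treated as the same) it drops to zero, which is the sense in which the similarity structure of the alphabet enters the entropy.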


Read also

Nowozin et al. showed last year how to extend the GAN principle to all $f$-divergences. The approach is elegant but falls short of a full description of the supervised game, and says little about the key player, the generator: for example, what does the generator actually converge to if solving the GAN game means convergence in some space of parameters? How does that provide hints on the generator's design and compare to the flourishing but almost exclusively experimental literature on the subject? In this paper, we unveil a broad class of distributions for which such convergence happens (namely, deformed exponential families, a wide superset of exponential families) and show tight connections with the three other key GAN parameters: loss, game and architecture. In particular, we show that current deep architectures are able to factorize a very large number of such densities using an especially compact design, hence displaying the power of deep architectures and their concinnity in the $f$-GAN game. This result holds given a sufficient condition on activation functions, which turns out to be satisfied by popular choices. The key to our results is a variational generalization of an old theorem that relates the KL divergence between regular exponential families and divergences between their natural parameters. We complete this picture with additional results and experimental insights on how these results may be used to ground further improvements of GAN architectures, via (i) a principled design of the activation functions in the generator and (ii) an explicit integration of proper composite losses' link function in the discriminator.
Feed-forward neural networks consist of a sequence of layers, in which each layer performs some processing on the information from the previous layer. A downside to this approach is that each layer (or module, as multiple modules can operate in parallel) is tasked with processing the entire hidden state, rather than a particular part of the state which is most relevant for that module. Methods which only operate on a small number of input variables are an essential part of most programming languages, and they allow for improved modularity and code re-usability. Our proposed method, Neural Function Modules (NFM), aims to introduce the same structural capability into deep learning. Most of the work in the context of feed-forward networks combining top-down and bottom-up feedback is limited to classification problems. The key contribution of our work is to combine attention, sparsity, top-down and bottom-up feedback, in a flexible algorithm which, as we show, improves the results in standard classification, out-of-domain generalization, generative modeling, and learning representations in the context of reinforcement learning.
We propose a new framework for reasoning about information in complex systems. Our foundation is based on a variational extension of Shannon's information theory that takes into account the modeling power and computational constraints of the observer. The resulting predictive $\mathcal{V}$-information encompasses mutual information and other notions of informativeness such as the coefficient of determination. Unlike Shannon's mutual information and in violation of the data processing inequality, $\mathcal{V}$-information can be created through computation. This is consistent with deep neural networks extracting hierarchies of progressively more informative features in representation learning. Additionally, we show that by incorporating computational constraints, $\mathcal{V}$-information can be reliably estimated from data even in high dimensions with PAC-style guarantees. Empirically, we demonstrate predictive $\mathcal{V}$-information is more effective than mutual information for structure learning and fair representation learning.
This paper prescribes a distance between learning tasks modeled as joint distributions on data and labels. Using tools in information geometry, the distance is defined to be the length of the shortest weight trajectory on a Riemannian manifold as a classifier is fitted on an interpolated task. The interpolated task evolves from the source to the target task using an optimal transport formulation. This distance, which we call the coupled transfer distance, can be compared across different classifier architectures. We develop an algorithm to compute the distance which iteratively transports the marginal on the data of the source task to that of the target task while updating the weights of the classifier to track this evolving data distribution. We develop theory to show that our distance captures the intuitive idea that a good transfer trajectory is the one that keeps the generalization gap small during transfer, in particular at the end on the target task. We perform thorough empirical validation and analysis across diverse image classification datasets to show that the coupled transfer distance correlates strongly with the difficulty of fine-tuning.
Weiran Wang (2019)
We adopt a multi-view approach for analyzing two knowledge transfer settings, learning using privileged information (LUPI) and distillation, in a common framework. Under reasonable assumptions about the complexities of hypothesis spaces, and being optimistic about the expected loss achievable by the student (in distillation) and a transformed teacher predictor (in LUPI), we show that encouraging agreement between the teacher and the student leads to a reduced search space. As a result, an improved convergence rate can be obtained with regularized empirical risk minimization.
