Do you want to publish a course? Click here

Optimizing Variational Representations of Divergences and Accelerating their Statistical Estimation

66   0   0.0 ( 0 )
 Added by Jeremiah Birrell
 Publication date 2020
and research's language is English




Ask ChatGPT about the research

Variational representations of divergences and distances between high-dimensional probability distributions offer significant theoretical insights and practical advantages in numerous research areas. Recently, they have gained popularity in machine learning as a tractable and scalable approach for training probabilistic models and for statistically differentiating between data distributions. Their advantages include: 1) They can be estimated from data as statistical averages. 2) Such representations can leverage the ability of neural networks to efficiently approximate optimal solutions in function spaces. However, a systematic and practical approach to improving tightness of such variational formulas, and accordingly accelerate statistical learning and estimation from data, is lacking. Here we develop such a methodology for building new, tighter variational representations of divergences. Our approach relies on improved objective functionals constructed via an auxiliary optimization problem. Furthermore, the calculation of the functional Hessian of objective functionals unveils local curvature differences around the common optimal variational solution; this quantifies and orders the tightness gains between different variational representations. Finally, numerical simulations utilizing neural-network optimization demonstrate that tighter representations can result in significantly faster learning and more accurate estimation of divergences in both synthetic and real datasets (of more than 1000 dimensions), often accelerated by nearly an order of magnitude.



rate research

Read More

We derive a new variational formula for the Renyi family of divergences, $R_alpha(Q|P)$, between probability measures $Q$ and $P$. Our result generalizes the classical Donsker-Varadhan variational formula for the Kullback-Leibler divergence. We further show that this Renyi variational formula holds over a range of function spaces; this leads to a formula for the optimizer under very weak assumptions and is also key in our development of a consistency theory for Renyi divergence estimators. By applying this theory to neural-network estimators, we show that if a neural network family satisfies one of several strengthen
The Bregman divergence (Bregman distance, Bregman measure of distance) is a certain useful substitute for a distance, obtained from a well-chosen function (the Bregman function). Bregman functions and divergences have been extensively investigated during the last decades and have found applications in optimization, operations research, information theory, nonlinear analysis, machine learning and more. This paper re-examines various aspects related to the theory of Bregman functions and divergences. In particular, it presents many sufficient conditions which allow the construction of Bregman functions in a general setting and introduces new Bregman functions (such as a negative iterated log entropy). Moreover, it sheds new light on several known Bregman functions such as quadratic entropies, the negative Havrda-Charvat-Tsallis entropy, and the negative Boltzmann-Gibbs-Shannon entropy, and it shows that the negative Burg entropy, which is not a Bregman function according to the classical theory but nevertheless is known to have Bregmanian properties, can, by our re-examination of the theory, be considered as a Bregman function. Our analysis yields several by-products of independent interest such as the introduction of the concept of relative uniform convexity (a certain generalization of uniform convexity), new properties of uniformly and strongly convex functions, and results in Banach space theory.
We present a unified technique for sequential estimation of convex divergences between distributions, including integral probability metrics like the kernel maximum mean discrepancy, $varphi$-divergences like the Kullback-Leibler divergence, and optimal transport costs, such as powers of Wasserstein distances. The technical underpinnings of our approach lie in the observation that empirical convex divergences are (partially ordered) reverse submartingales with respect to the exchangeable filtration, coupled with maximal inequalities for such processes. These techniques appear to be powerful additions to the existing literature on both confidence sequences and convex divergences. We construct an offline-to-sequential device that converts a wide array of existing offline concentration inequalities into time-uniform confidence sequences that can be continuously monitored, providing valid inference at arbitrary stopping times. The resulting sequential bounds pay only an iterated logarithmic price over the corresponding fixed-time bounds, retaining the same dependence on problem parameters (like dimension or alphabet size if applicable).
We show that the variational representations for f-divergences currently used in the literature can be tightened. This has implications to a number of methods recently proposed based on this representation. As an example application we use our tighter representation to derive a general f-divergence estimator based on two i.i.d. samples and derive the dual program for this estimator that performs well empirically. We also point out a connection between our estimator and MMD.
The estimation of an f-divergence between two probability distributions based on samples is a fundamental problem in statistics and machine learning. Most works study this problem under very weak assumptions, in which case it is provably hard. We consider the case of stronger structural assumptions that are commonly satisfied in modern machine learning, including representation learning and generative modelling with autoencoder architectures. Under these assumptions we propose and study an estimator that can be easily implemented, works well in high dimensions, and enjoys faster rates of convergence. We verify the behavior of our estimator empirically in both synthetic and real-data experiments, and discuss its direct implications for total correlation, entropy, and mutual information estimation.

suggested questions

comments
Fetching comments Fetching comments
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا