No MCMC for me: Amortized sampling for fast and stable training of energy-based models

72 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Will Grathwohl

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Will Grathwohl - Jacob Kelly - Milad Hashemi

التعلم الآلي الذكاء الاصطناعي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Energy-Based Models (EBMs) present a flexible and appealing way to represent uncertainty. Despite recent advances, training EBMs on high-dimensional data remains a challenging problem as the state-of-the-art approaches are costly, unstable, and require considerable tuning and domain expertise to apply successfully. In this work, we present a simple method for training EBMs at scale which uses an entropy-regularized generator to amortize the MCMC sampling typically used in EBM training. We improve upon prior MCMC-based entropy regularization methods with a fast variational approximation. We demonstrate the effectiveness of our approach by using it to train tractable likelihood models. Next, we apply our estimator to the recently proposed Joint Energy Model (JEM), where we match the original performance with faster and stable training. This allows us to extend JEM models to semi-supervised classification on tabular data from a variety of continuous domains.

قيم البحث

138 - Sahil Verma , Keegan Hines , John P. Dickerson 2021

Explainable machine learning (ML) has gained traction in recent years due to the increasing adoption of ML-based systems in many sectors. Counterfactual explanations (CFEs) provide ``what if feedback of the form ``if an input datapoint were $x$ inste ad of $x$, then an ML-based systems output would be $y$ instead of $y$. CFEs are attractive due to their actionable feedback, amenability to existing legal frameworks, and fidelity to the underlying ML model. Yet, current CFE approaches are single shot -- that is, they assume $x$ can change to $x$ in a single time period. We propose a novel stochastic-control-based approach that generates sequential CFEs, that is, CFEs that allow $x$ to move stochastically and sequentially across intermediate states to a final state $x$. Our approach is model agnostic and black box. Furthermore, calculation of CFEs is amortized such that once trained, it applies to multiple datapoints without the need for re-optimization. In addition to these primary characteristics, our approach admits optional desiderata such as adherence to the data manifold, respect for causal relations, and sparsity -- identified by past research as desirable properties of CFEs. We evaluate our approach using three real-world datasets and show successful generation of sequential CFEs that respect other counterfactual desiderata.

التعلم الآلي الذكاء الاصطناعي

Learning the Stein Discrepancy for Training and Evaluating Energy-Based Models without Sampling

115 - Will Grathwohl , Kuan-Chieh Wang , Jorn-Henrik Jacobsen 2020

We present a new method for evaluating and training unnormalized density models. Our approach only requires access to the gradient of the unnormalized models log-density. We estimate the Stein discrepancy between the data density $p(x)$ and the model density $q(x)$ defined by a vector function of the data. We parameterize this function with a neural network and fit its parameters to maximize the discrepancy. This yields a novel goodness-of-fit test which outperforms existing methods on high dimensional data. Furthermore, optimizing $q(x)$ to minimize this discrepancy produces a novel method for training unnormalized models which scales more gracefully than existing methods. The ability to both learn and compare models is a unique feature of the proposed method.

التعلم الالي التعلم الآلي

Improved Contrastive Divergence Training of Energy Based Models

104 - Yilun Du , Shuang Li , Joshua Tenenbaum 2020

Contrastive divergence is a popular method of training energy-based models, but is known to have difficulties with training stability. We propose an adaptation to improve contrastive divergence training by scrutinizing a gradient term that is difficu lt to calculate and is often left out for convenience. We show that this gradient term is numerically significant and in practice is important to avoid training instabilities, while being tractable to estimate. We further highlight how data augmentation and multi-scale processing can be used to improve model robustness and generation quality. Finally, we empirically evaluate stability of model architectures and show improved performance on a host of benchmarks and use cases,such as image generation, OOD detection, and compositional generation.

التعلم الآلي

Adversarial Stein Training for Graph Energy Models

292 - Shiv Shankar 2021

Learning distributions over graph-structured data is a challenging task with many applications in biology and chemistry. In this work we use an energy-based model (EBM) based on multi-channel graph neural networks (GNN) to learn permutation invariant unnormalized density functions on graphs. Unlike standard EBM training methods our approach is to learn the model via minimizing adversarial stein discrepancy. Samples from the model can be obtained via Langevin dynamics based MCMC. We find that this approach achieves competitive results on graph generation compared to benchmark models.

التعلم الآلي

Fast MCMC sampling algorithms on polytopes

213 - Yuansi Chen , Raaz Dwivedi , Martin J. Wainwright 2017

We propose and analyze two new MCMC sampling algorithms, the Vaidya walk and the John walk, for generating samples from the uniform distribution over a polytope. Both random walks are sampling algorithms derived from interior point methods. The forme r is based on volumetric-logarithmic barrier introduced by Vaidya whereas the latter uses Johns ellipsoids. We show that the Vaidya walk mixes in significantly fewer steps than the logarithmic-barrier based Dikin walk studied in past work. For a polytope in $mathbb{R}^d$ defined by $n >d$ linear constraints, we show that the mixing time from a warm start is bounded as $mathcal{O}(n^{0.5}d^{1.5})$, compared to the $mathcal{O}(nd)$ mixing time bound for the Dikin walk. The cost of each step of the Vaidya walk is of the same order as the Dikin walk, and at most twice as large in terms of constant pre-factors. For the John walk, we prove an $mathcal{O}(d^{2.5}cdotlog^4(n/d))$ bound on its mixing time and conjecture that an improved variant of it could achieve a mixing time of $mathcal{O}(d^2cdottext{polylog}(n/d))$. Additionally, we propose variants of the Vaidya and John walks that mix in polynomial time from a deterministic starting point. The speed-up of the Vaidya walk over the Dikin walk are illustrated in numerical examples.

التعلم الالي