No Arabic abstract
We define and address the problem of unsupervised learning of disentangled representations on data generated from independent factors of variation. We propose FactorVAE, a method that disentangles by encouraging the distribution of representations to be factorial and hence independent across the dimensions. We show that it improves upon $beta$-VAE by providing a better trade-off between disentanglement and reconstruction quality. Moreover, we highlight the problems of a commonly used disentanglement metric and introduce a new metric that does not suffer from them.
We develop a generalisation of disentanglement in VAEs---decomposition of the latent representation---characterising it as the fulfilment of two factors: a) the latent encodings of the data having an appropriate level of overlap, and b) the aggregate encoding of the data conforming to a desired structure, represented through the prior. Decomposition permits disentanglement, i.e. explicit independence between latents, as a special case, but also allows for a much richer class of properties to be imposed on the learnt representation, such as sparsity, clustering, independent subspaces, or even intricate hierarchical dependency relationships. We show that the $beta$-VAE varies from the standard VAE predominantly in its control of latent overlap and that for the standard choice of an isotropic Gaussian prior, its objective is invariant to rotations of the latent representation. Viewed from the decomposition perspective, breaking this invariance with simple manipulations of the prior can yield better disentanglement with little or no detriment to reconstructions. We further demonstrate how other choices of prior can assist in producing different decompositions and introduce an alternative training objective that allows the control of both decomposition factors in a principled manner.
The vulnerability to slight input perturbations is a worrying yet intriguing property of deep neural networks (DNNs). Despite many previous works studying the reason behind such adversarial behavior, the relationship between the generalization performance and adversarial behavior of DNNs is still little understood. In this work, we reveal such relation by introducing a metric characterizing the generalization performance of a DNN. The metric can be disentangled into an information-theoretic non-robust component, responsible for adversarial behavior, and a robust component. Then, we show by experiments that current DNNs rely heavily on optimizing the non-robust component in achieving decent performance. We also demonstrate that current state-of-the-art adversarial training algorithms indeed try to robustify the DNNs by preventing them from using the non-robust component to distinguish samples from different categories. Also, based on our findings, we take a step forward and point out the possible direction for achieving decent standard performance and adversarial robustness simultaneously. We believe that our theory could further inspire the community to make more interesting discoveries about the relationship between standard generalization and adversarial generalization of deep learning models.
Learning representations that disentangle the underlying factors of variability in data is an intuitive way to achieve generalization in deep models. In this work, we address the scenario where generative factors present a multimodal distribution due to the existence of class distinction in the data. We propose N-VAE, a model which is capable of separating factors of variation which are exclusive to certain classes from factors that are shared among classes. This model implements an explicitly compositional latent variable structure by defining a class-conditioned latent space and a shared latent space. We show its usefulness for detecting and disentangling class-dependent generative factors as well as its capacity to generate artificial samples which contain characteristics unseen in the training data.
We introduce a new general identifiable framework for principled disentanglement referred to as Structured Nonlinear Independent Component Analysis (SNICA). Our contribution is to extend the identifiability theory of deep generative models for a very broad class of structured models. While previous works have shown identifiability for specific classes of time-series models, our theorems extend this to more general temporal structures as well as to models with more complex structures such as spatial dependencies. In particular, we establish the major result that identifiability for this framework holds even in the presence of noise of unknown distribution. The SNICA setting therefore subsumes all the existing nonlinear ICA models for time-series and also allows for new much richer identifiable models. Finally, as an example of our frameworks flexibility, we introduce the first nonlinear ICA model for time-series that combines the following very useful properties: it accounts for both nonstationarity and autocorrelation in a fully unsupervised setting; performs dimensionality reduction; models hidden states; and enables principled estimation and inference by variational maximum-likelihood.
In this thesis, we disentangle the generalized Gauss-Newton and approximate inference for Bayesian deep learning. The generalized Gauss-Newton method is an optimization method that is used in several popular Bayesian deep learning algorithms. Algorithms that combine the Gauss-Newton method with the Laplace and Gaussian variational approximation have recently led to state-of-the-art results in Bayesian deep learning. While the Laplace and Gaussian variational approximation have been studied extensively, their interplay with the Gauss-Newton method remains unclear. Recent criticism of priors and posterior approximations in Bayesian deep learning further urges the need for a deeper understanding of practical algorithms. The individual analysis of the Gauss-Newton method and Laplace and Gaussian variational approximations for neural networks provides both theoretical insight and new practical algorithms. We find that the Gauss-Newton method simplifies the underlying probabilistic model significantly. In particular, the combination of the Gauss-Newton method with approximate inference can be cast as inference in a linear or Gaussian process model. The Laplace and Gaussian variational approximation can subsequently provide a posterior approximation to these simplified models. This new disentangled understanding of recent Bayesian deep learning algorithms also leads to new methods: first, the connection to Gaussian processes enables new function-space inference algorithms. Second, we present a marginal likelihood approximation of the underlying probabilistic model to tune neural network hyperparameters. Finally, the identified underlying models lead to different methods to compute predictive distributions. In fact, we find that these prediction methods for Bayesian neural networks often work better than the default choice and solve a common issue with the Laplace approximation.