No Arabic abstract
Deep latent-variable models learn representations of high-dimensional data in an unsupervised manner. A number of recent efforts have focused on learning representations that disentangle statistically independent axes of variation by introducing modifications to the standard objective function. These approaches generally assume a simple diagonal Gaussian prior and as a result are not able to reliably disentangle discrete factors of variation. We propose a two-level hierarchical objective to control relative degree of statistical independence between blocks of variables and individual variables within blocks. We derive this objective as a generalization of the evidence lower bound, which allows us to explicitly represent the trade-offs between mutual information between data and representation, KL divergence between representation and prior, and coverage of the support of the empirical data distribution. Experiments on a variety of datasets demonstrate that our objective can not only disentangle discrete variables, but that doing so also improves disentanglement of other variables and, importantly, generalization even to unseen combinations of factors.
We present a deep generative model that learns disentangled static and dynamic representations of data from unordered input. Our approach exploits regularities in sequential data that exist regardless of the order in which the data is viewed. The result of our factorized graphical model is a well-organized and coherent latent space for data dynamics. We demonstrate our method on several synthetic dynamic datasets and real video data featuring various facial expressions and head poses.
In the problem of learning disentangled representations, one of the promising methods is to factorize aggregated posterior by penalizing the total correlation of sampled latent variables. However, this well-motivated strategy has a blind spot: there is a disparity between the sampled latent representation and its corresponding mean representation. In this paper, we provide a theoretical explanation that low total correlation of sampled representation cannot guarantee low total correlation of the mean representation. Indeed, we prove that for the multivariate normal distributions, the mean representation with arbitrarily high total correlation can have a corresponding sampled representation with bounded total correlation. We also propose a method to eliminate this disparity. Experiments show that our model can learn a mean representation with much lower total correlation, hence a factorized mean representation. Moreover, we offer a detailed explanation of the limitations of factorizing aggregated posterior: factor disintegration. Our work indicates a potential direction for future research of disentangled learning.
Unsupervised model transfer has the potential to greatly improve the generalizability of deep models to novel domains. Yet the current literature assumes that the separation of target data into distinct domains is known as a priori. In this paper, we propose the task of Domain-Agnostic Learning (DAL): How to transfer knowledge from a labeled source domain to unlabeled data from arbitrary target domains? To tackle this problem, we devise a novel Deep Adversarial Disentangled Autoencoder (DADA) capable of disentangling domain-specific features from class identity. We demonstrate experimentally that when the target domain labels are unknown, DADA leads to state-of-the-art performance on several image classification datasets.
Face images are subject to many different factors of variation, especially in unconstrained in-the-wild scenarios. For most tasks involving such images, e.g. expression recognition from video streams, having enough labeled data is prohibitively expensive. One common strategy to tackle such a problem is to learn disentangled representations for the different factors of variation of the observed data using adversarial learning. In this paper, we use a formulation of the adversarial loss to learn disentangled representations for face images. The used model facilitates learning on single-task datasets and improves the state-of-the-art in expression recognition with an accuracy of60.53%on the AffectNetdataset, without using any additional data.
Structured prediction tasks pose a fundamental trade-off between the need for model complexity to increase predictive power and the limited computational resources for inference in the exponentially-sized output spaces such models require. We formulate and develop the Structured Prediction Cascade architecture: a sequence of increasingly complex models that progressively filter the space of possible outputs. The key principle of our approach is that each model in the cascade is optimized to accurately filter and refine the structured output state space of the next model, speeding up both learning and inference in the next layer of the cascade. We learn cascades by optimizing a novel convex loss function that controls the trade-off between the filtering efficiency and the accuracy of the cascade, and provide generalization bounds for both accuracy and efficiency. We also extend our approach to intractable models using tree-decomposition ensembles, and provide algorithms and theory for this setting. We evaluate our approach on several large-scale problems, achieving state-of-the-art performance in handwriting recognition and human pose recognition. We find that structured prediction cascades allow tremendous speedups and the use of previously intractable features and models in both settings.