Towards an Interpretable Latent Space in Structured Models for Video Prediction

85 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Vishal Sharma

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Rushil Gupta - Vishal Sharma - Yash Jain

التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We focus on the task of future frame prediction in video governed by underlying physical dynamics. We work with models which are object-centric, i.e., explicitly work with object representations, and propagate a loss in the latent space. Specifically, our research builds on recent work by Kipf et al. cite{kipf&al20}, which predicts the next state via contrastive learning of object interactions in a latent space using a Graph Neural Network. We argue that injecting explicit inductive bias in the model, in form of general physical laws, can help not only make the model more interpretable, but also improve the overall prediction of model. As a natural by-product, our model can learn feature maps which closely resemble actual object positions in the image, without having any explicit supervision about the object positions at the training time. In comparison with earlier works cite{jaques&al20}, which assume a complete knowledge of the dynamics governing the motion in the form of a physics engine, we rely only on the knowledge of general physical laws, such as, world consists of objects, which have position and velocity. We propose an additional decoder based loss in the pixel space, imposed in a curriculum manner, to further refine the latent space predictions. Experiments in multiple different settings demonstrate that while Kipf et al. model is effective at capturing object interactions, our model can be significantly more effective at localising objects, resulting in improved performance in 3 out of 4 domains that we experiment with. Additionally, our model can learn highly intrepretable feature maps, resembling actual object positions.

قيم البحث

80 - Hanjun Dai , Bo Dai , Le Song 2016

Kernel classifiers and regressors designed for structured data, such as sequences, trees and graphs, have significantly advanced a number of interdisciplinary areas such as computational biology and drug design. Typically, kernels are designed before hand for a data type which either exploit statistics of the structures or make use of probabilistic generative models, and then a discriminative classifier is learned based on the kernels via convex optimization. However, such an elegant two-stage approach also limited kernel methods from scaling up to millions of data points, and exploiting discriminative information to learn feature representations. We propose, structure2vec, an effective and scalable approach for structured data representation based on the idea of embedding latent variable models into feature spaces, and learning such feature spaces using discriminative information. Interestingly, structure2vec extracts features by performing a sequence of function mappings in a way similar to graphical model inference procedures, such as mean field and belief propagation. In applications involving millions of data points, we showed that structure2vec runs 2 times faster, produces models which are $10,000$ times smaller, while at the same time achieving the state-of-the-art predictive performance.

التعلم الآلي

Structured Object-Aware Physics Prediction for Video Modeling and Planning

101 - Jannik Kossen , Karl Stelzner , Marcel Hussing 2019

When humans observe a physical system, they can easily locate objects, understand their interactions, and anticipate future behavior, even in settings with complicated and previously unseen interactions. For computers, however, learning such models f rom videos in an unsupervised fashion is an unsolved research problem. In this paper, we present STOVE, a novel state-space model for videos, which explicitly reasons about objects and their positions, velocities, and interactions. It is constructed by combining an image model and a dynamics model in compositional manner and improves on previous work by reusing the dynamics model for inference, accelerating and regularizing training. STOVE predicts videos with convincing physical behavior over hundreds of timesteps, outperforms previous unsupervised models, and even approaches the performance of supervised baselines. We further demonstrate the strength of our model as a simulator for sample efficient model-based control in a task with heavily interacting objects.

التعلم الآلي الرؤية الحاسوبية وتمييز الأنماط التعلم الالي

Latent Space Optimal Transport for Generative Models

88 - Huidong Liu , Yang Guo , Na Lei 2018

Variational Auto-Encoders enforce their learned intermediate latent-space data distribution to be a simple distribution, such as an isotropic Gaussian. However, this causes the posterior collapse problem and loses manifold structure which can be impo rtant for datasets such as facial images. A GAN can transform a simple distribution to a latent-space data distribution and thus preserve the manifold structure, but optimizing a GAN involves solving a Min-Max optimization problem, which is difficult and not well understood so far. Therefore, we propose a GAN-like method to transform a simple distribution to a data distribution in the latent space by solving only a minimization problem. This minimization problem comes from training a discriminator between a simple distribution and a latent-space data distribution. Then, we can explicitly formulate an Optimal Transport (OT) problem that computes the desired mapping between the two distributions. This means that we can transform a distribution without solving the difficult Min-Max optimization problem. Experimental results on an eight-Gaussian dataset show that the proposed OT can handle multi-cluster distributions. Results on the MNIST and the CelebA datasets validate the effectiveness of the proposed method.

التعلم الآلي التعلم الالي

Approximated Structured Prediction for Learning Large Scale Graphical Models

279 - Tamir Hazan , Raquel Urtasun 2010

This manuscripts contains the proofs for A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction.

التعلم الآلي الذكاء الاصطناعي

Fast Compressive Sensing Recovery Using Generative Models with Structured Latent Variables

77 - Shaojie Xu , Sihan Zeng , Justin Romberg 2019

Deep learning models have significantly improved the visual quality and accuracy on compressive sensing recovery. In this paper, we propose an algorithm for signal reconstruction from compressed measurements with image priors captured by a generative model. We search and constrain on latent variable space to make the method stable when the number of compressed measurements is extremely limited. We show that, by exploiting certain structures of the latent variables, the proposed method produces improved reconstruction accuracy and preserves realistic and non-smooth features in the image. Our algorithm achieves high computation speed by projecting between the original signal space and the latent variable space in an alternating fashion.

التعلم الآلي الرؤية الحاسوبية وتمييز الأنماط التعلم الالي