Traditional sequential multi-object attention models rely on a recurrent mechanism to infer object relations. We propose a relational extension (R-SQAIR) of one such attention model (SQAIR) by endowing it with a module with strong relational inductive bias that computes in parallel pairwise interactions between inferred objects. Two recently proposed relational modules are studied on tasks of unsupervised learning from videos. We demonstrate gains over sequential relational mechanisms, also in terms of combinatorial generalization.
We present Sequential Attend, Infer, Repeat (SQAIR), an interpretable deep generative model for videos of moving objects. It can reliably discover and track objects throughout the sequence of frames, and can also generate future frames conditioning o
n the current frame, thereby simulating expected motion of objects. This is achieved by explicitly encoding object presence, locations and appearances in the latent variables of the model. SQAIR retains all strengths of its predecessor, Attend, Infer, Repeat (AIR, Eslami et. al., 2016), including learning in an unsupervised manner, and addresses its shortcomings. We use a moving multi-MNIST dataset to show limitations of AIR in detecting overlapping or partially occluded objects, and show how SQAIR overcomes them by leveraging temporal consistency of objects. Finally, we also apply SQAIR to real-world pedestrian CCTV data, where it learns to reliably detect, track and generate walking pedestrians with no supervision.
Memory-based neural networks model temporal data by leveraging an ability to remember information for long periods. It is unclear, however, whether they also have an ability to perform complex relational reasoning with the information they remember.
Here, we first confirm our intuitions that standard memory architectures may struggle at tasks that heavily involve an understanding of the ways in which entities are connected -- i.e., tasks involving relational reasoning. We then improve upon these deficits by using a new memory module -- a textit{Relational Memory Core} (RMC) -- which employs multi-head dot product attention to allow memories to interact. Finally, we test the RMC on a suite of tasks that may profit from more capable relational reasoning across sequential information, and show large gains in RL domains (e.g. Mini PacMan), program evaluation, and language modeling, achieving state-of-the-art results on the WikiText-103, Project Gutenberg, and GigaWord datasets.
Bi-linear feature learning models, like the gated autoencoder, were proposed as a way to model relationships between frames in a video. By minimizing reconstruction error of one frame, given the previous frame, these models learn mapping units that e
ncode the transformations inherent in a sequence, and thereby learn to encode motion. In this work we extend bi-linear models by introducing higher-order mapping units that allow us to encode transformations between frames and transformations between transformations. We show that this makes it possible to encode temporal structure that is more complex and longer-range than the structure captured within standard bi-linear models. We also show that a natural way to train the model is by replacing the commonly used reconstruction objective with a prediction objective which forces the model to correctly predict the evolution of the input multiple steps into the future. Learning can be achieved by back-propagating the multi-step prediction through time. We test the model on various temporal prediction tasks, and show that higher-order mappings and predictive training both yield a significant improvement over bi-linear models in terms of prediction accuracy.
With a view to bridging the gap between deep learning and symbolic AI, we present a novel end-to-end neural network architecture that learns to form propositional representations with an explicitly relational structure from raw pixel data. In order t
o evaluate and analyse the architecture, we introduce a family of simple visual relational reasoning tasks of varying complexity. We show that the proposed architecture, when pre-trained on a curriculum of such tasks, learns to generate reusable representations that better facilitate subsequent learning on previously unseen tasks when compared to a number of baseline architectures. The workings of a successfully trained model are visualised to shed some light on how the architecture functions.
User attributes, such as gender and education, face severe incompleteness in social networks. In order to make this kind of valuable data usable for downstream tasks like user profiling and personalized recommendation, attribute inference aims to inf
er users missing attribute labels based on observed data. Recently, variational autoencoder (VAE), an end-to-end deep generative model, has shown promising performance by handling the problem in a semi-supervised way. However, VAEs can easily suffer from over-fitting and over-smoothing when applied to attribute inference. To be specific, VAE implemented with multi-layer perceptron (MLP) can only reconstruct input data but fail in inferring missing parts. While using the trending graph neural networks (GNNs) as encoder has the problem that GNNs aggregate redundant information from neighborhood and generate indistinguishable user representations, which is known as over-smoothing. In this paper, we propose an attribute textbf{Infer}ence model based on textbf{A}dversarial textbf{VAE} (Infer-AVAE) to cope with these issues. Specifically, to overcome over-smoothing, Infer-AVAE unifies MLP and GNNs in encoder to learn positive and negative latent representations respectively. Meanwhile, an adversarial network is trained to distinguish the two representations and GNNs are trained to aggregate less noise for more robust representations through adversarial training. Finally, to relieve over-fitting, mutual information constraint is introduced as a regularizer for decoder, so that it can make better use of auxiliary information in representations and generate outputs not limited by observations. We evaluate our model on 4 real-world social network datasets, experimental results demonstrate that our model averagely outperforms baselines by 7.0$%$ in accuracy.