No Arabic abstract
We propose an approach for improving sequence modeling based on autoregressive normalizing flows. Each autoregressive transform, acting across time, serves as a moving frame of reference, removing temporal correlations, and simplifying the modeling of higher-level dynamics. This technique provides a simple, general-purpose method for improving sequence modeling, with connections to existing and classical techniques. We demonstrate the proposed approach both with standalone flow-based models and as a component within sequential latent variable models. Results are presented on three benchmark video datasets, where autoregressive flow-based dynamics improve log-likelihood performance over baseline models. Finally, we illustrate the decorrelation and improved generalization properties of using flow-based dynamics.
The recurrent neural networks (RNN) with richly distributed internal states and flexible non-linear transition functions, have overtaken the dynamic Bayesian networks such as the hidden Markov models (HMMs) in the task of modeling highly structured sequential data. These data, such as from speech and handwriting, often contain complex relationships between the underlaying variational factors and the observed data. The standard RNN model has very limited randomness or variability in its structure, coming from the output conditional probability model. This paper will present different ways of using high level latent random variables in RNN to model the variability in the sequential data, and the training method of such RNN model under the VAE (Variational Autoencoder) principle. We will explore possible ways of using adversarial method to train a variational RNN model. Contrary to competing approaches, our approach has theoretical optimum in the model training and provides better model training stability. Our approach also improves the posterior approximation in the variational inference network by a separated adversarial training step. Numerical results simulated from TIMIT speech data show that reconstruction loss and evidence lower bound converge to the same level and adversarial training loss converges to 0.
Amortised inference enables scalable learning of sequential latent-variable models (LVMs) with the evidence lower bound (ELBO). In this setting, variational posteriors are often only partially conditioned. While the true posteriors depend, e.g., on the entire sequence of observations, approximate posteriors are only informed by past observations. This mimics the Bayesian filter -- a mixture of smoothing posteriors. Yet, we show that the ELBO objective forces partially-conditioned amortised posteriors to approximate products of smoothing posteriors instead. Consequently, the learned generative model is compromised. We demonstrate these theoretical findings in three scenarios: traffic flow, handwritten digits, and aerial vehicle dynamics. Using fully-conditioned approximate posteriors, performance improves in terms of generative modelling and multi-step prediction.
Robust multi-agent trajectory prediction is essential for the safe control of robots and vehicles that interact with humans. Many existing methods treat social and temporal information separately and therefore fall short of modelling the joint future trajectories of all agents in a socially consistent way. To address this, we propose a new class of Latent Variable Sequential Set Transformers which autoregressively model multi-agent trajectories. We refer to these architectures as AutoBots. AutoBots model the contents of sets (e.g. representing the properties of agents in a scene) over time and employ multi-head self-attention blocks over these sequences of sets to encode the sociotemporal relationships between the different actors of a scene. This produces either the trajectory of one ego-agent or a distribution over the future trajectories for all agents under consideration. Our approach works for general sequences of sets and we provide illustrative experiments modelling the sequential structure of the multiple strokes that make up symbols in the Omniglot data. For the single-agent prediction case, we validate our model on the NuScenes motion prediction task and achieve competitive results on the global leaderboard. In the multi-agent forecasting setting, we validate our model on TrajNet. We find that our method outperforms physical extrapolation and recurrent network baselines and generates scene-consistent trajectories.
Kernel classifiers and regressors designed for structured data, such as sequences, trees and graphs, have significantly advanced a number of interdisciplinary areas such as computational biology and drug design. Typically, kernels are designed beforehand for a data type which either exploit statistics of the structures or make use of probabilistic generative models, and then a discriminative classifier is learned based on the kernels via convex optimization. However, such an elegant two-stage approach also limited kernel methods from scaling up to millions of data points, and exploiting discriminative information to learn feature representations. We propose, structure2vec, an effective and scalable approach for structured data representation based on the idea of embedding latent variable models into feature spaces, and learning such feature spaces using discriminative information. Interestingly, structure2vec extracts features by performing a sequence of function mappings in a way similar to graphical model inference procedures, such as mean field and belief propagation. In applications involving millions of data points, we showed that structure2vec runs 2 times faster, produces models which are $10,000$ times smaller, while at the same time achieving the state-of-the-art predictive performance.
We present a graph neural network model for solving graph-to-graph learning problems. Most deep learning on graphs considers ``simple problems such as graph classification or regressing real-valued graph properties. For such tasks, the main requirement for intermediate representations of the data is to maintain the structure needed for output, i.e., keeping classes separated or maintaining the order indicated by the regressor. However, a number of learning tasks, such as regressing graph-valued output, generative models, or graph autoencoders, aim to predict a graph-structured output. In order to successfully do this, the learned representations need to preserve far more structure. We present a conditional auto-regressive model for graph-to-graph learning and illustrate its representational capabilities via experiments on challenging subgraph predictions from graph algorithmics; as a graph autoencoder for reconstruction and visualization; and on pretraining representations that allow graph classification with limited labeled data.