No Arabic abstract
Predicting future human motion plays a significant role in human-machine interactions for a variety of real-life applications. In this paper, we build a deep state-space model, DeepSSM, to predict future human motion. Specifically, we formulate the human motion system as the state-space model of a dynamic system and model the motion system by the state-space theory, offering a unified formulation for diverse human motion systems. Moreover, a novel deep network is designed to build this system, enabling us to utilize both the advantages of deep network and state-space model. The deep network jointly models the process of both the state-state transition and the state-observation transition of the human motion system, and multiple future poses can be generated via the state-observation transition of the model recursively. To improve the modeling ability of the system, a unique loss function, ATPL (Attention Temporal Prediction Loss), is introduced to optimize the model, encouraging the system to achieve more accurate predictions by paying increasing attention to the early time-steps. The experiments on two benchmark datasets (i.e., Human3.6M and 3DPW) confirm that our method achieves state-of-the-art performance with improved effectiveness. The code will be available if the paper is accepted.
Human motion prediction is a challenging and important task in many computer vision application domains. Existing work only implicitly models the spatial structure of the human skeleton. In this paper, we propose a novel approach that decomposes the prediction into individual joints by means of a structured prediction layer that explicitly models the joint dependencies. This is implemented via a hierarchy of small-sized neural networks connected analogously to the kinematic chains in the human body as well as a joint-wise decomposition in the loss function. The proposed layer is agnostic to the underlying network and can be used with existing architectures for motion modelling. Prior work typically leverages the H3.6M dataset. We show that some state-of-the-art techniques do not perform well when trained and tested on AMASS, a recently released dataset 14 times the size of H3.6M. Our experiments indicate that the proposed layer increases the performance of motion forecasting irrespective of the base network, joint-angle representation, and prediction horizon. We furthermore show that the layer also improves motion predictions qualitatively. We make code and models publicly available at https://ait.ethz.ch/projects/2019/spl.
In this paper, we propose a novel Transformer-based architecture for the task of generative modelling of 3D human motion. Previous works commonly rely on RNN-based models considering shorter forecast horizons reaching a stationary and often implausible state quickly. Instead, our focus lies on the generation of plausible future developments over longer time horizons. To mitigate the issue of convergence to a static pose, we propose a novel architecture that leverages the recently proposed self-attention concept. The task of 3D motion prediction is inherently spatio-temporal and thus the proposed model learns high dimensional embeddings for skeletal joints followed by a decoupled temporal and spatial self-attention mechanism. This allows the model to access past information directly and to capture spatio-temporal dependencies explicitly. We show empirically that this reduces error accumulation over time and allows for the generation of perceptually plausible motion sequences over long time horizons up to 20 seconds as well as accurate short-term predictions. Accompanying video available at https://youtu.be/yF0cdt2yCNE.
Human motion prediction from historical pose sequence is at the core of many applications in machine intelligence. However, in current state-of-the-art methods, the predicted future motion is confined within the same activity. One can neither generate predictions that differ from the current activity, nor manipulate the body parts to explore various future possibilities. Undoubtedly, this greatly limits the usefulness and applicability of motion prediction. In this paper, we propose a generalization of the human motion prediction task in which control parameters can be readily incorporated to adjust the forecasted motion. Our method is compelling in that it enables manipulable motion prediction across activity types and allows customization of the human movement in a variety of fine-grained ways. To this aim, a simple yet effective composite GAN structure, consisting of local GANs for different body parts and aggregated via a global GAN is presented. The local GANs game in lower dimensions, while the global GAN adjusts in high dimensional space to avoid mode collapse. Extensive experiments show that our method outperforms state-of-the-art. The codes are available at https://github.com/herolvkd/AM-GAN.
The task of predicting human motion is complicated by the natural heterogeneity and compositionality of actions, necessitating robustness to distributional shifts as far as out-of-distribution (OoD). Here we formulate a new OoD benchmark based on the Human3.6M and CMU motion capture datasets, and introduce a hybrid framework for hardening discriminative architectures to OoD failure by augmenting them with a generative model. When applied to current state-of-the-art discriminative models, we show that the proposed approach improves OoD robustness without sacrificing in-distribution performance, and can theoretically facilitate model interpretability. We suggest human motion predictors ought to be constructed with OoD challenges in mind, and provide an extensible general framework for hardening diverse discriminative architectures to extreme distributional shift. The code is available at https://github.com/bouracha/OoDMotion.
Human motion prediction, which aims at predicting future human skeletons given the past ones, is a typical sequence-to-sequence problem. Therefore, extensive efforts have been continued on exploring different RNN-based encoder-decoder architectures. However, by generating target poses conditioned on the previously generated ones, these models are prone to bringing issues such as error accumulation problem. In this paper, we argue that such issue is mainly caused by adopting autoregressive manner. Hence, a novel Non-auToregressive Model (NAT) is proposed with a complete non-autoregressive decoding scheme, as well as a context encoder and a positional encoding module. More specifically, the context encoder embeds the given poses from temporal and spatial perspectives. The frame decoder is responsible for predicting each future pose independently. The positional encoding module injects positional signal into the model to indicate temporal order. Moreover, a multitask training paradigm is presented for both low-level human skeleton prediction and high-level human action recognition, resulting in the convincing improvement for the prediction task. Our approach is evaluated on Human3.6M and CMU-Mocap benchmarks and outperforms state-of-the-art autoregressive methods.