3D Human Pose Estimation with Spatial and Temporal Transformers

122 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Chen Chen

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Ce Zheng - Sijie Zhu - Matias Mendieta

الرؤية الحاسوبية وتمييز الأنماط الذكاء الاصطناعي تفاعل الإنسان والحاسوب

قم بزيارة صفحتنا على فيسبوك

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Transformer architectures have become the model of choice in natural language processing and are now being introduced into computer vision tasks such as image classification, object detection, and semantic segmentation. However, in the field of human pose estimation, convolutional architectures still remain dominant. In this work, we present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos without convolutional architectures involved. Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure to comprehensively model the human joint relations within each frame as well as the temporal correlations across frames, then output an accurate 3D human pose of the center frame. We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets: Human3.6M and MPI-INF-3DHP. Extensive experiments show that PoseFormer achieves state-of-the-art performance on both datasets. Code is available at url{https://github.com/zczcwh/PoseFormer}

قيم البحث

140 - Wenhao Li , Hong Liu , Runwei Ding 2021

Despite great progress in 3D human pose estimation from videos, it is still an open problem to take full advantage of redundant 2D pose sequences to learn representative representation for generating one single 3D pose. To this end, we propose an imp roved Transformer-based architecture, called Strided Transformer, for 3D human pose estimation in videos to lift a sequence of 2D joint locations to a 3D pose. Specifically, a vanilla Transformer encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. To reduce redundancy of the sequence and aggregate information from local context, strided convolutions are incorporated into VTE to progressively reduce the sequence length. The modified VTE is termed as strided Transformer encoder (STE) which is built upon the outputs of VTE. STE not only effectively aggregates long-range information to a single-vector representation in a hierarchical global and local fashion but also significantly reduces the computation cost. Furthermore, a full-to-single supervision scheme is designed at both the full sequence scale and single target frame scale, applied to the outputs of VTE and STE, respectively. This scheme imposes extra temporal smoothness constraints in conjunction with the single target frame supervision and improves the representation ability of features for the target frame. The proposed architecture is evaluated on two challenging benchmark datasets, Human3.6M and HumanEva-I, and achieves state-of-the-art results with much fewer parameters.

الرؤية الحاسوبية وتمييز الأنماط

3D Human Texture Estimation from a Single Image with Transformers

122 - Xiangyu Xu , Chen Change Loy 2021

We propose a Transformer-based framework for 3D human texture estimation from a single image. The proposed Transformer is able to effectively exploit the global information of the input image, overcoming the limitations of existing methods that are s olely based on convolutional neural networks. In addition, we also propose a mask-fusion strategy to combine the advantages of the RGB-based and texture-flow-based models. We further introduce a part-style loss to help reconstruct high-fidelity colors without introducing unpleasant artifacts. Extensive experiments demonstrate the effectiveness of the proposed method against state-of-the-art 3D human texture estimation approaches both quantitatively and qualitatively.

الرؤية الحاسوبية وتمييز الأنماط الذكاء الاصطناعي الرسم الحاسوبي

Human Pose Estimation with Spatial Contextual Information

360 - Hong Zhang , Hao Ouyang , Shu Liu 2019

We explore the importance of spatial contextual information in human pose estimation. Most state-of-the-art pose networks are trained in a multi-stage manner and produce several auxiliary predictions for deep supervision. With this principle, we pres ent two conceptually simple and yet computational efficient modules, namely Cascade Prediction Fusion (CPF) and Pose Graph Neural Network (PGNN), to exploit underlying contextual information. Cascade prediction fusion accumulates prediction maps from previous stages to extract informative signals. The resulting maps also function as a prior to guide prediction at following stages. To promote spatial correlation among joints, our PGNN learns a structured representation of human pose as a graph. Direct message passing between different joints is enabled and spatial relation is captured. These two modules require very limited computational complexity. Experimental results demonstrate that our method consistently outperforms previous methods on MPII and LSP benchmark.

الرؤية الحاسوبية وتمييز الأنماط

3D Human Pose Estimation with Relational Networks

89 - Sungheon Park , Nojun Kwak 2018

In this paper, we propose a novel 3D human pose estimation algorithm from a single image based on neural networks. We adopted the structure of the relational networks in order to capture the relations among different body parts. In our method, each p air of different body parts generates features, and the average of the features from all the pairs are used for 3D pose estimation. In addition, we propose a dropout method that can be used in relational modules, which inherently imposes robustness to the occlusions. The proposed network achieves state-of-the-art performance for 3D pose estimation in Human 3.6M dataset, and it effectively produces plausible results even in the existence of missing joints.

الرؤية الحاسوبية وتمييز الأنماط

Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation

93 - Helge Rhodin , Mathieu Salzmann , Pascal Fua 2018

Modern 3D human pose estimation techniques rely on deep networks, which require large amounts of training data. While weakly-supervised methods require less supervision, by utilizing 2D poses or multi-view imagery without annotations, they still need a sufficiently large set of samples with 3D annotations for learning to succeed. In this paper, we propose to overcome this problem by learning a geometry-aware body representation from multi-view images without annotations. To this end, we use an encoder-decoder that predicts an image from one viewpoint given an image from another viewpoint. Because this representation encodes 3D geometry, using it in a semi-supervised setting makes it easier to learn a mapping from it to 3D human pose. As evidenced by our experiments, our approach significantly outperforms fully-supervised methods given the same amount of labeled data, and improves over other semi-supervised methods while using as little as 1% of the labeled data.

الرؤية الحاسوبية وتمييز الأنماط الذكاء الاصطناعي