ترغب بنشر مسار تعليمي؟ اضغط هنا

MVHM: A Large-Scale Multi-View Hand Mesh Benchmark for Accurate 3D Hand Pose Estimation

101   0   0.0 ( 0 )
 نشر من قبل Liangjian Chen
 تاريخ النشر 2020
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Estimating 3D hand poses from a single RGB image is challenging because depth ambiguity leads the problem ill-posed. Training hand pose estimators with 3D hand mesh annotations and multi-view images often results in significant performance gains. However, existing multi-view datasets are relatively small with hand joints annotated by off-the-shelf trackers or automated through model predictions, both of which may be inaccurate and can introduce biases. Collecting a large-scale multi-view 3D hand pose images with accurate mesh and joint annotations is valuable but strenuous. In this paper, we design a spin match algorithm that enables a rigid mesh model matching with any target mesh ground truth. Based on the match algorithm, we propose an efficient pipeline to generate a large-scale multi-view hand mesh (MVHM) dataset with accurate 3D hand mesh and joint labels. We further present a multi-view hand pose estimation approach to verify that training a hand pose estimator with our generated dataset greatly enhances the performance. Experimental results show that our approach achieves the performance of 0.990 in $text{AUC}_{text{20-50}}$ on the MHP dataset compared to the previous state-of-the-art of 0.939 on this dataset. Our datasset is public available. footnote{url{https://github.com/Kuzphi/MVHM}} Our datasset is available at~href{https://github.com/Kuzphi/MVHM}{color{blue}{https://github.com/Kuzphi/MVHM}}.



قيم البحث

اقرأ أيضاً

Accurate hand pose estimation at joint level has several uses on human-robot interaction, user interfacing and virtual reality applications. Yet, it currently is not a solved problem. The novel deep learning techniques could make a great improvement on this matter but they need a huge amount of annotated data. The hand pose datasets released so far present some issues that make them impossible to use on deep learning methods such as the few number of samples, high-level abstraction annotations or samples consisting in depth maps. In this work, we introduce a multiview hand pose dataset in which we provide color images of hands and different kind of annotations for each, i.e the bounding box and the 2D and 3D location on the joints in the hand. Besides, we introduce a simple yet accurate deep learning architecture for real-time robust 2D hand pose estimation.
We propose a Bayesian approximation to a deep learning architecture for 3D hand pose estimation. Through this framework, we explore and analyse the two types of uncertainties that are influenced either by data or by the learning capability. Furthermo re, we draw comparisons against the standard estimator over three popular benchmarks. The first contribution lies in outperforming the baseline while in the second part we address the active learning application. We also show that with a newly proposed acquisition function, our Bayesian 3D hand pose estimator obtains lowest errors with the least amount of data. The underlying code is publicly available at https://github.com/razvancaramalau/al_bhpe.
Estimating 3D hand pose directly from RGB imagesis challenging but has gained steady progress recently bytraining deep models with annotated 3D poses. Howeverannotating 3D poses is difficult and as such only a few 3Dhand pose datasets are available, all with limited samplesizes. In this study, we propose a new framework of training3D pose estimation models from RGB images without usingexplicit 3D annotations, i.e., trained with only 2D informa-tion. Our framework is motivated by two observations: 1)Videos provide richer information for estimating 3D posesas opposed to static images; 2) Estimated 3D poses oughtto be consistent whether the videos are viewed in the for-ward order or reverse order. We leverage these two obser-vations to develop a self-supervised learning model calledtemporal-aware self-supervised network (TASSN). By en-forcing temporal consistency constraints, TASSN learns 3Dhand poses and meshes from videos with only 2D keypointposition annotations. Experiments show that our modelachieves surprisingly good results, with 3D estimation ac-curacy on par with the state-of-the-art models trained with3D annotations, highlighting the benefit of the temporalconsistency in constraining 3D prediction models.
In this paper, we propose an adaptive weighting regression (AWR) method to leverage the advantages of both detection-based and regression-based methods. Hand joint coordinates are estimated as discrete integration of all pixels in dense representatio n, guided by adaptive weight maps. This learnable aggregation process introduces both dense and joint supervision that allows end-to-end training and brings adaptability to weight maps, making the network more accurate and robust. Comprehensive exploration experiments are conducted to validate the effectiveness and generality of AWR under various experimental settings, especially its usefulness for different types of dense representation and input modality. Our method outperforms other state-of-the-art methods on four publicly available datasets, including NYU, ICVL, MSRA and HANDS 2017 dataset.
3D hand-mesh reconstruction from RGB images facilitates many applications, including augmented reality (AR). However, this requires not only real-time speed and accurate hand pose and shape but also plausible mesh-image alignment. While existing work s already achieve promising results, meeting all three requirements is very challenging. This paper presents a novel pipeline by decoupling the hand-mesh reconstruction task into three stages: a joint stage to predict hand joints and segmentation; a mesh stage to predict a rough hand mesh; and a refine stage to fine-tune it with an offset mesh for mesh-image alignment. With careful design in the network structure and in the loss functions, we can promote high-quality finger-level mesh-image alignment and drive the models together to deliver real-time predictions. Extensive quantitative and qualitative results on benchmark datasets demonstrate that the quality of our results outperforms the state-of-the-art methods on hand-mesh/pose precision and hand-image alignment. In the end, we also showcase several real-time AR scenarios.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا