Is 2D Heatmap Representation Even Necessary for Human Pose Estimation?


Abstract in English

The 2D heatmap representation has dominated human pose estimation for years due to its high performance. However, heatmap-based approaches have some drawbacks: 1) The performance drops dramatically in the low-resolution images, which are frequently encountered in real-world scenarios. 2) To improve the localization precision, multiple upsample layers may be needed to recover the feature map resolution from low to high, which are computationally expensive. 3) Extra coordinate refinement is usually necessary to reduce the quantization error of downscaled heatmaps. To address these issues, we propose a textbf{Sim}ple yet promising textbf{D}isentangled textbf{R}epresentation for keypoint coordinate (emph{SimDR}), reformulating human keypoint localization as a task of classification. In detail, we propose to disentangle the representation of horizontal and vertical coordinates for keypoint location, leading to a more efficient scheme without extra upsampling and refinement. Comprehensive experiments conducted over COCO dataset show that the proposed emph{heatmap-free} methods outperform emph{heatmap-based} counterparts in all tested input resolutions, especially in lower resolutions by a large margin. Code will be made publicly available at url{https://github.com/leeyegy/SimDR}.

Download