ترغب بنشر مسار تعليمي؟ اضغط هنا

Most existing human pose estimation (HPE) methods exploit multi-scale information by fusing feature maps of four different spatial sizes, ie $1/4$, $1/8$, $1/16$, and $1/32$ of the input image. There are two drawbacks of this strategy: 1) feature map s of different spatial sizes may be not well aligned spatially, which potentially hurts the accuracy of keypoint location; 2) these scales are fixed and inflexible, which may restrict the generalization ability over various human sizes. Towards these issues, we propose an adaptive dilated convolution (ADC). It can generate and fuse multi-scale features of the same spatial sizes by setting different dilation rates for different channels. More importantly, these dilation rates are generated by a regression module. It enables ADC to adaptively adjust the fused scales and thus ADC may generalize better to various human sizes. ADC can be end-to-end trained and easily plugged into existing methods. Extensive experiments show that ADC can bring consistent improvements to various HPE methods. The source codes will be released for further research.
Heatmap regression has become the most prevalent choice for nowadays human pose estimation methods. The ground-truth heatmaps are usually constructed via covering all skeletal keypoints by 2D gaussian kernels. The standard deviations of these kernels are fixed. However, for bottom-up methods, which need to handle a large variance of human scales and labeling ambiguities, the current practice seems unreasonable. To better cope with these problems, we propose the scale-adaptive heatmap regression (SAHR) method, which can adaptively adjust the standard deviation for each keypoint. In this way, SAHR is more tolerant of various human scales and labeling ambiguities. However, SAHR may aggravate the imbalance between fore-background samples, which potentially hurts the improvement of SAHR. Thus, we further introduce the weight-adaptive heatmap regression (WAHR) to help balance the fore-background samples. Extensive experiments show that SAHR together with WAHR largely improves the accuracy of bottom-up human pose estimation. As a result, we finally outperform the state-of-the-art model by +1.5AP and achieve 72.0AP on COCO test-dev2017, which is com-arable with the performances of most top-down methods. Source codes are available at https://github.com/greatlog/SWAHR-HumanPose.
In this paper, we propose an efficient human pose estimation network (DANet) by learning deeply aggregated representations. Most existing models explore multi-scale information mainly from features with different spatial sizes. Powerful multi-scale r epresentations usually rely on the cascaded pyramid framework. This framework largely boosts the performance but in the meanwhile makes networks very deep and complex. Instead, we focus on exploiting multi-scale information from layers with different receptive-field sizes and then making full of use this information by improving the fusion method. Specifically, we propose an orthogonal attention block (OAB) and a second-order fusion unit (SFU). The OAB learns multi-scale information from different layers and enhances them by encouraging them to be diverse. The SFU adaptively selects and fuses diverse multi-scale information and suppress the redundant ones. This could maximize the effective information in final fused representations. With the help of OAB and SFU, our single pyramid network may be able to generate deeply aggregated representations that contain even richer multi-scale information and have a larger representing capacity than that of cascaded networks. Thus, our networks could achieve comparable or even better accuracy with much smaller model complexity. Specifically, our mbox{DANet-72} achieves $70.5$ in AP score on COCO test-dev set with only $1.0G$ FLOPs. Its speed on a CPU platform achieves $58$ Persons-Per-Second~(PPS).
In this paper, we propose a novel method called Residual Steps Network (RSN). RSN aggregates features with the same spatial size (Intra-level features) efficiently to obtain delicate local representations, which retain rich low-level spatial informat ion and result in precise keypoint localization. Additionally, we observe the output features contribute differently to final performance. To tackle this problem, we propose an efficient attention mechanism - Pose Refine Machine (PRM) to make a trade-off between local and global representations in output features and further refine the keypoint locations. Our approach won the 1st place of COCO Keypoint Challenge 2019 and achieves state-of-the-art results on both COCO and MPII benchmarks, without using extra training data and pretrained model. Our single model achieves 78.6 on COCO test-dev, 93.0 on MPII test dataset. Ensembled models achieve 79.2 on COCO test-dev, 77.1 on COCO test-challenge dataset. The source code is publicly available for further research at https://github.com/caiyuanhao1998/RSN/
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا