3D Bounding Box Estimation Using Deep Learning and Geometry

364 0 0.0 ( 0 )

Download Cite

Added by Arsalan Mousavian

Publication date 2016

fields Informatics Engineering

and research's language is English

Authors Arsalan Mousavian - Dragomir Anguelov - John Flynn

Computer Vision and Pattern Recognition

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We present a method for 3D object detection and pose estimation from a single image. In contrast to current techniques that only regress the 3D orientation of an object, our method first regresses relatively stable 3D object properties using a deep convolutional neural network and then combines these estimates with geometric constraints provided by a 2D object bounding box to produce a complete 3D bounding box. The first network output estimates the 3D object orientation using a novel hybrid discrete-continuous loss, which significantly outperforms the L2 loss. The second output regresses the 3D object dimensions, which have relatively little variance compared to alternatives and can often be predicted for many object types. These estimates, combined with the geometric constraints on translation imposed by the 2D bounding box, enable us to recover a stable and accurate 3D object pose. We evaluate our method on the challenging KITTI object detection benchmark both on the official metric of 3D orientation estimation and also on the accuracy of the obtained 3D bounding boxes. Although conceptually simple, our method outperforms more complex and computationally expensive approaches that leverage semantic segmentation, instance level segmentation and flat ground priors and sub-category detection. Our discrete-continuous loss also produces state of the art results for 3D viewpoint estimation on the Pascal 3D+ dataset.

rate research

Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels

203 - L. Koestler , N. Yang , R. Wang 2020

The training of deep-learning-based 3D object detectors requires large datasets with 3D bounding box labels for supervision that have to be generated by hand-labeling. We propose a network architecture and training procedure for learning monocular 3D object detection without 3D bounding box labels. By representing the objects as triangular meshes and employing differentiable shape rendering, we define loss functions based on depth maps, segmentation masks, and ego- and object-motion, which are generated by pre-trained, off-the-shelf networks. We evaluate the proposed algorithm on the real-world KITTI dataset and achieve promising performance in comparison to state-of-the-art methods requiring 3D bounding box labels for training and superior performance to conventional baseline methods.

Computer Vision and Pattern Recognition

Alpha-Refine: Boosting Tracking Performance by Precise Bounding Box Estimation

73 - Bin Yan , Xinyu Zhang , Dong Wang 2020

Visual object tracking aims to precisely estimate the bounding box for the given target, which is a challenging problem due to factors such as deformation and occlusion. Many recent trackers adopt the multiple-stage tracking strategy to improve the quality of bounding box estimation. These methods first coarsely locate the target and then refine the initial prediction in the following stages. However, existing approaches still suffer from limited precision, and the coupling of different stages severely restricts the methods transferability. This work proposes a novel, flexible, and accurate refinement module called Alpha-Refine (AR), which can significantly improve the base trackers box estimation quality. By exploring a series of design options, we conclude that the key to successful refinement is extracting and maintaining detailed spatial information as much as possible. Following this principle, Alpha-Refine adopts a pixel-wise correlation, a corner prediction head, and an auxiliary mask head as the core components. Comprehensive experiments on TrackingNet, LaSOT, GOT-10K, and VOT2020 benchmarks with multiple base trackers show that our approach significantly improves the base trackers performance with little extra latency. The proposed Alpha-Refine method leads to a series of strengthened trackers, among which the ARSiamRPN (AR strengthened SiamRPNpp) and the ARDiMP50 (ARstrengthened DiMP50) achieve good efficiency-precision trade-off, while the ARDiMPsuper (AR strengthened DiMP-super) achieves very competitive performance at a real-time speed. Code and pretrained models are available at https://github.com/MasterBin-IIAU/AlphaRefine.

Computer Vision and Pattern Recognition

Alpha-Refine: Boosting Tracking Performance by Precise Bounding Box Estimation

68 - Bin Yan , Dong Wang , Huchuan Lu 2020

In recent years, the multiple-stage strategy has become a popular trend for visual tracking. This strategy first utilizes a base tracker to coarsely locate the target and then exploits a refinement module to obtain more accurate results. However, existing refinement modules suffer from the limited transferability and precision. In this work, we propose a novel, flexible and accurate refinement module called Alpha-Refine, which exploits a precise pixel-wise correlation layer together with a spatial-aware non-local layer to fuse features and can predict three complementary outputs: bounding box, corners and mask. To wisely choose the most adequate output, we also design a light-weight branch selector module. We apply the proposed Alpha-Refine module to five famous and state-of-the-art base trackers: DiMP, ATOM, SiamRPN++, RTMDNet and ECO. The comprehensive experiments on TrackingNet, LaSOT and VOT2018 benchmarks demonstrate that our approach significantly improves the tracking performance in comparison with other existing refinement methods. The source codes will be available at https://github.com/MasterBin-IIAU/AlphaRefine.

Computer Vision and Pattern Recognition

Canonical Voting: Towards Robust Oriented Bounding Box Detection in 3D Scenes

172 - Yang You , Zelin Ye , Yujing Lou 2020

3D object detection has attracted much attention thanks to the advances in sensors and deep learning methods for point clouds. Current state-of-the-art methods like VoteNet regress direct offset towards object centers and box orientations with an additional Multi-Layer-Perceptron network. Both their offset and orientation predictions are not accurate due to the fundamental difficulty in rotation classification. In the work, we disentangle the direct offset into Local Canonical Coordinates (LCC), box scales and box orientations. Only LCC and box scales are regressed while box orientations are generated by a canonical voting scheme. Finally, a LCC-aware back-projection checking algorithm iteratively cuts out bounding boxes from the generated vote maps, with the elimination of false positives. Our model achieves state-of-the-art performance on challenging large-scale datasets of real point cloud scans: ScanNet, SceneNN with 8.8 and 5.1 mAP improvement respectively. Code is available on https://github.com/qq456cvb/CanonicalVoting.

Computer Vision and Pattern Recognition

Improving Head Pose Estimation with a Combined Loss and Bounding Box Margin Adjustment

71 - Mingzhen Shao , Zhun Sun , Mete Ozay 2019

We address a problem of estimating pose of a persons head from its RGB image. The employment of CNNs for the problem has contributed to significant improvement in accuracy in recent works. However, we show that the following two methods, despite their simplicity, can attain further improvement: (i) proper adjustment of the margin of bounding box of a detected face, and (ii) choice of loss functions. We show that the integration of these two methods achieve the new state-of-the-art on standard benchmark datasets for in-the-wild head pose estimation.

Computer Vision and Pattern Recognition