No Arabic abstract
Pseudo-LiDAR point cloud interpolation is a novel and challenging task in the field of autonomous driving, which aims to address the frequency mismatching problem between camera and LiDAR. Previous works represent the 3D spatial motion relationship induced by a coarse 2D optical flow, and the quality of interpolated point clouds only depends on the supervision of depth maps. As a result, the generated point clouds suffer from inferior global distributions and local appearances. To solve the above problems, we propose a Pseudo-LiDAR point cloud interpolation network to generates temporally and spatially high-quality point cloud sequences. By exploiting the scene flow between point clouds, the proposed network is able to learn a more accurate representation of the 3D spatial motion relationship. For the more comprehensive perception of the distribution of point cloud, we design a novel reconstruction loss function that implements the chamfer distance to supervise the generation of Pseudo-LiDAR point clouds in 3D space. In addition, we introduce a multi-modal deep aggregation module to facilitate the efficient fusion of texture and depth features. As the benefits of the improved motion representation, training loss function, and model structure, our approach gains significant improvements on the Pseudo-LiDAR point cloud interpolation task. The experimental results evaluated on KITTI dataset demonstrate the state-of-the-art performance of the proposed network, quantitatively and qualitatively.
The recently proposed pseudo-LiDAR based 3D detectors greatly improve the benchmark of monocular/stereo 3D detection task. However, the underlying mechanism remains obscure to the research community. In this paper, we perform an in-depth investigation and observe that the efficacy of pseudo-LiDAR representation comes from the coordinate transformation, instead of data representation itself. Based on this observation, we design an image based CNN detector named Patch-Net, which is more generalized and can be instantiated as pseudo-LiDAR based 3D detectors. Moreover, the pseudo-LiDAR data in our PatchNet is organized as the image representation, which means existing 2D CNN designs can be easily utilized for extracting deep features from input data and boosting 3D detection performance. We conduct extensive experiments on the challenging KITTI dataset, where the proposed PatchNet outperforms all existing pseudo-LiDAR based counterparts. Code has been made available at: https://github.com/xinzhuma/patchnet.
Projecting the point cloud on the 2D spherical range image transforms the LiDAR semantic segmentation to a 2D segmentation task on the range image. However, the LiDAR range image is still naturally different from the regular 2D RGB image; for example, each position on the range image encodes the unique geometry information. In this paper, we propose a new projection-based LiDAR semantic segmentation pipeline that consists of a novel network structure and an efficient post-processing step. In our network structure, we design a FID (fully interpolation decoding) module that directly upsamples the multi-resolution feature maps using bilinear interpolation. Inspired by the 3D distance interpolation used in PointNet++, we argue this FID module is a 2D version distance interpolation on $(theta, phi)$ space. As a parameter-free decoding module, the FID largely reduces the model complexity by maintaining good performance. Besides the network structure, we empirically find that our model predictions have clear boundaries between different semantic classes. This makes us rethink whether the widely used K-nearest-neighbor post-processing is still necessary for our pipeline. Then, we realize the many-to-one mapping causes the blurring effect that some points are mapped into the same pixel and share the same label. Therefore, we propose to process those occluded points by assigning the nearest predicted label to them. This NLA (nearest label assignment) post-processing step shows a better performance than KNN with faster inference speed in the ablation study. On the SemanticKITTI dataset, our pipeline achieves the best performance among all projection-based methods with $64 times 2048$ resolution and all point-wise solutions. With a ResNet-34 as the backbone, both the training and testing of our model can be finished on a single RTX 2080 Ti with 11G memory. The code is released.
Monocular 3D detection currently struggles with extremely lower detection rates compared to LiDAR-based methods. The poor accuracy is mainly caused by the absence of accurate location cues due to the ill-posed nature of monocular imagery. LiDAR point clouds, which provide precise spatial measurement, can offer beneficial information for the training of monocular methods. To make use of LiDAR point clouds, prior works project them to form depth map labels, subsequently training a dense depth estimator to extract explicit location features. This indirect and complicated way introduces intermediate products, i.e., depth map predictions, taking much computation costs as well as leading to suboptimal performances. In this paper, we propose LPCG (LiDAR point cloud guided monocular 3D object detection), which is a general framework for guiding the training of monocular 3D detectors with LiDAR point clouds. Specifically, we use LiDAR point clouds to generate pseudo labels, allowing monocular 3D detectors to benefit from easy-collected massive unlabeled data. LPCG works well under both supervised and unsupervised setups. Thanks to a general design, LPCG can be plugged into any monocular 3D detector, significantly boosting the performance. As a result, we take the first place on KITTI monocular 3D/BEV (birds-eye-view) detection benchmark with a considerable margin. The code will be made publicly available soon.
In this preliminary work we attempt to apply submanifold sparse convolution to the task of 3D person detection. In particular, we present Person-MinkUNet, a single-stage 3D person detection network based on Minkowski Engine with U-Net architecture. The network achieves a 76.4% average precision (AP) on the JRDB 3D detection benchmark.
It is laborious to manually label point cloud data for training high-quality 3D object detectors. This work proposes a weakly supervised approach for 3D object detection, only requiring a small set of weakly annotated scenes, associated with a few precisely labeled object instances. This is achieved by a two-stage architecture design. Stage-1 learns to generate cylindrical object proposals under weak supervision, i.e., only the horizontal centers of objects are click-annotated on birds view scenes. Stage-2 learns to refine the cylindrical proposals to get cuboids and confidence scores, using a few well-labeled object instances. Using only 500 weakly annotated scenes and 534 precisely labeled vehicle instances, our method achieves 85-95% the performance of current top-leading, fully supervised detectors (which require 3, 712 exhaustively and precisely annotated scenes with 15, 654 instances). More importantly, with our elaborately designed network architecture, our trained model can be applied as a 3D object annotator, allowing both automatic and active working modes. The annotations generated by our model can be used to train 3D object detectors with over 94% of their original performance (under manually labeled data). Our experiments also show our models potential in boosting performance given more training data. Above designs make our approach highly practical and introduce new opportunities for learning 3D object detection with reduced annotation burden.