
3D-BEVIS: Birds-Eye-View Instance Segmentation

Submitted by: Cathrin Elich
Publication date: 2019
Research field: Informatics engineering
Research language: English





Recent deep learning models achieve impressive results on 3D scene analysis tasks by operating directly on unstructured point clouds. A lot of progress was made in the field of object classification and semantic segmentation. However, the task of instance segmentation is less explored. In this work, we present 3D-BEVIS, a deep learning framework for 3D semantic instance segmentation on point clouds. Following the idea of previous proposal-free instance segmentation approaches, our model learns a feature embedding and groups the obtained feature space into semantic instances. Current point-based methods scale linearly with the number of points by processing local sub-parts of a scene individually. However, to perform instance segmentation by clustering, globally consistent features are required. Therefore, we propose to combine local point geometry with global context information from an intermediate birds-eye view representation.
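To make the proposal-free grouping step concrete, here is a minimal sketch, assuming a per-point feature embedding is already available; the embedding network, the fusion with birds-eye-view features, and the mean-shift bandwidth are placeholder assumptions, not the paper's exact pipeline:

```python
# Minimal sketch: cluster a learned per-point embedding into instance ids.
# The embedding (local point features fused with global BEV features) would
# come from the network; random blobs stand in for it here.
import numpy as np
from sklearn.cluster import MeanShift

def group_instances(point_features: np.ndarray, bandwidth: float = 0.5) -> np.ndarray:
    """point_features: (N, D) per-point embedding. Returns (N,) instance ids."""
    return MeanShift(bandwidth=bandwidth).fit_predict(point_features)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two well-separated blobs as a stand-in for two object instances.
    feats = np.concatenate([rng.normal(c, 0.1, size=(500, 3)) for c in (0.0, 2.0)])
    print("instances found:", len(np.unique(group_instances(feats))))
```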


Read also

Birds-Eye-View (BEV) maps have emerged as one of the most powerful representations for scene understanding due to their ability to provide rich spatial context while being easy to interpret and process. However, generating BEV maps requires complex multi-stage paradigms that encapsulate a series of distinct tasks such as depth estimation, ground plane estimation, and semantic segmentation. These sub-tasks are often learned in a disjoint manner, which prevents the model from holistic reasoning and results in erroneous BEV maps. Moreover, existing algorithms only predict the semantics in the BEV space, which limits their use in applications where the notion of object instances is critical. In this work, we present the first end-to-end learning approach for directly predicting dense panoptic segmentation maps in the BEV, given a single monocular image in the frontal view (FV). Our architecture follows the top-down paradigm and incorporates a novel dense transformer module consisting of two distinct transformers that learn to independently map vertical and flat regions in the input image from the FV to the BEV. Additionally, we derive a mathematical formulation for the sensitivity of the FV-BEV transformation, which allows us to intelligently weight pixels in the BEV space to account for the varying descriptiveness across the FV image. Extensive evaluations on the KITTI-360 and nuScenes datasets demonstrate that our approach exceeds the state-of-the-art in the PQ metric by 3.61 pp and 4.93 pp, respectively.
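As a rough schematic of the two-stream FV-to-BEV mapping described above: plain matrix projections stand in for the paper's dense transformers, and all shapes, masks, and weights below are illustrative assumptions.

```python
# Schematic of mapping frontal-view (FV) features to a BEV grid with two
# independent streams for vertical and flat regions, combined at the end.
import numpy as np

def fv_to_bev(fv_feats, vertical_mask, W_vert, W_flat):
    """fv_feats: (H, W, C); vertical_mask: (H, W) bool;
    W_vert, W_flat: (H, D) column-to-depth projections. Returns (D, W, C)."""
    vert = fv_feats * vertical_mask[..., None]          # vertical-region stream
    flat = fv_feats * (~vertical_mask)[..., None]       # flat-region stream
    bev_vert = np.einsum('hwc,hd->dwc', vert, W_vert)   # image height -> BEV depth
    bev_flat = np.einsum('hwc,hd->dwc', flat, W_flat)
    return bev_vert + bev_flat                          # combine in the BEV grid

H, W, C, D = 64, 96, 8, 32
bev = fv_to_bev(np.random.rand(H, W, C), np.random.rand(H, W) > 0.5,
                np.random.rand(H, D), np.random.rand(H, D))
print(bev.shape)  # (32, 96, 8)
```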
Birds-eye-view (BEV) is a powerful and widely adopted representation for road scenes that captures surrounding objects and their spatial locations, along with overall context in the scene. In this work, we focus on birds-eye-view semantic segmentation, a task that predicts pixel-wise semantic segmentation in the BEV from side RGB images. This task is made possible by simulators such as Carla, which allow for cheap data collection, arbitrary camera placements, and supervision in ways otherwise not possible in the real world. There are two main challenges to this task: the view transformation from side view to birds-eye view, and transfer learning to unseen domains. Existing work transforms between views through fully connected layers and transfer-learns via GANs, which suffers from a lack of depth reasoning and from performance degradation across domains. Our novel two-stage perception pipeline explicitly predicts pixel depths and combines them with pixel semantics in an efficient manner, allowing the model to leverage depth information to infer objects' spatial locations in the BEV. In addition, we perform transfer learning by abstracting high-level geometric features and predicting an intermediate representation that is common across different domains. We publish a new dataset called BEVSEG-Carla and show that our approach improves on the state-of-the-art by 24% mIoU and performs well when transferred to a new domain.
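The depth-then-semantics lifting idea can be sketched as follows, assuming a pinhole camera and a ground-plane grid; the intrinsics, cell size, grid extent, and count-based splatting are illustrative assumptions rather than the paper's formulation:

```python
# Sketch: back-project each pixel with its predicted depth through a pinhole
# model, drop the height axis, and splat its semantic class onto a ground grid.
import numpy as np

def semantics_to_bev(depth, sem, fx, cx, grid=(128, 128), cell=0.25, n_classes=10):
    """depth: (H, W) metric depth; sem: (H, W) int class ids.
    Returns (n_classes, *grid) per-class point counts in the BEV."""
    H, W = depth.shape
    u = np.tile(np.arange(W), (H, 1))
    x = (u - cx) / fx * depth                            # lateral offset
    z = depth                                            # forward distance
    col = np.round(x / cell).astype(int) + grid[1] // 2
    row = grid[0] - 1 - np.round(z / cell).astype(int)   # ego at the bottom row
    ok = (row >= 0) & (row < grid[0]) & (col >= 0) & (col < grid[1])
    bev = np.zeros((n_classes, *grid))
    np.add.at(bev, (sem[ok], row[ok], col[ok]), 1.0)
    return bev

bev = semantics_to_bev(np.full((60, 80), 8.0), np.random.randint(0, 10, (60, 80)),
                       fx=100.0, cx=40.0)
print(bev.shape, int(bev.sum()))
```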
Lun Luo, Si-Yuan Cao, Bin Han (2021)
Recognizing places using Lidar in large-scale environments is challenging due to the sparse nature of point cloud data. In this paper we present BVMatch, a Lidar-based frame-to-frame place recognition framework that is capable of estimating 2D relative poses. Based on the assumption that the ground area can be approximated as a plane, we uniformly discretize the ground area into grids and project 3D Lidar scans to birds-eye view (BV) images. We further use a bank of Log-Gabor filters to build a maximum index map (MIM) that encodes the orientation information of the structures in the images. We analyze the orientation characteristics of the MIM theoretically and introduce a novel descriptor called birds-eye view feature transform (BVFT). The proposed BVFT is insensitive to rotation and intensity variations of BV images. Leveraging the BVFT descriptors, we unify the Lidar place recognition and pose estimation tasks into the BVMatch framework. The experiments conducted on three large-scale datasets show that BVMatch outperforms the state-of-the-art methods in terms of both recall rate of place recognition and pose estimation accuracy.
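A rough sketch of the BV projection step described above: the grid extent and resolution are arbitrary assumptions, and the Log-Gabor filtering, MIM, and BVFT stages of BVMatch are not reproduced.

```python
# Assume the ground is a plane, discretize it into grid cells, and rasterize
# the Lidar scan into a birds-eye-view image of point densities.
import numpy as np

def lidar_to_bv_image(points, extent=40.0, resolution=0.5):
    """points: (N, 3) Lidar points (x forward, y left, metres).
    Returns a square uint8 BV image of normalized point counts."""
    size = int(round(2 * extent / resolution))
    keep = (np.abs(points[:, 0]) < extent) & (np.abs(points[:, 1]) < extent)
    ix = ((points[keep, 0] + extent) / resolution).astype(int)
    iy = ((points[keep, 1] + extent) / resolution).astype(int)
    img = np.zeros((size, size))
    np.add.at(img, (ix, iy), 1.0)
    return (255 * img / max(img.max(), 1.0)).astype(np.uint8)

scan = np.random.uniform(-50, 50, size=(100000, 3))
print(lidar_to_bv_image(scan).shape)  # (160, 160)
```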
3D object detection based on LiDAR point clouds is a crucial module in autonomous driving, particularly for long-range sensing. Most of the research is focused on achieving higher accuracy, and these models are not optimized for deployment on embedded systems from the perspective of latency and power efficiency. For high-speed driving scenarios, latency is a crucial parameter as it provides more time to react to dangerous situations. Typically, a voxel- or point-cloud-based 3D convolution approach is utilized for this module. Firstly, such approaches are inefficient on embedded platforms as they are not suitable for efficient parallelization. Secondly, they have a variable runtime due to the level of sparsity of the scene, which is against the determinism needed in a safety system. In this work, we aim to develop a very low latency algorithm with a fixed runtime. We propose a novel semantic segmentation architecture as a single unified model for object center detection using key points, box prediction, and orientation prediction using binned classification in a simpler Birds-Eye-View (BEV) 2D representation. The proposed architecture can be trivially extended to include semantic segmentation classes like road without any additional computation. The proposed model has a latency of 4 ms on the embedded Nvidia Xavier platform and is 5x faster than other top-accuracy models, with a minimal accuracy degradation of 2% in Average Precision at IoU=0.5 on the KITTI dataset.
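Binned orientation classification, as mentioned above, can be illustrated with a small encode/decode helper; the bin count and the optional in-bin residual are assumptions rather than the paper's exact formulation:

```python
# Discretize the continuous yaw angle into K bins for a classifier, and
# recover a yaw estimate from the predicted bin at inference time.
import numpy as np

N_BINS = 12
BIN_WIDTH = 2 * np.pi / N_BINS

def yaw_to_bin(yaw: float) -> int:
    """Map a yaw angle in radians to a class index in [0, N_BINS)."""
    return int((yaw % (2 * np.pi)) // BIN_WIDTH) % N_BINS

def bin_to_yaw(bin_idx: int, residual: float = 0.0) -> float:
    """Recover a yaw estimate as the bin centre plus an optional residual."""
    return bin_idx * BIN_WIDTH + BIN_WIDTH / 2 + residual

print(yaw_to_bin(np.pi / 3), round(bin_to_yaw(yaw_to_bin(np.pi / 3)), 3))
```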
Instance segmentation on point clouds is a fundamental task in 3D scene perception. In this work, we propose a concise clustering-based framework named HAIS, which makes full use of the spatial relations of points and point sets. Considering that clustering-based methods may result in over-segmentation or under-segmentation, we introduce hierarchical aggregation to progressively generate instance proposals, i.e., point aggregation for preliminarily clustering points into sets and set aggregation for generating complete instances from sets. Once the complete 3D instances are obtained, a sub-network for intra-instance prediction is adopted for filtering noisy points and scoring mask quality. HAIS is fast (only 410 ms per frame) and does not require non-maximum suppression. It ranks 1st on the ScanNet v2 benchmark, achieving the highest AP50 of 69.9% and surpassing previous state-of-the-art (SOTA) methods by a large margin. Besides, the SOTA results on the S3DIS dataset validate its good generalization ability. Code will be available at https://github.com/hustvl/HAIS.
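A toy illustration of the two-level aggregation idea, with DBSCAN standing in for HAIS's learned point aggregation and a nearest-centroid merge standing in for set aggregation; the thresholds are illustrative and the intra-instance refinement network is not modelled:

```python
# First group points into preliminary sets (point aggregation), then absorb
# small fragments into the nearest large set (set aggregation).
import numpy as np
from sklearn.cluster import DBSCAN

def hierarchical_aggregation(xyz, eps=0.3, min_pts=20):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(xyz)   # point aggregation
    ids, counts = np.unique(labels[labels >= 0], return_counts=True)
    small, large = ids[counts < min_pts], ids[counts >= min_pts]
    centroids = {i: xyz[labels == i].mean(axis=0) for i in large}
    for s in small:                                            # set aggregation
        if not centroids:
            break
        c = xyz[labels == s].mean(axis=0)
        nearest = min(centroids, key=lambda i: np.linalg.norm(centroids[i] - c))
        labels[labels == s] = nearest
    return labels

rng = np.random.default_rng(1)
xyz = np.vstack([rng.normal(c, 0.2, (400, 3)) for c in ((0, 0, 0), (3, 0, 0))])
print(len(set(hierarchical_aggregation(xyz))))  # 2 instances (plus -1 for any noise)
```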