ترغب بنشر مسار تعليمي؟ اضغط هنا

Unsupervised Deep Persistent Monocular Visual Odometry and Depth Estimation in Extreme Environments

160   0   0.0 ( 0 )
 نشر من قبل Yasin Almalioglu
 تاريخ النشر 2020
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

In recent years, unsupervised deep learning approaches have received significant attention to estimate the depth and visual odometry (VO) from unlabelled monocular image sequences. However, their performance is limited in challenging environments due to perceptual degradation, occlusions and rapid motions. Moreover, the existing unsupervised methods suffer from the lack of scale-consistency constraints across frames, which causes that the VO estimators fail to provide persistent trajectories over long sequences. In this study, we propose an unsupervised monocular deep VO framework that predicts six-degrees-of-freedom pose camera motion and depth map of the scene from unlabelled RGB image sequences. We provide detailed quantitative and qualitative evaluations of the proposed framework on a) a challenging dataset collected during the DARPA Subterranean challenge; and b) the benchmark KITTI and Cityscapes datasets. The proposed approach outperforms both traditional and state-of-the-art unsupervised deep VO methods providing better results for both pose estimation and depth recovery. The presented approach is part of the solution used by the COSTAR team participating at the DARPA Subterranean Challenge.

قيم البحث

اقرأ أيضاً

In the last decade, numerous supervised deep learning approaches requiring large amounts of labeled data have been proposed for visual-inertial odometry (VIO) and depth map estimation. To overcome the data limitation, self-supervised learning has eme rged as a promising alternative, exploiting constraints such as geometric and photometric consistency in the scene. In this study, we introduce a novel self-supervised deep learning-based VIO and depth map recovery approach (SelfVIO) using adversarial training and self-adaptive visual-inertial sensor fusion. SelfVIO learns to jointly estimate 6 degrees-of-freedom (6-DoF) ego-motion and a depth map of the scene from unlabeled monocular RGB image sequences and inertial measurement unit (IMU) readings. The proposed approach is able to perform VIO without the need for IMU intrinsic parameters and/or the extrinsic calibration between the IMU and the camera. estimation and single-view depth recovery network. We provide comprehensive quantitative and qualitative evaluations of the proposed framework comparing its performance with state-of-the-art VIO, VO, and visual simultaneous localization and mapping (VSLAM) approaches on the KITTI, EuRoC and Cityscapes datasets. Detailed comparisons prove that SelfVIO outperforms state-of-the-art VIO approaches in terms of pose estimation and depth recovery, making it a promising approach among existing methods in the literature.
Previous unsupervised monocular depth estimation methods mainly focus on the day-time scenario, and their frameworks are driven by warped photometric consistency. While in some challenging environments, like night, rainy night or snowy winter, the ph otometry of the same pixel on different frames is inconsistent because of the complex lighting and reflection, so that the day-time unsupervised frameworks cannot be directly applied to these complex scenarios. In this paper, we investigate the problem of unsupervised monocular depth estimation in certain highly complex scenarios. We address this challenging problem by using domain adaptation, and a unified image transfer-based adaptation framework is proposed based on monocular videos in this paper. The depth model trained on day-time scenarios is adapted to different complex scenarios. Instead of adapting the whole depth network, we just consider the encoder network for lower computational complexity. The depth models adapted by the proposed framework to different scenarios share the same decoder, which is practical. Constraints on both feature space and output space promote the framework to learn the key features for depth decoding, and the smoothness loss is introduced into the adaptation framework for better depth estimation performance. Extensive experiments show the effectiveness of the proposed unsupervised framework in estimating the dense depth map from the night-time, rainy night-time and snowy winter images.
Deep learning techniques hold promise to develop dense topography reconstruction and pose estimation methods for endoscopic videos. However, currently available datasets do not support effective quantitative benchmarking. In this paper, we introduce a comprehensive endoscopic SLAM dataset consisting of 3D point cloud data for six porcine organs, capsule and standard endoscopy recordings as well as synthetically generated data. A Panda robotic arm, two commercially available capsule endoscopes, two conventional endoscopes with different camera properties, and two high precision 3D scanners were employed to collect data from 8 ex-vivo porcine gastrointestinal (GI)-tract organs. In total, 35 sub-datasets are provided with 6D pose ground truth for the ex-vivo part: 18 sub-dataset for colon, 12 sub-datasets for stomach and 5 sub-datasets for small intestine, while four of these contain polyp-mimicking elevations carried out by an expert gastroenterologist. Synthetic capsule endoscopy frames from GI-tract with both depth and pose annotations are included to facilitate the study of simulation-to-real transfer learning algorithms. Additionally, we propound Endo-SfMLearner, an unsupervised monocular depth and pose estimation method that combines residual networks with spatial attention module in order to dictate the network to focus on distinguishable and highly textured tissue regions. The proposed approach makes use of a brightness-aware photometric loss to improve the robustness under fast frame-to-frame illumination changes. To exemplify the use-case of the EndoSLAM dataset, the performance of Endo-SfMLearner is extensively compared with the state-of-the-art. The codes and the link for the dataset are publicly available at https://github.com/CapsuleEndoscope/EndoSLAM. A video demonstrating the experimental setup and procedure is accessible through https://www.youtube.com/watch?v=G_LCe0aWWdQ.
510 - Xiaochuan Yin , Chengju Liu 2019
For ego-motion estimation, the feature representation of the scenes is crucial. Previous methods indicate that both the low-level and semantic feature-based methods can achieve promising results. Therefore, the incorporation of hierarchical feature r epresentation may benefit from both methods. From this perspective, we propose a novel direct feature odometry framework, named DFO, for depth estimation and hierarchical feature representation learning from monocular videos. By exploiting the metric distance, our framework is able to learn the hierarchical feature representation without supervision. The pose is obtained with a coarse-to-fine approach from high-level to low-level features in enlarged feature maps. The pixel-level attention mask can be self-learned to provide the prior information. In contrast to the previous methods, our proposed method calculates the camera motion with a direct method rather than regressing the ego-motion from the pose network. With this approach, the consistency of the scale factor of translation can be constrained. Additionally, the proposed method is thus compatible with the traditional SLAM pipeline. Experiments on the KITTI dataset demonstrate the effectiveness of our method.
In this work, we propose a novel deep online correction (DOC) framework for monocular visual odometry. The whole pipeline has two stages: First, depth maps and initial poses are obtained from convolutional neural networks (CNNs) trained in self-super vised manners. Second, the poses predicted by CNNs are further improved by minimizing photometric errors via gradient updates of poses during inference phases. The benefits of our proposed method are twofold: 1) Different from online-learning methods, DOC does not need to calculate gradient propagation for parameters of CNNs. Thus, it saves more computation resources during inference phases. 2) Unlike hybrid methods that combine CNNs with traditional methods, DOC fully relies on deep learning (DL) frameworks. Though without complex back-end optimization modules, our method achieves outstanding performance with relative transform error (RTE) = 2.0% on KITTI Odometry benchmark for Seq. 09, which outperforms traditional monocular VO frameworks and is comparable to hybrid methods.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا