Localization is a critical enabler of autonomous robots. While deep learning has made significant strides in many computer vision tasks, it has yet to make a sizeable impact on metric visual localization. One of the major hindrances has been the inability of existing Convolutional Neural Network (CNN)-based pose regression methods to generalize to previously unseen places. Our recently introduced CMRNet effectively addresses this limitation by enabling map-independent monocular localization in LiDAR maps. In this paper, we take it a step further by introducing CMRNet++, a significantly more robust model that not only generalizes effectively to new places but is also independent of the camera parameters. We enable this capability by combining deep learning with geometric techniques and by moving the metric reasoning outside the learning process. In this way, the weights of the network are not tied to a specific camera. Extensive evaluations of CMRNet++ on three challenging autonomous driving datasets, namely KITTI, Argoverse, and Lyft5, show that CMRNet++ outperforms CMRNet as well as other baselines by a large margin. More importantly, for the first time, we demonstrate the ability of a deep learning approach to accurately localize in a completely new environment, without any retraining or fine-tuning and independent of the camera parameters.
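To illustrate what "moving the metric reasoning outside the learning process" can look like in practice, the sketch below shows the general pattern (my own illustration, not the authors' code): a network supplies 2D-3D correspondences between the camera image and the LiDAR map, and the metric pose is then recovered by a standard PnP + RANSAC solver that takes the camera intrinsics as a run-time input, so the learned weights never need to encode them. All inputs here are placeholders.

```python
# Minimal sketch (not the authors' code): recovering a metric pose outside the
# network with PnP + RANSAC, so the camera intrinsics stay out of the learned weights.
import numpy as np
import cv2  # OpenCV

# Hypothetical inputs: a matching network has associated N map points (3D, map frame)
# with pixel locations (2D) in the current camera image.
pts_3d = np.random.rand(100, 3).astype(np.float32) * 10.0   # placeholder map points
pts_2d = np.random.rand(100, 2).astype(np.float32) * 500.0  # placeholder pixel matches

# Camera intrinsics are supplied at run time, not baked into the network
# (illustrative values only).
K = np.array([[718.856, 0.0, 607.193],
              [0.0, 718.856, 185.216],
              [0.0, 0.0, 1.0]], dtype=np.float32)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts_3d, pts_2d, K, distCoeffs=None,
    reprojectionError=3.0, iterationsCount=1000)

if ok:
    R, _ = cv2.Rodrigues(rvec)   # camera rotation in the map frame
    print("estimated translation:", tvec.ravel(), "inliers:", len(inliers))
```

Because the solver, not the network, handles the projection geometry, swapping cameras only means passing a different K at inference time.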
Mapping and localization, preferably from a small number of observations, are fundamental tasks in robotics. We address these tasks by combining spatial structure (differentiable mapping) and end-to-end learning in a novel neural network architecture: the Differentiable Mapping Network (DMN). The DMN constructs a spatially structured view-embedding map and uses it for subsequent visual localization with a particle filter. Since the DMN architecture is end-to-end differentiable, we can jointly learn the map representation and localization using gradient descent. We apply the DMN to sparse visual localization, where a robot needs to localize in a new environment with respect to a small number of images from known viewpoints. We evaluate the DMN using simulated environments and a challenging real-world Street View dataset. We find that the DMN learns effective map representations for visual localization. The benefit of spatial structure increases with larger environments, more viewpoints for mapping, and when training data is scarce. Project website: http://sites.google.com/view/differentiable-mapping
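The localization half of such a system is a particle filter whose observation model is learned. The sketch below (my own illustration, not the DMN code) shows that loop with a placeholder likelihood standing in for the comparison between the current image embedding and the spatial view-embedding map; `learned_likelihood` and `view_map` are hypothetical names.

```python
# Minimal sketch (not the DMN code): particle-filter localization with a
# placeholder for a learned observation likelihood.
import numpy as np

rng = np.random.default_rng(0)
N = 256
particles = rng.uniform(low=[-5.0, -5.0, -np.pi], high=[5.0, 5.0, np.pi], size=(N, 3))
weights = np.full(N, 1.0 / N)

def learned_likelihood(observation, particle_pose, view_map):
    # Placeholder: a DMN-style model would compare the current image embedding
    # against the view-embedding map evaluated at the particle pose.
    return np.exp(-0.5 * np.sum(particle_pose[:2] ** 2))  # dummy score

def pf_update(observation, view_map, motion, motion_noise=0.05):
    global particles, weights
    # 1) motion update: apply odometry plus noise to every particle
    particles += motion + rng.normal(scale=motion_noise, size=(N, 3))
    # 2) measurement update: reweight particles with the learned likelihood
    weights = np.array([learned_likelihood(observation, p, view_map) for p in particles])
    weights /= weights.sum()
    # 3) resample proportionally to the weights
    idx = rng.choice(N, size=N, p=weights)
    particles, weights = particles[idx], np.full(N, 1.0 / N)
    return particles.mean(axis=0)  # pose estimate
```

End-to-end differentiability then comes from making each of these steps differentiable, so gradients from a localization loss can also shape the map representation.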
In complex environments, low-cost and robust localization is a challenging problem. For example, in a GPS-denied environment, LiDAR can provide accurate position information, but its cost is high. Visual-SLAM-based localization methods, in turn, become unreliable under large illumination changes. Therefore, inexpensive and reliable methods are required. In this paper, we propose a stereo visual localization method based on a prior LiDAR map. Different from conventional visual localization systems, we design a novel visual optimization model by matching planar information between the LiDAR map and the visual image. The bundle adjustment is formulated with coplanarity constraints. To solve the optimization problem, we use a graph-based optimization algorithm with a local window optimization method. Finally, we estimate a full six degrees of freedom (DOF) pose without scale drift. To validate its effectiveness, the proposed method has been tested on the KITTI dataset. The results show that our method is more robust and accurate than the state-of-the-art ORB-SLAM2.
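A coplanarity constraint of this kind adds a point-to-plane residual to the bundle adjustment. The sketch below (my own illustration, not the paper's code) shows such a residual: a stereo-triangulated point is transformed into the map frame with the current pose estimate and its signed distance to a matched LiDAR-map plane is the error the optimizer drives to zero.

```python
# Minimal sketch (illustration only): a point-to-plane residual of the kind a
# coplanarity constraint contributes to bundle adjustment.
import numpy as np

def point_to_plane_residual(point_cam, pose_R, pose_t, plane_n, plane_d):
    """Signed distance between a stereo-triangulated point (camera frame) and a
    LiDAR-map plane n.x + d = 0 (map frame), after transforming the point into
    the map frame with the current pose estimate (R, t)."""
    point_map = pose_R @ point_cam + pose_t
    return float(plane_n @ point_map + plane_d)

# Toy example: a point 0.1 m off the ground plane z = 0, identity pose.
n = np.array([0.0, 0.0, 1.0])
r = point_to_plane_residual(np.array([2.0, 1.0, 0.1]), np.eye(3), np.zeros(3), n, 0.0)
print(r)  # 0.1 -> the graph optimizer minimizes this over all matched points and poses
```

Stacking these residuals alongside the usual reprojection terms ties the visual reconstruction to the metric LiDAR map, which is what removes the scale drift.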
In this work, we propose a novel deep online correction (DOC) framework for monocular visual odometry. The pipeline has two stages: first, depth maps and initial poses are obtained from convolutional neural networks (CNNs) trained in a self-supervised manner; second, the poses predicted by the CNNs are further improved by minimizing photometric errors via gradient updates of the poses at inference time. The benefits of our proposed method are twofold: 1) unlike online-learning methods, DOC does not need to compute gradients for the CNN parameters, which saves computational resources at inference time; 2) unlike hybrid methods that combine CNNs with traditional techniques, DOC relies entirely on deep learning (DL) frameworks. Even without a complex back-end optimization module, our method achieves outstanding performance, with a relative transform error (RTE) of 2.0% on Seq. 09 of the KITTI odometry benchmark, outperforming traditional monocular VO frameworks and matching hybrid methods.
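The core mechanism, refining only the pose by gradient descent on a photometric loss while the network weights stay frozen, can be sketched as below. This is my own simplified illustration (not the authors' implementation): it uses a small-angle rotation approximation, random tensors in place of real images and CNN outputs, and a plain L1 photometric error.

```python
# Minimal sketch (not the authors' implementation): test-time pose refinement by
# gradient descent on the pose only, with frozen depth, in the spirit of DOC.
import torch
import torch.nn.functional as F

def warp(src_img, depth_tgt, pose, K):
    """Warp src_img into the target view using the target depth and a 6-DoF pose
    (3 translation + 3 small-angle rotation parameters)."""
    B, _, H, W = src_img.shape
    device = src_img.device
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                            torch.arange(W, device=device, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)  # 3 x HW
    cam = torch.linalg.inv(K) @ pix * depth_tgt.reshape(1, -1)              # back-project
    wx, wy, wz = pose[3], pose[4], pose[5]
    # small-angle rotation approximation (adequate for small correction steps)
    R = torch.stack([torch.stack([torch.ones_like(wx), -wz, wy]),
                     torch.stack([wz, torch.ones_like(wx), -wx]),
                     torch.stack([-wy, wx, torch.ones_like(wx)])])
    cam_src = R @ cam + pose[:3].unsqueeze(1)
    proj = K @ cam_src
    uv = proj[:2] / proj[2:].clamp(min=1e-6)
    u = 2.0 * uv[0] / (W - 1) - 1.0                 # normalize to [-1, 1]
    v = 2.0 * uv[1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(1, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)

# Hypothetical inputs: one target/source pair, a frozen CNN depth map, initial CNN pose.
H, W = 96, 320
K = torch.tensor([[200.0, 0.0, W / 2], [0.0, 200.0, H / 2], [0.0, 0.0, 1.0]])
tgt, src = torch.rand(1, 3, H, W), torch.rand(1, 3, H, W)
depth = torch.rand(1, H, W) * 10.0 + 1.0            # frozen depth prediction
pose = torch.zeros(6, requires_grad=True)           # initial pose from the pose CNN

opt = torch.optim.Adam([pose], lr=1e-3)
for _ in range(20):                                  # a few correction steps at inference
    opt.zero_grad()
    loss = (warp(src, depth, pose, K) - tgt).abs().mean()  # photometric (L1) error
    loss.backward()                                  # gradients reach the pose only
    opt.step()
```

Because only the 6-DoF pose vector receives gradients, each correction step is far cheaper than back-propagating through the CNNs, which is the efficiency argument made above.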
Modern high-definition LiDAR is expensive for commercial autonomous driving vehicles and small indoor robots. An affordable alternative is to fuse planar LiDAR with RGB images to provide a similar level of perception capability. Although state-of-the-art methods can predict depth from such limited sensor input, they usually rely on a simple concatenation of sparse LiDAR features and dense RGB features in an end-to-end fusion architecture. In this paper, we introduce an inductive late-fusion block, inspired by a probabilistic model, that better fuses the different sensor modalities. The proposed demonstration and aggregation network propagates the mixed context and depth features to the prediction network and serves as prior knowledge for depth completion. This late-fusion block uses the dense context features to guide the depth prediction based on demonstrations from the sparse depth features. In addition to evaluating the proposed method on benchmark depth completion datasets, including NYUDepthV2 and KITTI, we also test it on a simulated planar LiDAR dataset. Our method shows promising results compared to previous approaches on both the benchmark datasets and the simulated dataset at various 3D point densities.
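As a rough picture of this flavour of late fusion (a generic sketch of context-guided gating, not the paper's exact block), the module below lets dense RGB context features predict per-pixel confidences that gate the sparse-depth branch before the fused features are handed to a prediction head; all channel sizes are arbitrary.

```python
# Minimal sketch (generic late fusion, not the paper's exact architecture):
# dense context features gate the sparse-depth features before fusion.
import torch
import torch.nn as nn

class LateFusionBlock(nn.Module):
    def __init__(self, ctx_ch=64, depth_ch=16, out_ch=64):
        super().__init__()
        # per-pixel confidence map derived from the dense RGB context features
        self.gate = nn.Sequential(nn.Conv2d(ctx_ch, depth_ch, 3, padding=1), nn.Sigmoid())
        self.fuse = nn.Conv2d(ctx_ch + depth_ch, out_ch, 3, padding=1)

    def forward(self, ctx_feat, sparse_depth_feat):
        gated = self.gate(ctx_feat) * sparse_depth_feat   # context guides sparse depth
        return self.fuse(torch.cat([ctx_feat, gated], dim=1))

# Toy usage with hypothetical feature maps
ctx = torch.rand(1, 64, 64, 208)    # dense RGB context features
spd = torch.rand(1, 16, 64, 208)    # features from sparse planar-LiDAR depth
out = LateFusionBlock()(ctx, spd)   # fused features for the prediction network
print(out.shape)                    # torch.Size([1, 64, 64, 208])
```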
Deep neural networks have been widely studied in autonomous driving applications such as semantic segmentation and depth estimation. However, training a neural network in a supervised manner requires a large amount of annotated labels, which are expensive and time-consuming to collect. Recent studies leverage synthetic data collected in virtual environments, whose labels are much easier to acquire and more accurate than real-world annotations, but models trained on such data usually generalize poorly due to the inherent domain shift. In this paper, we propose Domain-Agnostic Contrastive Learning (DACL), a two-stage unsupervised domain adaptation framework with cyclic adversarial training and a contrastive loss. DACL guides the neural network toward domain-agnostic representations, mitigating the performance degradation that arises when the training and test data distributions differ. Our approach achieves better performance in monocular depth estimation than previous state-of-the-art methods and also shows effectiveness in semantic segmentation.
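For readers unfamiliar with the contrastive component of such frameworks, the sketch below shows a standard InfoNCE-style loss (a common choice, not necessarily DACL's exact formulation): embeddings of corresponding views, e.g. a synthetic image and its style-translated counterpart, are pulled together while embeddings of different images are pushed apart.

```python
# Minimal sketch (standard InfoNCE-style contrastive loss, not necessarily
# DACL's exact formulation).
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """z_a, z_b: (N, D) embeddings of corresponding views; row i of z_a is the
    positive pair of row i of z_b, all other rows act as negatives."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature          # cosine similarities, N x N
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, labels)        # diagonal entries are the positives

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))  # toy batch of embeddings
```

Optimizing such a loss across domains encourages features that depend on scene content rather than on the rendering or sensor domain, which is the intuition behind the domain-agnostic representation claimed above.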