No Arabic abstract
Building on progress in feature representations for image retrieval, image-based localization has seen a surge of research interest. Image-based localization has the advantage of being inexpensive and efficient, often avoiding the use of 3D metric maps altogether. That said, the need to maintain a large number of reference images as an effective support of localization in a scene, nonetheless calls for them to be organized in a map structure of some kind. The problem of localization often arises as part of a navigation process. We are, therefore, interested in summarizing the reference images as a set of landmarks, which meet the requirements for image-based navigation. A contribution of this paper is to formulate such a set of requirements for the two sub-tasks involved: map construction and self-localization. These requirements are then exploited for compact map representation and accurate self-localization, using the framework of a network flow problem. During this process, we formulate the map construction and self-localization problems as convex quadratic and second-order cone programs, respectively. We evaluate our methods on publicly available indoor and outdoor datasets, where they outperform existing methods significantly.
We present a novel method for visual mapping and localization for autonomous vehicles, by extracting, modeling, and optimizing semantic road elements. Specifically, our method integrates cascaded deep models to detect standardized road elements instead of traditional point features, to seek for improved pose accuracy and map representation compactness. To utilize the structural features, we model road lights and signs by their representative deep keypoints for skeleton and boundary, and parameterize lanes via piecewise cubic splines. Based on the road semantic features, we build a complete pipeline for mapping and localization, which includes a) image processing front-end, b) sensor fusion strategies, and c) optimization backend. Experiments on public datasets and our testing platform have demonstrated the effectiveness and advantages of our method by outperforming traditional approaches.
In this paper, we present a versatile method for visual localization. It is based on robust image retrieval for coarse camera pose estimation and robust local features for accurate pose refinement. Our method is top ranked on various public datasets showing its ability of generalization and its great variety of applications. To facilitate experiments, we introduce kapture, a flexible data format and processing pipeline for structure from motion and visual localization that is released open source. We furthermore provide all datasets used in this paper in the kapture format to facilitate research and data processing. Code and datasets can be found at https://github.com/naver/kapture, more information, updates, and news can be found at https://europe.naverlabs.com/research/3d-vision/kapture.
Mapping and localization, preferably from a small number of observations, are fundamental tasks in robotics. We address these tasks by combining spatial structure (differentiable mapping) and end-to-end learning in a novel neural network architecture: the Differentiable Mapping Network (DMN). The DMN constructs a spatially structured view-embedding map and uses it for subsequent visual localization with a particle filter. Since the DMN architecture is end-to-end differentiable, we can jointly learn the map representation and localization using gradient descent. We apply the DMN to sparse visual localization, where a robot needs to localize in a new environment with respect to a small number of images from known viewpoints. We evaluate the DMN using simulated environments and a challenging real-world Street View dataset. We find that the DMN learns effective map representations for visual localization. The benefit of spatial structure increases with larger environments, more viewpoints for mapping, and when training data is scarce. Project website: http://sites.google.com/view/differentiable-mapping
Visual navigation has been widely used for state estimation of micro aerial vehicles (MAVs). For stable visual navigation, MAVs should generate perception-aware paths which guarantee enough visible landmarks. Many previous works on perception-aware path planning focused on sampling-based planners. However, they may suffer from sample inefficiency, which leads to computational burden for finding a global optimal path. To address this issue, we suggest a perception-aware path planner which utilizes topological information of environments. Since the topological class of a path and visible landmarks during traveling the path are closely related, the proposed algorithm checks distinctive topological classes to choose the class with abundant visual information. Topological graph is extracted from the generalized Voronoi diagram of the environment and initial paths with different topological classes are found. To evaluate the perception quality of the classes, we divide the initial path into discrete segments where the points in each segment share similar visual information. The optimal class with high perception quality is selected, and a graph-based planner is utilized to generate path within the class. With simulations and real-world experiments, we confirmed that the proposed method could guarantee accurate visual navigation compared with the perception-agnostic method while showing improved computational efficiency than the sampling-based perception-aware planner.
We present an approach that combines appearance and semantic information for 2D image-based localization (2D-VL) across large perceptual changes and time lags. Compared to appearance features, the semantic layout of a scene is generally more invariant to appearance variations. We use this intuition and propose a novel end-to-end deep attention-based framework that utilizes multimodal cues to generate robust embeddings for 2D-VL. The proposed attention module predicts a shared channel attention and modality-specific spatial attentions to guide the embeddings to focus on more reliable image regions. We evaluate our model against state-of-the-art (SOTA) methods on three challenging localization datasets. We report an average (absolute) improvement of $19%$ over current SOTA for 2D-VL. Furthermore, we present an extensive study demonstrating the contribution of each component of our model, showing $8$--$15%$ and $4%$ improvement from adding semantic information and our proposed attention module. We finally show the predicted attention maps to offer useful insights into our model.