Towards seamless multi-view scene analysis from satellite to street-level

89 0 0.0 ( 0 )

Download Cite

Added by Devis Tuia

Publication date 2017

fields Informatics Engineering

and research's language is English

Authors Sebastien Lef`evre - Devis Tuia - Jan Dirk Wegner

Computer Vision and Pattern Recognition

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

In this paper, we discuss and review how combined multi-view imagery from satellite to street-level can benefit scene analysis. Numerous works exist that merge information from remote sensing and images acquired from the ground for tasks like land cover mapping, object detection, or scene understanding. What makes the combination of overhead and street-level images challenging, is the strongly varying viewpoint, different scale, illumination, sensor modality and time of acquisition. Direct (dense) matching of images on a per-pixel basis is thus often impossible, and one has to resort to alternative strategies that will be discussed in this paper. We review recent works that attempt to combine images taken from the ground and overhead views for purposes like scene registration, reconstruction, or classification. Three methods that represent the wide range of potential methods and applications (change detection, image orientation, and tree cataloging) are described in detail. We show that cross-fertilization between remote sensing, computer vision and machine learning is very valuable to make the best of geographic data available from Earth Observation sensors and ground imagery. Despite its challenges, we believe that integrating these complementary data sources will lead to major breakthroughs in Big GeoData.

rate research

Sat2Vid: Street-view Panoramic Video Synthesis from a Single Satellite Image

148 - Zuoyue Li , Zhenqiang Li , Zhaopeng Cui 2020

We present a novel method for synthesizing both temporally and geometrically consistent street-view panoramic video from a single satellite image and camera trajectory. Existing cross-view synthesis approaches focus on images, while video synthesis in such a case has not yet received enough attention. For geometrical and temporal consistency, our approach explicitly creates a 3D point cloud representation of the scene and maintains dense 3D-2D correspondences across frames that reflect the geometric scene configuration inferred from the satellite view. As for synthesis in the 3D space, we implement a cascaded network architecture with two hourglass modules to generate point-wise coarse and fine features from semantics and per-class latent vectors, followed by projection to frames and an upsampling module to obtain the final realistic video. By leveraging computed correspondences, the produced street-view video frames adhere to the 3D geometric scene structure and maintain temporal consistency. Qualitative and quantitative experiments demonstrate superior results compared to other state-of-the-art synthesis approaches that either lack temporal consistency or realistic appearance. To the best of our knowledge, our work is the first one to synthesize cross-view images to video.

Computer Vision and Pattern Recognition

Geocoding of trees from street addresses and street-level images

118 - Daniel Laumer , Nico Lang , Natalie van Doorn 2020

We introduce an approach for updating older tree inventories with geographic coordinates using street-level panorama images and a global optimization framework for tree instance matching. Geolocations of trees in inventories until the early 2000s where recorded using street addresses whereas newer inventories use GPS. Our method retrofits older inventories with geographic coordinates to allow connecting them with newer inventories to facilitate long-term studies on tree mortality etc. What makes this problem challenging is the different number of trees per street address, the heterogeneous appearance of different tree instances in the images, ambiguous tree positions if viewed from multiple images and occlusions. To solve this assignment problem, we (i) detect trees in Google street-view panoramas using deep learning, (ii) combine multi-view detections per tree into a single representation, (iii) and match detected trees with given trees per street address with a global optimization approach. Experiments for > 50000 trees in 5 cities in California, USA, show that we are able to assign geographic coordinates to 38 % of the street trees, which is a good starting point for long-term studies on the ecosystem services value of street trees at large scale.

Computer Vision and Pattern Recognition

Multi-view Fusion for Multi-level Robotic Scene Understanding

58 - Yunzhi Lin , Jonathan Tremblay , Stephen Tyree 2021

We present a system for multi-level scene awareness for robotic manipulation. Given a sequence of camera-in-hand RGB images, the system calculates three types of information: 1) a point cloud representation of all the surfaces in the scene, for the purpose of obstacle avoidance. 2) the rough pose of unknown objects from categories corresponding to primitive shapes (e.g., cuboids and cylinders), and 3) full 6-DoF pose of known objects. By developing and fusing recent techniques in these domains, we provide a rich scene representation for robot awareness. We demonstrate the importance of each of these modules, their complementary nature, and the potential benefits of the system in the context of robotic manipulation.

Robotics

Privacy Protection in Street-View Panoramas using Depth and Multi-View Imagery

84 - Ries Uittenbogaard , Clint Sebastian , Julien Vijverberg 2019

The current paradigm in privacy protection in street-view images is to detect and blur sensitive information. In this paper, we propose a framework that is an alternative to blurring, which automatically removes and inpaints moving objects (e.g. pedestrians, vehicles) in street-view imagery. We propose a novel moving object segmentation algorithm exploiting consistencies in depth across multiple street-view images that are later combined with the results of a segmentation network. The detected moving objects are removed and inpainted with information from other views, to obtain a realistic output image such that the moving object is not visible anymore. We evaluate our results on a dataset of 1000 images to obtain a peak noise-to-signal ratio (PSNR) and L1 loss of 27.2 dB and 2.5%, respectively. To ensure the subjective quality, To assess overall quality, we also report the results of a survey conducted on 35 professionals, asked to visually inspect the images whether object removal and inpainting had taken place. The inpainting dataset will be made publicly available for scientific benchmarking purposes at https://research.cyclomedia.com

Computer Vision and Pattern Recognition

Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks

168 - Ian J. Goodfellow , Yaroslav Bulatov , Julian Ibarz 2013

Recognizing arbitrary multi-character text in unconstrained natural photographs is a hard problem. In this paper, we address an equally hard sub-problem in this domain viz. recognizing arbitrary multi-digit numbers from Street View imagery. Traditional approaches to solve this problem typically separate out the localization, segmentation, and recognition steps. In this paper we propose a unified approach that integrates these three steps via the use of a deep convolutional neural network that operates directly on the image pixels. We employ the DistBelief implementation of deep neural networks in order to train large, distributed neural networks on high quality images. We find that the performance of this approach increases with the depth of the convolutional network, with the best performance occurring in the deepest architecture we trained, with eleven hidden layers. We evaluate this approach on the publicly available SVHN dataset and achieve over $96%$ accuracy in recognizing complete street numbers. We show that on a per-digit recognition task, we improve upon the state-of-the-art, achieving $97.84%$ accuracy. We also evaluate this approach on an even more challenging dataset generated from Street View imagery containing several tens of millions of street number annotations and achieve over $90%$ accuracy. To further explore the applicability of the proposed system to broader text recognition tasks, we apply it to synthetic distorted text from reCAPTCHA. reCAPTCHA is one of the most secure reverse turing tests that uses distorted text to distinguish humans from bots. We report a $99.8%$ accuracy on the hardest category of reCAPTCHA. Our evaluations on both tasks indicate that at specific operating thresholds, the performance of the proposed system is comparable to, and in some cases exceeds, that of human operators.

Computer Vision and Pattern Recognition