Low visual quality has kept underwater robotic vision from a wide range of applications. Although several algorithms have been developed, real-time, adaptive methods remain deficient for real-world tasks. In this paper, we address this difficulty with generative adversarial networks (GANs) and propose a GAN-based restoration scheme (GAN-RS). In particular, we develop a multi-branch discriminator, including an adversarial branch and a critic branch, to simultaneously preserve image content and remove underwater noise. In addition to adversarial learning, a novel dark channel prior loss encourages the generator to produce realistic images. More specifically, an underwater index is investigated to describe underwater properties, and a loss function based on this index is designed to train the critic branch for underwater noise suppression. Through extensive comparisons of visual quality and feature restoration, we confirm the superiority of the proposed approach. Consequently, GAN-RS adaptively improves underwater visual quality in real time and achieves superior overall restoration performance. Finally, a real-world experiment on grasping marine products is conducted on the seabed, and the results are quite promising. The source code is publicly available at https://github.com/SeanChenxy/GAN_RS.
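As a rough illustration of the two-branch discriminator idea described above, the following PyTorch sketch shows a shared convolutional trunk feeding an adversarial head (real/fake, content preservation) and a critic head (a scalar score that could be trained against an underwater index). Layer sizes, names, and widths are placeholders, not the GAN-RS architecture.

    # Minimal sketch of a multi-branch discriminator (illustrative, not GAN-RS).
    import torch
    import torch.nn as nn

    class MultiBranchDiscriminator(nn.Module):
        def __init__(self):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            )
            # Adversarial branch: patch-wise real/fake logits (content preservation).
            self.adv_head = nn.Conv2d(128, 1, 3, padding=1)
            # Critic branch: scalar score, trainable against an underwater index.
            self.critic_head = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1)
            )

        def forward(self, x):
            h = self.trunk(x)
            return self.adv_head(h), self.critic_head(h)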
Robot localization remains a challenging task in GPS-denied environments. State estimation approaches based on local sensors, e.g., cameras or IMUs, are drift-prone in long-range missions as error accumulates. In this study, we aim to address this problem by localizing image observations in a 2D multi-modal geospatial map. We introduce a cross-scale dataset and a methodology to produce additional data from cross-modality sources. We propose a framework that learns cross-scale visual representations without supervision. Experiments are conducted on data from two different domains, underwater and aerial. In contrast to existing studies in cross-view image geo-localization, our approach a) performs better on smaller-scale multi-modal maps; b) is more computationally efficient for real-time applications; and c) can be used directly in concert with state estimation pipelines.
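To make the retrieval step concrete, the sketch below matches a learned embedding of the current image observation against precomputed embeddings of 2D map tiles by cosine similarity and returns the best tile's map coordinates. This is an illustrative baseline, not the paper's model; the embedding functions and tile layout are assumed.

    # Illustrative embedding-based localization against a tiled 2D map.
    import numpy as np

    def cosine_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    def localize(obs_embedding, tile_embeddings, tile_centers):
        # tile_embeddings: (N, D) learned descriptors of map tiles
        # tile_centers:    (N, 2) map coordinates of each tile
        scores = np.array([cosine_sim(obs_embedding, t) for t in tile_embeddings])
        best = int(np.argmax(scores))
        return tile_centers[best], scores[best]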
This paper presents a new underwater dataset acquired with a visual-inertial-pressure acquisition system, intended for benchmarking visual odometry, visual SLAM, and multi-sensor SLAM solutions. The dataset is publicly available and contains ground-truth trajectories for evaluation.
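A minimal sketch of one common way ground-truth trajectories are used for evaluation, the absolute trajectory error (ATE RMSE), assuming the estimated and ground-truth positions are already time-synchronized and expressed in the same frame; this is a generic metric, not a procedure prescribed by the dataset.

    # ATE RMSE between an estimated trajectory and the ground truth (generic metric).
    import numpy as np

    def ate_rmse(est_xyz, gt_xyz):
        # est_xyz, gt_xyz: (N, 3) arrays of aligned, synchronized positions
        err = np.linalg.norm(est_xyz - gt_xyz, axis=1)
        return float(np.sqrt(np.mean(err ** 2)))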
Realistic simulators are critical for training and verifying robotics systems. While most contemporary simulators are hand-crafted, a scalable way to build simulators is to use machine learning to learn how the environment behaves in response to an action, directly from data. In this work, we aim to learn to simulate a dynamic environment directly in pixel space by watching unannotated sequences of frames and their associated actions. We introduce a novel high-quality neural simulator, referred to as DriveGAN, that achieves controllability by disentangling different components without supervision. In addition to steering controls, it also includes controls for sampling features of a scene, such as the weather and the location of non-player objects. Since DriveGAN is a fully differentiable simulator, it further allows re-simulation of a given video sequence, letting an agent drive through a recorded scene again while possibly taking different actions. We train DriveGAN on multiple datasets, including 160 hours of real-world driving data. We show that our approach greatly surpasses the performance of previous data-driven simulators and enables new features not explored before.
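The core loop of such a data-driven, action-conditioned simulator can be pictured as: encode the current frame to a latent, advance the latent with the control input, and decode back to pixels. The PyTorch sketch below shows this generic pattern; module choices and sizes are placeholders and do not reflect DriveGAN's actual architecture.

    # Illustrative action-conditioned world-model step (not DriveGAN's architecture).
    import torch
    import torch.nn as nn

    class LatentSimulator(nn.Module):
        def __init__(self, latent_dim=128, action_dim=2):
            super().__init__()
            self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, latent_dim))
            self.dynamics = nn.GRUCell(latent_dim + action_dim, latent_dim)
            self.decoder = nn.Linear(latent_dim, 3 * 64 * 64)

        def step(self, frame, action, hidden):
            z = self.encoder(frame)                                   # frame -> latent
            hidden = self.dynamics(torch.cat([z, action], dim=-1), hidden)
            next_frame = self.decoder(hidden).view(-1, 3, 64, 64)     # latent -> pixels
            return next_frame, hidden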
In real-world underwater environments, exploration of seabed resources, underwater archaeology, and underwater fishing rely on a variety of sensors; among them, the vision sensor is the most important owing to its high information content and its non-intrusive, passive nature. However, wavelength-dependent light attenuation and back-scattering cause color distortion and a haze effect, which degrade image visibility. To address this problem, we first propose an unsupervised generative adversarial network (GAN) that generates realistic underwater images (with color distortion and haze) from in-air image and depth map pairs, based on an improved underwater imaging model. Second, a U-Net, trained efficiently on the synthetic underwater dataset, is adopted for color restoration and dehazing. Our model directly reconstructs clear underwater images using end-to-end autoencoder networks while maintaining the structural similarity of scene content. The results obtained by our method are compared with existing methods both qualitatively and quantitatively. Experimental results demonstrate that the proposed model performs well on open real-world underwater datasets, and the processing speed reaches 125 FPS on a single NVIDIA 1060 GPU. The source code and sample datasets are publicly available at https://github.com/infrontofme/UWGAN_UIE.
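For intuition, a simplified underwater image-formation model that such synthesis typically builds on combines per-channel attenuation of the direct signal with back-scattered veiling light. The NumPy sketch below implements that simplified model; the attenuation and back-scatter coefficients are illustrative placeholders, not values from the paper.

    # Simplified underwater image formation: I = J * t + B * (1 - t), t = exp(-beta * d).
    import numpy as np

    def synthesize_underwater(J, depth, beta=(0.40, 0.15, 0.08), B=(0.05, 0.35, 0.45)):
        # J:     (H, W, 3) in-air image in [0, 1] (RGB)
        # depth: (H, W)    scene depth in metres
        beta = np.asarray(beta); B = np.asarray(B)
        t = np.exp(-beta[None, None, :] * depth[..., None])   # per-channel transmission
        return J * t + B[None, None, :] * (1.0 - t)           # direct signal + back-scatter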
In recent years, deep-learning-based visual object trackers have been studied thoroughly, but handling occlusions and/or rapid motion of the target remains challenging. In this work, we argue that conditioning on a natural language (NL) description of the target provides information for longer-term invariance and thus helps cope with typical tracking challenges. However, deriving a formulation that combines the strengths of appearance-based tracking with the language modality is not straightforward. We propose a novel deep tracking-by-detection formulation that can take advantage of NL descriptions. Regions related to the given NL description are generated by a proposal network during the detection phase of the tracker. Our LSTM-based tracker then predicts the update of the target from the regions proposed by the NL-based detection phase. On benchmarks, our method is competitive with state-of-the-art trackers, while it outperforms all other trackers on targets with unambiguous and precise language annotations. It also beats the state-of-the-art NL tracker when initialized without a bounding box. Our method runs at over 30 fps on a single GPU.
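A rough PyTorch sketch of one tracking step in this spirit: score region proposals against the embedding of the NL description, then feed the best-scoring region's feature into an LSTM cell that maintains the target state over time. The scoring rule, dimensions, and module names are assumptions for illustration, not the paper's exact formulation.

    # Illustrative NL-conditioned tracking-by-detection step (not the paper's model).
    import torch
    import torch.nn as nn

    class NLTracker(nn.Module):
        def __init__(self, feat_dim=256):
            super().__init__()
            self.lstm = nn.LSTMCell(feat_dim, feat_dim)

        def step(self, proposal_feats, lang_embedding, state):
            # proposal_feats: (N, feat_dim) features of NL-related region proposals
            # lang_embedding: (feat_dim,)   embedding of the target description
            scores = proposal_feats @ lang_embedding              # language-proposal match
            best = proposal_feats[scores.argmax()].unsqueeze(0)   # pick best proposal
            h, c = self.lstm(best, state)                         # temporal target update
            return scores, (h, c)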