Guided sparse depth upsampling aims to upsample an irregularly sampled sparse depth map when an aligned high-resolution color image is given as guidance. Many neural networks have been designed for this task. However, they often ignore the structural difference between the depth map and the color image, resulting in obvious artifacts such as texture copying and depth blur in the upsampled depth. Inspired by the normalized convolution operation, we propose a guided convolutional layer to recover dense depth from a sparse and irregular depth image, with a depth edge image as guidance. Our novel guided network prevents depth values from crossing depth edges, which facilitates upsampling. We further design a convolutional network based on the proposed layer to combine the advantages of different algorithms and achieve better performance. We conduct comprehensive experiments to verify our method on real-world indoor and synthetic outdoor datasets. Our method produces strong results: it outperforms state-of-the-art methods on the Virtual KITTI and Middlebury datasets, and it generalizes well across different 3D point densities and various lighting and weather conditions.
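As a rough illustration of the edge-guided normalized convolution idea described above, the following sketch applies a confidence-weighted convolution to sparse depth while suppressing confidence at depth edges so that values do not propagate across them. The layer name, the non-negativity constraint via softplus, and the assumption that the edge map lies in [0, 1] are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedNormConv2d(nn.Module):
    """Sketch of a normalized convolution with depth-edge guidance."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(channels, channels, kernel_size, kernel_size) * 0.1)
        self.pad = kernel_size // 2
        self.eps = 1e-8

    def forward(self, depth, conf, edge):
        # depth, conf, edge: (B, C, H, W); conf is 1 at measured pixels, 0 elsewhere.
        w = F.softplus(self.weight)            # non-negative weights keep the normalization well-posed
        conf = conf * (1.0 - edge)             # assumption: edge in [0, 1], 1 = depth edge
        num = F.conv2d(depth * conf, w, padding=self.pad)   # confidence-weighted aggregation
        den = F.conv2d(conf, w, padding=self.pad)           # local confidence mass
        out = num / (den + self.eps)
        # Propagate a confidence map for the next layer, normalized by the kernel sum.
        new_conf = den / (w.sum(dim=(1, 2, 3)).view(1, -1, 1, 1) + self.eps)
        return out, new_conf
```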
Dense depth estimation plays a key role in many applications such as robotics, 3D reconstruction, and augmented reality. While sparse signals, e.g., LiDAR and Radar, have been leveraged as guidance for enhancing dense depth estimation, the improvement is limited by their low density and imbalanced distribution. To maximize the utility of the sparse source, we propose the $S^3$ technique, which expands depth values from sparse cues while estimating the confidence of the expanded regions. The proposed $S^3$ can be applied to various guided depth estimation approaches and trained end-to-end at different stages, including the input, the cost volume, and the output. Extensive experiments demonstrate the effectiveness, robustness, and flexibility of the $S^3$ technique on LiDAR and Radar signals.
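The core idea of expanding sparse depth while tracking confidence can be illustrated with a toy, non-learned sketch: empty pixels are filled with a confidence-weighted local average of their neighbors, and the confidence of filled pixels decays with each expansion step. The dilation-style filling, the fixed decay factor, and the iteration count are assumptions for illustration and not the paper's learned expansion.

```python
import torch
import torch.nn.functional as F

def expand_sparse_depth(depth, valid, iters=5, decay=0.8, eps=1e-8):
    """depth: (B,1,H,W) sparse depth; valid: (B,1,H,W), 1 where a measurement exists."""
    conf = valid.float()
    filled = depth * conf
    for _ in range(iters):
        num = F.avg_pool2d(filled * conf, 3, stride=1, padding=1)
        den = F.avg_pool2d(conf, 3, stride=1, padding=1)
        prop = num / (den + eps)                              # confidence-weighted local average
        new_conf = decay * F.max_pool2d(conf, 3, stride=1, padding=1)
        empty = (conf == 0).float()
        filled = filled * (1 - empty) + prop * empty          # fill only empty pixels
        conf = conf * (1 - empty) + new_conf * empty          # decayed confidence in expanded regions
    return filled, conf
```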
This paper presents StereoNet, the first end-to-end deep architecture for real-time stereo matching that runs at 60 fps on an NVidia Titan X, producing high-quality, edge-preserved, quantization-free disparity maps. A key insight of this paper is that the network achieves a sub-pixel matching precision that is an order of magnitude higher than that of traditional stereo matching approaches. This allows us to achieve real-time performance by using a very low-resolution cost volume that encodes all the information needed to achieve high disparity precision. Spatial precision is achieved by employing a learned edge-aware upsampling function. Our model uses a Siamese network to extract features from the left and right images. A first estimate of the disparity is computed in a very low-resolution cost volume; the model then hierarchically re-introduces high-frequency details through a learned upsampling function that uses compact pixel-to-pixel refinement networks. Leveraging the color input as a guide, this function produces high-quality, edge-aware output. We achieve compelling results on multiple benchmarks, showing how the proposed method offers extreme flexibility at an acceptable computational budget.
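A minimal sketch of the two ingredients highlighted above: a coarse cost volume built from downsampled Siamese features, and a differentiable soft-argmin readout that yields sub-pixel disparity. A correlation-style cost is used here for brevity; the feature extractor, the learned edge-aware refinement, and the max-disparity value are assumptions, not the exact StereoNet components.

```python
import torch
import torch.nn.functional as F

def soft_argmin_disparity(feat_l, feat_r, max_disp=24):
    """feat_l, feat_r: (B,C,H,W) left/right features at low resolution."""
    B, C, H, W = feat_l.shape
    cost = feat_l.new_zeros(B, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            cost[:, d] = (feat_l * feat_r).mean(dim=1)
        else:
            # Correlate left pixels with right pixels shifted by disparity d.
            cost[:, d, :, d:] = (feat_l[:, :, :, d:] * feat_r[:, :, :, :-d]).mean(dim=1)
    prob = F.softmax(cost, dim=1)                                        # per-pixel disparity distribution
    disp_values = torch.arange(max_disp, device=cost.device).view(1, max_disp, 1, 1).float()
    return (prob * disp_values).sum(dim=1, keepdim=True)                 # sub-pixel expectation
```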
Camouflaged object detection (COD) aims to segment camouflaged objects hiding in the environment, which is challenging due to the similar appearance of camouflaged objects and their surroundings. Research in biology suggests that depth can provide useful object localization cues for camouflaged object discovery, as animals have 3D perception abilities. However, depth information has not yet been exploited for camouflaged object detection. To explore the contribution of depth to camouflage detection, we present a depth-guided camouflaged object detection network with pre-computed depth maps from existing monocular depth estimation methods. Due to the domain gap between the depth estimation dataset and our camouflaged object detection dataset, the generated depth may not be accurate enough to be used directly in our framework. We therefore introduce a depth quality assessment module to evaluate the quality of the depth based on the model predictions from both the RGB COD branch and the RGB-D COD branch. During training, only high-quality depth is used to update the modal interaction module for multi-modal learning. During testing, our depth quality assessment module can effectively determine the contribution of depth and select the RGB branch or the RGB-D branch for camouflage prediction. Extensive experiments on various camouflaged object detection datasets prove the effectiveness of our solution in exploring depth information for camouflaged object detection. Our code and data are publicly available at: https://github.com/JingZhang617/RGBD-COD.
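The quality-gated branch selection described above can be sketched as a small scoring network that decides, at test time, whether to trust the RGB-D prediction or fall back to the RGB-only one. The scoring architecture, the use of a hard 0.5 threshold, and the input layout are illustrative assumptions rather than the paper's module.

```python
import torch
import torch.nn as nn

class DepthQualityGate(nn.Module):
    """Sketch: score depth quality and select between RGB and RGB-D predictions."""

    def __init__(self, in_ch=4):  # assumed input: concatenated RGB (3) + depth (1)
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 1), nn.Sigmoid(),
        )

    def forward(self, rgb, depth, pred_rgb, pred_rgbd):
        q = self.score(torch.cat([rgb, depth], dim=1))        # scalar depth-quality score per image
        use_rgbd = (q > 0.5).float().view(-1, 1, 1, 1)        # hard selection at inference (assumption)
        return use_rgbd * pred_rgbd + (1 - use_rgbd) * pred_rgb
```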
Depth completion deals with the problem of recovering dense depth maps from sparse ones, where color images are often used to facilitate this completion. Recent approaches mainly focus on image-guided learning to predict dense results. However, blurry image guidance and unclear object structures in depth still impede the performance of image-guided frameworks. To tackle these problems, we explore a repetitive design in our image-guided network to sufficiently and gradually recover depth values. Specifically, the repetition is embodied in a color image guidance branch and a depth generation branch. In the former branch, we design a repetitive hourglass network to extract higher-level image features of complex environments, which provide powerful context guidance for depth prediction. In the latter branch, we design a repetitive guidance module based on dynamic convolution, where convolution factorization is applied to simultaneously reduce its complexity and progressively model high-frequency structures, e.g., boundaries. Further, in this module, we propose an adaptive fusion mechanism to effectively aggregate multi-step depth features. Extensive experiments show that our method achieves state-of-the-art results on the NYUv2 dataset and ranks 1st on the KITTI benchmark at the time of submission.
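A compact sketch of the repetitive, image-guided dynamic filtering idea: per-pixel kernels are predicted from concatenated guidance and depth features, applied to the depth features, and the step is repeated, with a simple learned fusion over the step outputs. Sharing a single spatial kernel across channels stands in for the paper's convolution factorization; the layer sizes, kernel size, and fusion-by-1x1-convolution are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepetitiveGuidance(nn.Module):
    """Sketch of repeated image-guided dynamic filtering with factorized kernels."""

    def __init__(self, guide_ch, depth_ch, k=3, steps=3):
        super().__init__()
        self.k, self.steps = k, steps
        # Factorized kernel prediction: one spatial kernel shared across all depth channels.
        self.kernel_gen = nn.Conv2d(guide_ch + depth_ch, k * k, 3, padding=1)
        self.fuse = nn.Conv2d(depth_ch * steps, depth_ch, 1)   # simple adaptive fusion of step outputs

    def step(self, guide, depth):
        B, C, H, W = depth.shape
        kernels = F.softmax(self.kernel_gen(torch.cat([guide, depth], 1)), dim=1)       # (B, k*k, H, W)
        patches = F.unfold(depth, self.k, padding=self.k // 2).view(B, C, self.k * self.k, H, W)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)                              # per-pixel filtering

    def forward(self, guide, depth):
        outs = []
        for _ in range(self.steps):
            depth = self.step(guide, depth)
            outs.append(depth)
        return self.fuse(torch.cat(outs, dim=1))
```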
High-resolution depth maps can be inferred from low-resolution depth measurements and an additional high-resolution intensity image of the same scene. To that end, we introduce a bimodal co-sparse analysis model, which is able to capture the interdependency of registered intensity and depth information. This model is based on the assumption that the co-supports of corresponding bimodal image structures are aligned when computed by a suitable pair of analysis operators. No analytic form of such operators exists, and we propose a method for learning them from a set of registered training signals. This learning process is done offline and returns a bimodal analysis operator that is universally applicable to natural scenes. We then exploit the bimodal co-sparse analysis model as a prior for solving inverse problems, which leads to an efficient algorithm for depth map super-resolution.
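One plausible form of such a co-sparse analysis prior for depth super-resolution, written as an inverse problem, is the following; the group penalty couples the analysis responses of depth and intensity so that their co-supports align. The specific data term, the group-sparsity penalty, and the symbols below are illustrative assumptions and may differ from the paper's exact formulation:

\[
\hat{d} \;=\; \arg\min_{d}\; \tfrac{1}{2}\,\bigl\lVert \mathbf{S}\,d - d_{\mathrm{LR}} \bigr\rVert_2^2 \;+\; \lambda \sum_{k} \sqrt{(\Omega_d\, d)_k^2 + (\Omega_u\, u)_k^2},
\]

where $d_{\mathrm{LR}}$ is the low-resolution depth measurement, $\mathbf{S}$ the downsampling operator, $u$ the registered high-resolution intensity image, and $(\Omega_d, \Omega_u)$ the learned pair of analysis operators; the joint $\ell_2$ term over corresponding responses encourages depth edges to appear only where the intensity image also exhibits structure.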