No Arabic abstract
While depth sensors are becoming increasingly popular, their spatial resolution often remains limited. Depth super-resolution therefore emerged as a solution to this problem. Despite much progress, state-of-the-art techniques suffer from two drawbacks: (i) they rely on the assumption that intensity edges coincide with depth discontinuities, which, unfortunately, is only true in controlled environments; and (ii) they typically exploit the availability of high-resolution training depth maps, which can often not be acquired in practice due to the sensors limitations. By contrast, here, we introduce an approach to performing depth super-resolution in more challenging conditions, such as in outdoor scenes. To this end, we first propose to exploit semantic information to better constrain the super-resolution process. In particular, we design a co-sparse analysis model that learns filters from joint intensity, depth and semantic information. Furthermore, we show how low-resolution training depth maps can be employed in our learning strategy. We demonstrate the benefits of our approach over state-of-the-art depth super-resolution methods on two outdoor scene datasets.
RGBD images, combining high-resolution color and lower-resolution depth from various types of depth sensors, are increasingly common. One can significantly improve the resolution of depth maps by taking advantage of color information; deep learning methods make combining color and depth information particularly easy. However, fusing these two sources of data may lead to a variety of artifacts. If depth maps are used to reconstruct 3D shapes, e.g., for virtual reality applications, the visual quality of upsampled images is particularly important. The main idea of our approach is to measure the quality of depth map upsampling using renderings of resulting 3D surfaces. We demonstrate that a simple visual appearance-based loss, when used with either a trained CNN or simply a deep prior, yields significantly improved 3D shapes, as measured by a number of existing perceptual metrics. We compare this approach with a number of existing optimization and learning-based techniques.
Unsupervised domain adaptation (UDA) is important for applications where large scale annotation of representative data is challenging. For semantic segmentation in particular, it helps deploy on real target domain data models that are trained on annotated images from a different source domain, notably a virtual environment. To this end, most previous works consider semantic segmentation as the only mode of supervision for source domain data, while ignoring other, possibly available, information like depth. In this work, we aim at exploiting at best such a privileged information while training the UDA model. We propose a unified depth-aware UDA framework that leverages in several complementary ways the knowledge of dense depth in the source domain. As a result, the performance of the trained semantic segmentation model on the target domain is boosted. Our novel approach indeed achieves state-of-the-art performance on different challenging synthetic-2-real benchmarks.
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision. Our technical contributions are three-fold. First, we highlight the fundamental difference between inverse and forward projection while modeling the individual motion of each rigid object, and propose a geometrically correct projection pipeline using a neural forward projection module. Second, we design a unified instance-aware photometric and geometric consistency loss that holistically imposes self-supervisory signals for every background and object region. Lastly, we introduce a general-purpose auto-annotation scheme using any off-the-shelf instance segmentation and optical flow models to produce video instance segmentation maps that will be utilized as input to our training pipeline. These proposed elements are validated in a detailed ablation study. Through extensive experiments conducted on the KITTI and Cityscapes dataset, our framework is shown to outperform the state-of-the-art depth and motion estimation methods. Our code, dataset, and models are available at https://github.com/SeokjuLee/Insta-DM .
Memorability measures how easily an image is to be memorized after glancing, which may contribute to designing magazine covers, tourism publicity materials, and so forth. Recent works have shed light on the visual features that make generic images, object images or face photographs memorable. However, these methods are not able to effectively predict the memorability of outdoor natural scene images. To overcome this shortcoming of previous works, in this paper, we provide an attempt to answer: what exactly makes outdoor natural scenes memorable. To this end, we first establish a large-scale outdoor natural scene image memorability (LNSIM) database, containing 2,632 outdoor natural scene images with their ground truth memorability scores and the multi-label scene category annotations. Then, similar to previous works, we mine our database to investigate how low-, middle- and high-level handcrafted features affect the memorability of outdoor natural scenes. In particular, we find that the high-level feature of scene category is rather correlated with outdoor natural scene memorability, and the deep features learnt by deep neural network (DNN) are also effective in predicting the memorability scores. Moreover, combining the deep features with the category feature can further boost the performance of memorability prediction. Therefore, we propose an end-to-end DNN based outdoor natural scene memorability (DeepNSM) predictor, which takes advantage of the learned category-related features. Then, the experimental results validate the effectiveness of our DeepNSM model, exceeding the state-of-the-art methods. Finally, we try to understand the reason of the good performance for our DeepNSM model, and also study the cases that our DeepNSM model succeeds or fails to accurately predict the memorability of outdoor natural scenes.
Scene text recognition has witnessed rapid development with the advance of convolutional neural networks. Nonetheless, most of the previous methods may not work well in recognizing text with low resolution which is often seen in natural scene images. An intuitive solution is to introduce super-resolution techniques as pre-processing. However, conventional super-resolution methods in the literature mainly focus on reconstructing the detailed texture of natural images, which typically do not work well for text due to the unique characteristics of text. To tackle these problems, in this work, we propose a content-aware text super-resolution network to generate the information desired for text recognition. In particular, we design an end-to-end network that can perform super-resolution and text recognition simultaneously. Different from previous super-resolution methods, we use the loss of text recognition as the Text Perceptual Loss to guide the training of the super-resolution network, and thus it pays more attention to the text content, rather than the irrelevant background area. Extensive experiments on several challenging benchmarks demonstrate the effectiveness of our proposed method in restoring a sharp high-resolution image from a small blurred one, and show that the recognition performance clearly boosts up the performance of text recognizer. To our knowledge, this is the first work focusing on text super-resolution. Code will be released in https://github.com/xieenze/TextSR.