No Arabic abstract
High-dynamic-range (HDR) photography involves fusing a bracket of images taken at different exposure settings in order to compensate for the low dynamic range of digital cameras such as the ones used in smartphones. In this paper, a method for automatically selecting the exposure settings of such images is introduced based on the camera characteristic function. In addition, a new fusion method is introduced based on an optimization formulation and weighted averaging. Both of these methods are implemented on a smartphone platform as an HDR app to demonstrate the practicality of the introduced methods. Comparison results with several existing methods are presented indicating the effectiveness as well as the computational efficiency of the introduced solution.
This paper considers the problem of generating an HDR image of a scene from its LDR images. Recent studies employ deep learning and solve the problem in an end-to-end fashion, leading to significant performance improvements. However, it is still hard to generate a good quality image from LDR images of a dynamic scene captured by a hand-held camera, e.g., occlusion due to the large motion of foreground objects, causing ghosting artifacts. The key to success relies on how well we can fuse the input images in their feature space, where we wish to remove the factors leading to low-quality image generation while performing the fundamental computations for HDR image generation, e.g., selecting the best-exposed image/region. We propose a novel method that can better fuse the features based on two ideas. One is multi-step feature fusion; our network gradually fuses the features in a stack of blocks having the same structure. The other is the design of the component block that effectively performs two operations essential to the problem, i.e., comparing and selecting appropriate images/regions. Experimental results show that the proposed method outperforms the previous state-of-the-art methods on the standard benchmark tests.
Deep convolutional neural networks (DCNN) aided high dynamic range (HDR) imaging recently received a lot of attention. The quality of DCNN generated HDR images have overperformed the traditional counterparts. However, DCNN is prone to be computationally intensive and power-hungry. To address the challenge, we propose LightFuse, a light-weight CNN-based algorithm for extreme dual-exposure image fusion, which can be implemented on various embedded computing platforms with limited power and hardware resources. Two sub-networks are utilized: a GlobalNet (G) and a DetailNet (D). The goal of G is to learn the global illumination information on the spatial dimension, whereas D aims to enhance local details on the channel dimension. Both G and D are based solely on depthwise convolution (D Conv) and pointwise convolution (P Conv) to reduce required parameters and computations. Experimental results display that the proposed technique could generate HDR images with plausible details in extremely exposed regions. Our PSNR score exceeds the other state-of-the-art approaches by 1.2 to 1.6 times and achieves 1.4 to 20 times FLOP and parameter reduction compared with others.
3D multi-object tracking is an important component in robotic perception systems such as self-driving vehicles. Recent work follows a tracking-by-detection pipeline, which aims to match past tracklets with detections in the current frame. To avoid matching with false positive detections, prior work filters out detections with low confidence scores via a threshold. However, finding a proper threshold is non-trivial, which requires extensive manual search via ablation study. Also, this threshold is sensitive to many factors such as target object category so we need to re-search the threshold if these factors change. To ease this process, we propose to automatically select high-quality detections and remove the efforts needed for manual threshold search. Also, prior work often uses a single threshold per data sequence, which is sub-optimal in particular frames or for certain objects. Instead, we dynamically search threshold per frame or per object to further boost performance. Through experiments on KITTI and nuScenes, our method can filter out $45.7%$ false positives while maintaining the recall, achieving new S.O.T.A. performance and removing the need for manually threshold tuning.
Learning effective fusion of multi-modality features is at the heart of visual question answering. We propose a novel method of dynamically fusing multi-modal features with intra- and inter-modality information flow, which alternatively pass dynamic information between and across the visual and language modalities. It can robustly capture the high-level interactions between language and vision domains, thus significantly improves the performance of visual question answering. We also show that the proposed dynamic intra-modality attention flow conditioned on the other modality can dynamically modulate the intra-modality attention of the target modality, which is vital for multimodality feature fusion. Experimental evaluations on the VQA 2.0 dataset show that the proposed method achieves state-of-the-art VQA performance. Extensive ablation studies are carried out for the comprehensive analysis of the proposed method.
Saliency prediction for Standard Dynamic Range (SDR) videos has been well explored in the last decade. However, limited studies are available on High Dynamic Range (HDR) Visual Attention Models (VAMs). Considering that the characteristic of HDR content in terms of dynamic range and color gamut is quite different than those of SDR content, it is essential to identify the importance of different saliency attributes of HDR videos for designing a VAM and understand how to combine these features. To this end we propose a learning-based visual saliency fusion method for HDR content (LVBS-HDR) to combine various visual saliency features. In our approach various conspicuity maps are extracted from HDR data, and then for fusing conspicuity maps, a Random Forests algorithm is used to train a model based on the collected data from an eye-tracking experiment. Performance evaluations demonstrate the superiority of the proposed fusion method against other existing fusion methods.