Light emitted from a source into a scene can undergo complex interactions with scene surfaces of different material types before being reflected. During this transport, every surface reflection is encoded in the properties of the photons that reach the detector, including time, direction, intensity, wavelength and polarization. Conventional imaging systems capture intensity by integrating over all other dimensions of the light, hiding this rich scene information. Existing methods are capable of untangling these measurements into their spatial and temporal dimensions, fueling geometric scene understanding tasks. However, examining material properties jointly with geometric properties is an open challenge that could enable unprecedented capabilities beyond geometric scene understanding, allowing for material-dependent scene understanding and imaging through complex transport. In this work, we close this gap, and propose a computational light transport imaging method that captures the spatially- and temporally-resolved complete polarimetric response of a scene. Our method hinges on a 7D tensor theory of light transport. We discover low-rank structure in the polarimetric tensor dimension and propose a data-driven rotating ellipsometry method that learns to exploit redundancy of polarimetric structure. We instantiate our theory with two prototypes: spatio-polarimetric imaging and coaxial temporal-polarimetric imaging. This allows us, for the first time, to decompose scene light transport into temporal, spatial, and complete polarimetric dimensions that unveil scene properties hidden to conventional methods. We validate the applicability of our method on diverse tasks, including shape reconstruction with subsurface scattering, seeing through scattering media, untangling multi-bounce light transport, breaking metamerism, and decomposition of crystals.
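As a point of reference for the polarimetric dimension of such a transport tensor (this is classical dual-rotating-retarder ellipsometry, not the paper's data-driven rotating ellipsometry), the sketch below simulates a generator/analyzer rotation sequence and recovers a full 4x4 Mueller matrix by least squares; all optical elements, angles, and the synthetic sample are idealized assumptions.

```python
# Illustrative sketch only: classical dual-rotating-retarder Mueller polarimetry.
# Assumed setup: source -> polarizer(0) -> QWP(theta) -> sample M -> QWP(5*theta)
# -> polarizer(0) -> detector, so each measurement is I_k = a_k^T M p_k.
import numpy as np

def polarizer(theta):
    """Mueller matrix of an ideal linear polarizer with transmission axis at theta."""
    c, s = np.cos(2 * theta), np.sin(2 * theta)
    return 0.5 * np.array([[1, c, s, 0],
                           [c, c * c, c * s, 0],
                           [s, c * s, s * s, 0],
                           [0, 0, 0, 0]])

def qwp(theta):
    """Mueller matrix of an ideal quarter-wave plate with fast axis at theta."""
    c, s = np.cos(2 * theta), np.sin(2 * theta)
    return np.array([[1, 0, 0, 0],
                     [0, c * c, c * s, -s],
                     [0, c * s, s * s, c],
                     [0, s, -c, 0]])

def generator(theta):
    """Stokes vector sent into the scene for rotation angle theta."""
    return qwp(theta) @ polarizer(0.0) @ np.array([1.0, 0.0, 0.0, 0.0])

def analyzer(theta):
    """Row vector mapping the returning Stokes vector to detected intensity."""
    return (polarizer(0.0) @ qwp(5.0 * theta))[0]

def fit_mueller(intensities, angles):
    """Least-squares recovery of M from I_k = a_k^T M p_k."""
    A = np.stack([np.kron(analyzer(t), generator(t)) for t in angles])
    m, *_ = np.linalg.lstsq(A, intensities, rcond=None)
    return m.reshape(4, 4)

# Synthetic check with a partially depolarizing sample.
M_true = np.diag([1.0, 0.7, 0.7, 0.4])
angles = np.linspace(0.0, np.pi, 36, endpoint=False)
I = np.array([analyzer(t) @ M_true @ generator(t) for t in angles])
print(np.round(fit_mueller(I, angles), 3))  # ~M_true up to numerical noise
```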
Attribution methods provide a direction for interpreting opaque neural networks visually, by identifying and visualizing the input regions/pixels that dominate the output of a network. Devising attribution methods that visually explain video understanding networks is challenging because of the unique spatiotemporal dependencies in video inputs and the special 3D convolutional or recurrent structures of video understanding networks. However, most existing attribution methods focus on explaining networks that take a single image as input, and the few works devised specifically for video attribution fall short of handling the diverse structures of video understanding networks. In this paper, we investigate a generic perturbation-based attribution method that is compatible with diverse video understanding networks. In addition, we propose a novel regularization term that enhances the method by constraining the smoothness of its attribution results in both the spatial and temporal dimensions. To assess the effectiveness of different video attribution methods without relying on manual judgement, we introduce reliable objective metrics, which are validated by a newly proposed reliability measurement. We verify the effectiveness of our method through subjective and objective evaluation and through comparison with multiple significant attribution methods.
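As an illustration of the general idea (not the paper's exact formulation), the sketch below optimizes a soft spatio-temporal mask over a generic black-box PyTorch video classifier, with a total-variation term encouraging smoothness along space and time; the model interface, blur baseline, and all hyperparameters are assumptions.

```python
# Minimal sketch of perturbation-based video attribution with a spatio-temporal
# smoothness regularizer. Assumes `model` maps a (1, C, T, H, W) clip to class logits.
import torch
import torch.nn.functional as F

def video_attribution(model, video, target, steps=300, lam_area=1.0, lam_tv=0.1):
    """video: (1, C, T, H, W) tensor; returns a saliency mask of shape (1, 1, T, H, W)."""
    model.eval()
    # Blurred reference clip: what the network sees where the mask removes evidence.
    baseline = F.avg_pool3d(video, kernel_size=(1, 11, 11), stride=1, padding=(0, 5, 5))
    mask_param = torch.zeros((1, 1) + tuple(video.shape[2:]), requires_grad=True)
    opt = torch.optim.Adam([mask_param], lr=0.05)
    for _ in range(steps):
        m = torch.sigmoid(mask_param)                    # soft mask in [0, 1]
        perturbed = m * video + (1 - m) * baseline       # keep only masked-in evidence
        score = model(perturbed).softmax(dim=1)[0, target]
        # Spatio-temporal smoothness: penalize differences along T, H and W.
        tv = (m[..., 1:, :, :] - m[..., :-1, :, :]).abs().mean() \
           + (m[..., :, 1:, :] - m[..., :, :-1, :]).abs().mean() \
           + (m[..., :, :, 1:] - m[..., :, :, :-1]).abs().mean()
        # Preserve the target score with a small, smooth mask.
        loss = -score + lam_area * m.mean() + lam_tv * tv
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask_param).detach()
```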
With only bounding-box annotations in the spatial domain, existing video scene text detection (VSTD) benchmarks lack the temporal relations of text instances across video frames, which hinders the development of video text-related applications. In this paper, we systematically introduce a new large-scale benchmark named STVText4, a well-designed spatial-temporal detection metric (STDM), and a novel clustering-based baseline method referred to as Temporal Clustering (TC). STVText4 opens a challenging yet promising direction of VSTD, termed ST-VSTD, which targets detecting video scene texts in the spatial and temporal domains simultaneously. STVText4 contains more than 1.4 million text instances from 161,347 video frames of 106 videos, where each instance is annotated with not only its spatial bounding box and temporal range but also four intrinsic attributes, namely legibility, density, scale, and lifecycle, to facilitate the community. By continuously propagating identical texts through the video sequence, TC accurately outputs the spatial quadrilateral and temporal range of each text, setting a strong baseline for ST-VSTD. Experiments demonstrate the efficacy of our method and the great academic and practical value of STVText4. The dataset and code will be available soon.
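As a purely hypothetical illustration of the ST-VSTD output format (this is not the paper's Temporal Clustering algorithm), the sketch below greedily links per-frame detections into tracks via IoU, so that each text instance ends up with a spatial box per frame and a temporal range; axis-aligned boxes stand in for quadrilaterals to keep the example short.

```python
# Hypothetical greedy IoU linking of per-frame text detections into temporal tracks.
from dataclasses import dataclass, field

@dataclass
class Track:
    start: int                                   # first frame index
    boxes: list = field(default_factory=list)    # one (x1, y1, x2, y2) box per frame

    @property
    def end(self):
        return self.start + len(self.boxes) - 1  # last frame index (temporal range end)

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def link_detections(frames, thr=0.5):
    """frames: list over time of lists of boxes; returns tracks with spatial and temporal extent."""
    active, finished = [], []
    for t, dets in enumerate(frames):
        unmatched = list(dets)
        still_active = []
        for tr in active:
            # Match each active track to the detection that best overlaps its latest box.
            best = max(unmatched, key=lambda b: iou(tr.boxes[-1], b), default=None)
            if best is not None and iou(tr.boxes[-1], best) >= thr:
                tr.boxes.append(best)
                unmatched.remove(best)
                still_active.append(tr)
            else:
                finished.append(tr)              # text instance disappeared; close its lifecycle
        active = still_active + [Track(start=t, boxes=[b]) for b in unmatched]
    return finished + active
```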
We present an apparatus that converts every pulse of a pulsed light source into a pulse train in which the intensities of the different pulses are samples of the spatial or temporal frequency spectrum of the original pulse. In this way, the spectrum of the incident light can be measured by following the temporal response of a single detector. The apparatus is based on multiple round trips inside a 2f-cavity-like mirror arrangement in which the spectrum is spread on the back focal plane, and after each round trip a small section of the spectrum is allowed to escape. The apparatus is fibre-free, offers easy wavelength-range tunability, and a prototype achieves over 10% average efficiency in the near infrared. We demonstrate the application of the prototype to the efficient measurement of the joint spectrum of a non-degenerate biphoton source in which one of the photons is in the near infrared.
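As a back-of-the-envelope illustration of the readout principle (the loss and coupling values below are assumptions, not measured device parameters), the sketch maps a spectrum to a pulse train in which the n-th pulse carries the n-th spectral bin attenuated by n round trips, and inverts that map to recover the spectrum from a single detector's time trace.

```python
# Toy model of time-multiplexed spectral readout with a single detector.
import numpy as np

def pulse_train(spectrum, round_trip_loss=0.02, out_coupling=0.9):
    """Intensity of the n-th output pulse: the n-th spectral bin after n lossy round trips."""
    n = np.arange(len(spectrum))
    survival = (1 - round_trip_loss) ** n          # light remaining after n round trips
    return out_coupling * survival * spectrum

def recover_spectrum(pulses, round_trip_loss=0.02, out_coupling=0.9):
    """Invert the known per-round-trip attenuation to estimate the original spectrum."""
    n = np.arange(len(pulses))
    return pulses / (out_coupling * (1 - round_trip_loss) ** n)

# Gaussian spectral line around 800 nm, sampled in 64 bins.
true_spectrum = np.exp(-0.5 * ((np.linspace(780, 820, 64) - 800) / 5) ** 2)
measured = pulse_train(true_spectrum)
print(np.allclose(recover_spectrum(measured), true_spectrum))  # True
```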
In this work, we aim to segment and detect water in videos. Water detection is beneficial for applications such as video search and outdoor surveillance, and for systems such as unmanned ground vehicles and unmanned aerial vehicles. This specific problem, however, is less discussed than general texture recognition. Here, we analyze several motion properties of water. First, we describe a video pre-processing step that increases invariance against water reflections and water colours. Second, we investigate the temporal and spatial properties of water and derive corresponding local descriptors. The descriptors are used to locally classify the presence of water, and a binary water detection mask is generated through spatio-temporal Markov Random Field regularization of the local classifications. Third, we introduce the Video Water Database, containing several hours of water and non-water videos, to validate our algorithm. Experimental evaluation on the Video Water Database and the DynTex database indicates the effectiveness of the proposed algorithm, which outperforms multiple algorithms for dynamic texture recognition and material recognition by approximately 5% and 15%, respectively.
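As a rough illustration of one plausible local temporal descriptor (the paper's exact features may differ), the sketch below summarizes each spatial patch by the normalized temporal power spectrum of its mean brightness, a signature that tends to separate the flicker of water from static or rigidly moving textures; such descriptors would then feed a local classifier followed by spatio-temporal MRF regularization.

```python
# Assumed local temporal descriptor: binned power spectrum of patch brightness over time.
import numpy as np

def temporal_descriptor(clip, patch=16, n_bins=8):
    """clip: (T, H, W) grayscale video; returns (H // patch, W // patch, n_bins) descriptors."""
    T, H, W = clip.shape
    gh, gw = H // patch, W // patch
    desc = np.zeros((gh, gw, n_bins))
    for i in range(gh):
        for j in range(gw):
            # Mean brightness of the patch over time, made invariant to overall intensity.
            signal = clip[:, i*patch:(i+1)*patch, j*patch:(j+1)*patch].mean(axis=(1, 2))
            signal = signal - signal.mean()
            power = np.abs(np.fft.rfft(signal)) ** 2      # temporal power spectrum
            edges = np.linspace(0, len(power), n_bins + 1).astype(int)
            desc[i, j] = [power[a:b].sum() for a, b in zip(edges[:-1], edges[1:])]
            desc[i, j] /= desc[i, j].sum() + 1e-9         # normalize to a distribution
    return desc
```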
Large-area crop classification using multi-spectral imagery has been widely studied for several decades and is generally addressed with classical Random Forest classifiers. Recently, deep convolutional neural networks (DCNNs) have been proposed; however, these methods only achieve results comparable to Random Forests. In this work, we present a novel CNN-based architecture for large-area crop classification. Our methodology combines spatio-temporal analysis via a 3D CNN with temporal analysis via a 1D CNN. We evaluated the efficacy of our approach on the Yolo and Imperial County benchmark datasets. Our combined strategy outperforms both classical and recent DCNN-based methods in classification accuracy by 2%, while maintaining a minimal number of parameters and the lowest inference time.
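As a minimal sketch of such a hybrid design (layer widths, patch size, and the fusion scheme are assumptions, not the paper's exact architecture), the snippet below combines a 3D-CNN branch over spatio-temporal patches with a 1D-CNN branch over the centre pixel's temporal profile.

```python
# Assumed hybrid 3D-CNN + 1D-CNN classifier for multi-temporal, multi-spectral patches.
import torch
import torch.nn as nn

class HybridCropClassifier(nn.Module):
    def __init__(self, n_bands=6, n_classes=10):
        super().__init__()
        self.spatio_temporal = nn.Sequential(          # input: (B, bands, T, H, W)
            nn.Conv3d(n_bands, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),     # -> (B, 32)
        )
        self.temporal = nn.Sequential(                 # input: (B, bands, T)
            nn.Conv1d(n_bands, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),     # -> (B, 32)
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, patch):                          # patch: (B, bands, T, H, W)
        # Temporal profile of the centre pixel feeds the 1D branch.
        center = patch[..., patch.shape[-2] // 2, patch.shape[-1] // 2]
        return self.head(torch.cat([self.spatio_temporal(patch), self.temporal(center)], dim=1))

# Example: batch of 4 patches, 6 bands, 12 acquisition dates, 9x9 pixels.
logits = HybridCropClassifier()(torch.randn(4, 6, 12, 9, 9))
print(logits.shape)  # torch.Size([4, 10])
```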