ﻻ يوجد ملخص باللغة العربية
Making black box models explainable is a vital problem that accompanies the development of deep learning networks. For networks taking visual information as input, one basic but challenging explanation method is to identify and visualize the input pixels/regions that dominate the networks prediction. However, most existing works focus on explaining networks taking a single image as input and do not consider the temporal relationship that exists in videos. Providing an easy-to-use visual explanation method that is applicable to diversified structures of video understanding networks still remains an open challenge. In this paper, we investigate a generic perturbation-based method for visually explaining video understanding networks. Besides, we propose a novel loss function to enhance the method by constraining the smoothness of its results in both spatial and temporal dimensions. The method enables the comparison of explanation results between different network structures to become possible and can also avoid generating the pathological adversarial explanations for video inputs. Experimental comparison results verified the effectiveness of our method.
Our world offers a never-ending stream of visual stimuli, yet todays vision systems only accurately recognize patterns within a few seconds. These systems understand the present, but fail to contextualize it in past or future events. In this paper, w
Independently exploring unknown spaces or finding objects in an indoor environment is a daily but challenging task for visually impaired people. However, common 2D assistive systems lack depth relationships between various objects, resulting in diffi
Video recognition has been advanced in recent years by benchmarks with rich annotations. However, research is still mainly limited to human action or sports recognition - focusing on a highly specific video understanding task and thus leaving a signi
With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations. Most prior work formulates the objective as a con
In order to resonate with the viewers, many video advertisements explore creative narrative techniques such as Freytags pyramid where a story begins with exposition, followed by rising action, then climax, concluding with denouement. In the dramatic