This paper studies audio-visual deep saliency prediction. It introduces a conceptually simple and effective Deep Audio-Visual Embedding for dynamic saliency prediction, dubbed ``DAVE'', in conjunction with our efforts towards building an Audio-Visual Eye-tracking corpus named ``AVE''. Despite the strong relation between auditory and visual cues in guiding gaze during perception, video saliency models consider only visual cues and neglect the auditory information that is ubiquitous in dynamic scenes. Here, we investigate the applicability of audio cues in conjunction with visual ones for predicting saliency maps using deep neural networks. To this end, the proposed model is intentionally kept simple. Two baseline models are developed on the same encoder-decoder architecture: the encoder projects the input into a feature space, and the decoder infers saliency from it. We conduct an extensive analysis of different modalities and various aspects of multi-modal dynamic saliency prediction. Our results suggest that (1) audio is a strong contributing cue for saliency prediction, (2) a salient, visible sound source is the natural cause of the superiority of our Audio-Visual model, (3) richer feature representations for the input space lead to more powerful predictions even in the absence of more sophisticated saliency decoders, and (4) the Audio-Visual model improves on 53.54% of the frames predicted by the best Visual model (our baseline). Our endeavour demonstrates that audio is an important cue that boosts dynamic video saliency prediction and helps models approach human performance. The code is available at https://github.com/hrtavakoli/DAVE
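As a rough illustration of the baseline design described above, the sketch below shows a minimal audio-visual encoder-decoder in PyTorch: separate encoders project the video clip and the audio signal into a feature space, the features are fused, and a decoder infers the saliency map. The layer choices, feature dimensions, input shapes, and the `AudioVisualSaliency` class name are illustrative assumptions, not the released DAVE implementation (see the linked repository for the actual code).

```python
# Minimal sketch (not the authors' code): an audio-visual encoder-decoder
# for saliency prediction. Assumes a 3D-conv video encoder, a 2D-conv audio
# encoder over a log-mel spectrogram, feature concatenation, and an
# upsampling decoder that outputs a single-channel saliency map.
import torch
import torch.nn as nn

class AudioVisualSaliency(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        # Video encoder: maps a clip (B, 3, T, H, W) to a spatial feature map.
        self.video_encoder = nn.Sequential(
            nn.Conv3d(3, feat_dim, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((1, 28, 28)),  # collapse time, keep space
        )
        # Audio encoder: maps a log-mel spectrogram (B, 1, F, T) to a vector.
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        # Decoder: fuses both modalities and upsamples to a saliency map.
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_dim * 2, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(feat_dim, 1, kernel_size=1),
            nn.Sigmoid(),  # per-pixel saliency in [0, 1]
        )

    def forward(self, video, audio):
        v = self.video_encoder(video).squeeze(2)       # (B, C, 28, 28)
        a = self.audio_encoder(audio)                  # (B, C, 1, 1)
        a = a.expand(-1, -1, v.size(2), v.size(3))     # broadcast audio over space
        return self.decoder(torch.cat([v, a], dim=1))  # (B, 1, 56, 56)


# Example: one 16-frame clip at 112x112 with a 64x96 log-mel spectrogram.
model = AudioVisualSaliency()
saliency = model(torch.randn(1, 3, 16, 112, 112), torch.randn(1, 1, 64, 96))
print(saliency.shape)  # torch.Size([1, 1, 56, 56])
```

In this sketch the two baselines would share the decoder and differ only in which encoder features are fed to it (visual-only versus fused audio-visual), mirroring the comparison described in the abstract.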