ﻻ يوجد ملخص باللغة العربية
Audio and vision are two main modalities in video data. Multimodal learning, especially for audiovisual learning, has drawn considerable attention recently, which can boost the performance of various computer vision tasks. However, in video summarization, existing approaches just exploit the visual information while neglect the audio information. In this paper, we argue that the audio modality can assist vision modality to better understand the video content and structure, and further benefit the summarization process. Motivated by this, we propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this. Specifically, the proposed AVRN can be separated into three parts: 1) the two-stream LSTM is utilized to encode the audio and visual feature sequentially by capturing their temporal dependency. 2) the audiovisual fusion LSTM is employed to fuse the two modalities by exploring the latent consistency between them. 3) the self-attention video encoder is adopted to capture the global dependency in the video. Finally, the fused audiovisual information, and the integrated temporal and global dependencies are jointly used to predict the video summary. Practically, the experimental results on the two benchmarks, emph{i.e.,} SumMe and TVsum, have demonstrated the effectiveness of each part, and the superiority of AVRN compared to those approaches just exploiting visual information for video summarization.
Humans are arguably one of the most important subjects in video streams, many real-world applications such as video summarization or video editing workflows often require the automatic search and retrieval of a person of interest. Despite tremendous
A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes. Yet the importance of scenes in a video is often subjective, and users should have the option of customizing the summary by
We present a method for creating video summaries in real-time on commodity hardware. Real-time here refers to the fact that the time required for video summarization is less than the duration of the input video. First, low-level features are use to d
Video is an essential imaging modality for diagnostics, e.g. in ultrasound imaging, for endoscopy, or movement assessment. However, video hasnt received a lot of attention in the medical image analysis community. In the clinical practice, it is chall
Compact keyframe-based video summaries are a popular way of generating viewership on video sharing platforms. Yet, creating relevant and compelling summaries for arbitrarily long videos with a small number of keyframes is a challenging task. We propo