Video captioning is essentially a complex natural process, affected by various uncertainties stemming from video content, subjective judgment, etc. In this paper we build on recent progress in using the encoder-decoder framework for video captioning and address what we find to be a critical deficiency of existing methods: most decoders propagate deterministic hidden states, and such complex uncertainty cannot be modeled efficiently by deterministic models. We propose a generative approach, referred to as the multi-modal stochastic RNN network (MS-RNN), which models the uncertainty observed in the data using latent stochastic variables. As a result, MS-RNN improves the performance of video captioning and can generate multiple sentences describing a video under different random factors. Specifically, a multi-modal LSTM (M-LSTM) is first proposed to interact with both visual and textual features and capture a high-level representation. Then, a backward stochastic LSTM (S-LSTM) is proposed to support uncertainty propagation by introducing latent variables. Experimental results on the challenging MSVD and MSR-VTT datasets show that the proposed MS-RNN approach outperforms state-of-the-art methods on video captioning benchmarks.
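The abstract gives no implementation details, so the following is a minimal PyTorch-style sketch of the two components it names: a multi-modal LSTM that fuses visual and textual features, and a stochastic LSTM that propagates uncertainty through a latent Gaussian variable (here realized with the standard reparameterization trick). All class names, dimensions, and wiring choices are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch only; not the authors' MS-RNN code.
    import torch
    import torch.nn as nn

    class MultiModalLSTM(nn.Module):
        """M-LSTM-style cell: fuses visual and textual features before the LSTM step."""
        def __init__(self, visual_dim, word_dim, hidden_dim):
            super().__init__()
            self.visual_proj = nn.Linear(visual_dim, hidden_dim)
            self.word_proj = nn.Linear(word_dim, hidden_dim)
            self.lstm = nn.LSTMCell(2 * hidden_dim, hidden_dim)

        def forward(self, visual_feat, word_feat, state):
            # Concatenate projected visual and textual features as the step input.
            fused = torch.cat([torch.tanh(self.visual_proj(visual_feat)),
                               torch.tanh(self.word_proj(word_feat))], dim=-1)
            return self.lstm(fused, state)

    class StochasticLSTM(nn.Module):
        """S-LSTM-style cell: injects a latent Gaussian variable z at each step."""
        def __init__(self, input_dim, hidden_dim, latent_dim):
            super().__init__()
            self.to_mu = nn.Linear(hidden_dim, latent_dim)
            self.to_logvar = nn.Linear(hidden_dim, latent_dim)
            self.lstm = nn.LSTMCell(input_dim + latent_dim, hidden_dim)

        def forward(self, x, state):
            h, _ = state
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            # Reparameterization: sample z while keeping gradients through mu/logvar.
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            h_new, c_new = self.lstm(torch.cat([x, z], dim=-1), state)
            return (h_new, c_new), (mu, logvar)

Under this reading, drawing different samples of z at decoding time is what would let the model emit multiple distinct captions for the same video, and the (mu, logvar) outputs would feed a KL regularization term during training.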
Traditional video summarization methods generate fixed video representations regardless of user interest. Therefore, such methods limit users' expectations in content search and exploration scenarios. Multi-modal video summarization is one of the methods …
Video captioning targets interpreting complex visual content as text descriptions, which requires the model to fully understand video scenes, including objects and their interactions. Prevailing methods adopt off-the-shelf object detection networks …
Multi-modal information is essential to describe what has happened in a video. In this work, we represent videos by various appearance, motion, and audio information guided by the video topic. By following a multi-stage training strategy, our experiments …
Automatically describing video, or video captioning, has been widely studied in the multimedia field. This paper proposes a new task of sensor-augmented egocentric-video captioning, a newly constructed dataset for it called MMAC Captions, and a method …
This report describes our solution for the VATEX Captioning Challenge 2020, which requires generating descriptions for the videos in both English and Chinese. We identified three crucial factors that improve the performance, namely: multi-v…