The explosion of video data on the internet requires effective and efficient technology to generate captions automatically for people who are not able to watch the videos. Despite the great progress of video captioning research, particularly on video feature encoding, the language decoder is still largely based on prevailing RNN decoders such as LSTM, which tend to prefer frequent words that align with the video. In this paper, we propose a boundary-aware hierarchical language decoder for video captioning, which consists of a high-level GRU-based language decoder, working as a global (caption-level) language model, and a low-level GRU-based language decoder, working as a local (phrase-level) language model. Most importantly, we introduce a binary gate into the low-level GRU language decoder to detect language boundaries. Together with other advanced components, including joint video prediction, shared soft attention, and boundary-aware video encoding, our integrated video captioning framework can discover hierarchical language information and distinguish the subject from the object in a sentence, which are often confused during language generation. Extensive experiments on two widely used video captioning datasets, MSR-Video-to-Text (MSR-VTT) \cite{xu2016msr} and YouTube-to-Text (MSVD) \cite{chen2011collecting}, show that our method is highly competitive with state-of-the-art methods.
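To make the boundary mechanism concrete, here is a minimal sketch of a boundary-aware hierarchical decoder of the kind this abstract describes: a low-level GRU models phrases, a learned binary gate fires at phrase boundaries, and the high-level GRU state is updated only when the gate fires. This is an illustrative reading, not the authors' exact architecture; the layer sizes, the straight-through rounding of the gate, and the way the two states are combined at the output are all assumptions.

```python
# Sketch of a boundary-aware hierarchical GRU decoder (illustrative, assumed details).
import torch
import torch.nn as nn

class BoundaryAwareDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.low_gru = nn.GRUCell(embed_dim + hidden_dim, hidden_dim)  # phrase-level model
        self.high_gru = nn.GRUCell(hidden_dim, hidden_dim)             # caption-level model
        self.boundary = nn.Linear(hidden_dim, 1)                       # binary boundary gate
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, h_low, h_high):
        logits = []
        for t in range(tokens.size(1)):
            x = self.embed(tokens[:, t])
            # Low-level (phrase) state, conditioned on the caption-level state.
            h_low = self.low_gru(torch.cat([x, h_high], dim=-1), h_low)
            # Hard 0/1 boundary decision; straight-through estimator keeps it trainable.
            p = torch.sigmoid(self.boundary(h_low))
            b = (p > 0.5).float() + p - p.detach()
            # At a boundary, push the phrase summary up and reset the phrase state.
            h_high = b * self.high_gru(h_low, h_high) + (1 - b) * h_high
            h_low = (1 - b) * h_low
            logits.append(self.out(h_low + h_high))
        return torch.stack(logits, dim=1), h_low, h_high
```

In this reading, the reset of the low-level state at each detected boundary is what lets the two GRUs specialize into phrase-level and caption-level language models; the caller would initialize both hidden states (e.g., from the encoded video features) and train with the usual cross-entropy objective.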
Recent advances in video captioning often employ a recurrent neural network (RNN) as the decoder. However, RNNs are prone to diluting long-term information. Recent works have demonstrated that the memory network (MemNet) has the advantage of storing long-term information …
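For readers unfamiliar with MemNet-style decoding, the sketch below shows the core memory read such a decoder performs: the current decoder state attends over a bank of stored feature slots, so long-range information is fetched by addressing the memory rather than being carried through the recurrence. The dot-product scoring and tensor shapes here are assumptions for illustration, not details from the paper.

```python
# Sketch of a MemNet-style memory read (illustrative, assumed scoring function).
import torch
import torch.nn.functional as F

def memory_read(h, memory):
    """h: (batch, d) decoder state; memory: (batch, slots, d) stored features."""
    scores = torch.bmm(memory, h.unsqueeze(-1)).squeeze(-1)  # (batch, slots)
    attn = F.softmax(scores, dim=-1)                         # soft addressing weights
    return torch.bmm(attn.unsqueeze(1), memory).squeeze(1)   # (batch, d) read-out
```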
Dense video captioning aims to generate multiple associated captions, together with their temporal locations, from a video. Previous methods follow a sophisticated localize-then-describe scheme, which relies heavily on numerous hand-crafted components. In this …
Due to the rapid emergence of short videos and the demand for content understanding and creation, the video captioning task has received increasing attention in recent years. In this paper, we convert the traditional video captioning task into a new …
Video captioning aims to automatically generate natural language descriptions of video content, and has drawn a lot of attention in recent years. Generating accurate and fine-grained captions requires not only understanding the global content of the video, but also …
Taking full advantage of the information from both vision and language is critical for the video captioning task. Existing models lack adequate visual representation due to the neglect of interactions between objects, and sufficient training for content …