Recent approaches to video captioning often employ a recurrent neural network (RNN) as the decoder. However, an RNN is prone to diluting long-term information. Recent works have demonstrated that the memory network (MemNet) has the advantage of storing long-term information, yet as a decoder it has not been well exploited for video captioning. The reason partially comes from the difficulty of sequence decoding with a MemNet. Instead of the common practice, i.e., sequence decoding with an RNN, in this paper we devise a novel memory decoder for video captioning. Concretely, after obtaining the representation of each frame through a pre-trained network, we first fuse the visual and lexical information. Then, at each time step, we construct a multi-layer MemNet-based decoder, i.e., in each layer, we employ a memory set to store previous information and an attention mechanism to select the information related to the current input. This decoder thus avoids the dilution of long-term information, and the multi-layer architecture helps capture dependencies between frames and word sequences. Experimental results show that even without the encoding network, our decoder still achieves competitive performance and outperforms the RNN decoder. Furthermore, compared with a one-layer RNN decoder, our decoder has fewer parameters.
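To make the decoder structure concrete, the following is a minimal sketch in PyTorch of what one layer of such a memory-based decoder could look like: a memory set holds the fused visual/lexical inputs of previous time steps, and a dot-product attention mechanism selects the entries relevant to the current input. The class name MemoryDecoderLayer, the layer sizes, and the fusion by concatenation are illustrative assumptions, not the authors' exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryDecoderLayer(nn.Module):
    """One layer of a memory-based decoder (illustrative sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)    # project current input to a query
        self.key = nn.Linear(dim, dim)      # project stored memories to keys
        self.value = nn.Linear(dim, dim)    # project stored memories to values
        self.out = nn.Linear(2 * dim, dim)  # fuse attended memory with input

    def forward(self, x, memory):
        # x:      (batch, dim)     current fused visual/lexical input
        # memory: (batch, t, dim)  memory set of all previous inputs
        q = self.query(x).unsqueeze(1)                    # (batch, 1, dim)
        k = self.key(memory)                              # (batch, t, dim)
        v = self.value(memory)                            # (batch, t, dim)
        scores = torch.bmm(q, k.transpose(1, 2))          # (batch, 1, t)
        attn = F.softmax(scores / k.size(-1) ** 0.5, dim=-1)
        context = torch.bmm(attn, v).squeeze(1)           # (batch, dim)
        # combine the selected long-term information with the current input
        return torch.tanh(self.out(torch.cat([x, context], dim=-1)))

# Usage sketch: stacking several such layers would give the multi-layer decoder
# described above; here a single layer is applied to dummy tensors.
layer = MemoryDecoderLayer(512)
x = torch.randn(4, 512)          # current-step input for a batch of 4
mem = torch.randn(4, 10, 512)    # 10 previous steps stored in the memory set
out = layer(x, mem)              # (4, 512)

Because every layer attends over the full memory set rather than a single recurrent state, earlier time steps remain directly accessible, which is the property the abstract credits with avoiding the dilution of long-term information.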
The explosion of video data on the internet requires effective and efficient technology to generate captions automatically for people who are not able to watch the videos. Despite the great progress of video captioning research, particularly on video
Dense video captioning aims to generate multiple associated captions with their temporal locations from the video. Previous methods follow a sophisticated localize-then-describe scheme, which heavily relies on numerous hand-crafted components. In thi
Recently, deep learning approaches, especially deep Convolutional Neural Networks (ConvNets), have achieved overwhelming accuracy with fast processing speed for image classification. Incorporating temporal structure with deep ConvNets for video represe
State-of-the-art image captioning methods mostly focus on improving visual features, while less attention has been paid to utilizing the inherent properties of language to boost captioning performance. In this paper, we show that vocabulary coherence betwe
Attention modules connecting encoders and decoders have been widely applied in the fields of object recognition, image captioning, visual question answering, and neural machine translation, and significantly improve performance. In this paper, we p