ﻻ يوجد ملخص باللغة العربية
We propose a new attention model for video question answering. The main idea of the attention models is to locate on the most informative parts of the visual data. The attention mechanisms are quite popular these days. However, most existing visual attention mechanisms regard the question as a whole. They ignore the word-level semantics where each word can have different attentions and some words need no attention. Neither do they consider the semantic structure of the sentences. Although the Extended Soft Attention (E-SA) model for video question answering leverages the word-level attention, it performs poorly on long question sentences. In this paper, we propose the heterogeneous tree-structured memory network (HTreeMN) for video question answering. Our proposed approach is based upon the syntax parse trees of the question sentences. The HTreeMN treats the words differently where the textit{visual} words are processed with an attention module and the textit{verbal} ones not. It also utilizes the semantic structure of the sentences by combining the neighbors based on the recursive structure of the parse trees. The understandings of the words and the videos are propagated and merged from leaves to the root. Furthermore, we build a hierarchical attention mechanism to distill the attended features. We evaluate our approach on two datasets. The experimental results show the superiority of our HTreeMN model over the other attention models especially on complex questions. Our code is available on github. Our code is available at https://github.com/ZJULearning/TreeAttention
We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurr
Video Question Answering (VidQA) evaluation metrics have been limited to a single-word answer or selecting a phrase from a fixed set of phrases. These metrics limit the VidQA models application scenario. In this work, we leverage semantic roles deriv
We addressed the challenging task of video question answering, which requires machines to answer questions about videos in a natural language form. Previous state-of-the-art methods attempt to apply spatio-temporal attention mechanism on video frame
In this paper, we propose a novel end-to-end trainable Video Question Answering (VideoQA) framework with three major components: 1) a new heterogeneous memory which can effectively learn global context information from appearance and motion features;
This paper considers a network referred to as Modality Shifting Attention Network (MSAN) for Multimodal Video Question Answering (MVQA) task. MSAN decomposes the task into two sub-tasks: (1) localization of temporal moment relevant to the question, a