We propose to boost VQA by leveraging more powerful feature extractors by improving the representation ability of both visual and text features and the ensemble of models. For visual feature, some detection techniques are used to improve the detector. For text feature, we adopt BERT as the language model and find that it can significantly improve VQA performance. Our solution won the second place in the VQA Challenge 2019.
We propose an Auto-Parsing Network (APN) to discover and exploit the input datas hidden tree structures for improving the effectiveness of the Transformer-based vision-language systems. Specifically, we impose a Probabilistic Graphical Model (PGM) parameterized by the attention operations on each self-attention layer to incorporate sparse assumption. We use this PGM to softly segment an input sequence into a few clusters where each cluster can be treated as the parent of the inside entities. By stacking these PGM constrained self-attention layers, the clusters in a lower layer compose into a new sequence, and the PGM in a higher layer will further segment this sequence. Iteratively, a sparse tree can be implicitly parsed, and this trees hierarchical knowledge is incorporated into the transformed embeddings, which can be used for solving the target vision-language tasks. Specifically, we showcase that our APN can strengthen Transformer based networks in two major vision-language tasks: Captioning and Visual Question Answering. Also, a PGM probability-based parsing algorithm is developed by which we can discover what the hidden structure of input is during the inference.
Is it possible to develop an AI Pathologist to pass the board-certified examination of the American Board of Pathology (ABP)? To build such a system, three challenges need to be addressed. First, we need to create a visual question answering (VQA) dataset where the AI agent is presented with a pathology image together with a question and is asked to give the correct answer. Due to privacy concerns, pathology images are usually not publicly available. Besides, only well-trained pathologists can understand pathology images, but they barely have time to help create datasets for AI research. The second challenge is: since it is difficult to hire highly experienced pathologists to create pathology visual questions and answers, the resulting pathology VQA dataset may contain errors. Training pathology VQA models using these noisy or even erroneous data will lead to problematic models that cannot generalize well on unseen images. The third challenge is: the medical concepts and knowledge covered in pathology question-answer (QA) pairs are very diverse while the number of QA pairs available for modeling training is limited. How to learn effective representations of diverse medical concepts based on limited data is technically demanding. In this paper, we aim to address these three challenges. To our best knowledge, our work represents the first one addressing the pathology VQA problem. To deal with the issue that a publicly available pathology VQA dataset is lacking, we create PathVQA dataset. To address the second challenge, we propose a learning-by-ignoring approach. To address the third challenge, we propose to use cross-modal self-supervised learning. We perform experiments on our created PathVQA dataset and the results demonstrate the effectiveness of our proposed learning-by-ignoring method and cross-modal self-supervised learning methods.
We describe a very simple bag-of-words baseline for visual question answering. This baseline concatenates the word features from the question and CNN features from the image to predict the answer. When evaluated on the challenging VQA dataset [2], it shows comparable performance to many recent approaches using recurrent neural networks. To explore the strength and weakness of the trained model, we also provide an interactive web demo and open-source code. .
This paper studies the task of Visual Question Answering (VQA), which is topical in Multimedia community recently. Particularly, we explore two critical research problems existed in VQA: (1) efficiently fusing the visual and textual modalities; (2) enabling the visual reasoning ability of VQA models in answering complex questions. To address these challenging problems, a novel Question Guided Modular Routing Networks (QGMRN) has been proposed in this paper. Particularly, The QGMRN is composed of visual, textual and routing network. The visual and textual network serve as the backbones for the generic feature extractors of visual and textual modalities. QGMRN can fuse the visual and textual modalities at multiple semantic levels. Typically, the visual reasoning is facilitated by the routing network in a discrete and stochastic way by using Gumbel-Softmax trick for module selection. When the input reaches a certain modular layer, routing network newly proposed in this paper, dynamically selects a portion of modules from that layer to process the input depending on the question features generated by the textual network. It can also learn to reason by routing between the generic modules without additional supervision information or expert knowledge. Benefiting from the dynamic routing mechanism, QGMRN can outperform the previous classical VQA methods by a large margin and achieve the competitive results against the state-of-the-art methods. Furthermore, attention mechanism is integrated into our QGMRN model and thus can further boost the model performance. Empirically, extensive experiments on the CLEVR and CLEVR-Humans datasets validate the effectiveness of our proposed model, and the state-of-the-art performance has been achieved.
Performance on the most commonly used Visual Question Answering dataset (VQA v2) is starting to approach human accuracy. However, in interacting with state-of-the-art VQA models, it is clear that the problem is far from being solved. In order to stress test VQA models, we benchmark them against human-adversarial examples. Human subjects interact with a state-of-the-art VQA model, and for each image in the dataset, attempt to find a question where the models predicted answer is incorrect. We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples. We conduct an extensive analysis of the collected adversarial examples and provide guidance on future research directions. We hope that this Adversarial VQA (AdVQA) benchmark can help drive progress in the field and advance the state of the art.