X-GGM: Graph Generative Modeling for Out-of-Distribution Generalization in Visual Question Answering

115 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Jingjing Jiang

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Jingjing Jiang - Ziyi Liu - Yifan Liu

الرؤية الحاسوبية وتمييز الأنماط الوسائط المتعددة

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Encouraging progress has been made towards Visual Question Answering (VQA) in recent years, but it is still challenging to enable VQA models to adaptively generalize to out-of-distribution (OOD) samples. Intuitively, recompositions of existing visual concepts (i.e., attributes and objects) can generate unseen compositions in the training set, which will promote VQA models to generalize to OOD samples. In this paper, we formulate OOD generalization in VQA as a compositional generalization problem and propose a graph generative modeling-based training scheme (X-GGM) to handle the problem implicitly. X-GGM leverages graph generative modeling to iteratively generate a relation matrix and node representations for the predefined graph that utilizes attribute-object pairs as nodes. Furthermore, to alleviate the unstable training issue in graph generative modeling, we propose a gradient distribution consistency loss to constrain the data distribution with adversarial perturbations and the generated distribution. The baseline VQA model (LXMERT) trained with the X-GGM scheme achieves state-of-the-art OOD performance on two standard VQA OOD benchmarks, i.e., VQA-CP v2 and GQA-OOD. Extensive ablation studies demonstrate the effectiveness of X-GGM components.

قيم البحث

111 - Zixu Wang , Yishu Miao , Lucia Specia 2021

Data augmentation is an approach that can effectively improve the performance of multimodal machine learning. This paper introduces a generative model for data augmentation by leveraging the correlations among multiple modalities. Different from conv entional data augmentation approaches that apply low level operations with deterministic heuristics, our method proposes to learn an augmentation sampler that generates samples of the target modality conditioned on observed modalities in the variational auto-encoder framework. Additionally, the proposed model is able to quantify the confidence of augmented data by its generative probability, and can be jointly updated with a downstream pipeline. Experiments on Visual Question Answering tasks demonstrate the effectiveness of the proposed generative model, which is able to boost the strong UpDn-based models to the state-of-the-art performance.

الرؤية الحاسوبية وتمييز الأنماط الحساب واللغة

Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering

173 - Medhini Narasimhan , Svetlana Lazebnik , Alexander G. Schwing 2018

Accurately answering a question about a given image requires combining observations with general knowledge. While this is effortless for humans, reasoning with general knowledge remains an algorithmic challenge. To advance research in this direction a novel `fact-based visual question answering (FVQA) task has been introduced recently along with a large set of curated facts which link two entities, i.e., two possible answers, via a relation. Given a question-image pair, deep network techniques have been employed to successively reduce the large set of facts until one of the two entities of the final remaining fact is predicted as the answer. We observe that a successive process which considers one fact at a time to form a local decision is sub-optimal. Instead, we develop an entity graph and use a graph convolutional network to `reason about the correct answer by jointly considering all entities. We show on the challenging FVQA dataset that this leads to an improvement in accuracy of around 7% compared to the state of the art.

الرؤية الحاسوبية وتمييز الأنماط

Overcoming Language Priors with Self-supervised Learning for Visual Question Answering

111 - Xi Zhu , Zhendong Mao , Chunxiao Liu 2020

Most Visual Question Answering (VQA) models suffer from the language prior problem, which is caused by inherent data biases. Specifically, VQA models tend to answer questions (e.g., what color is the banana?) based on the high-frequency answers (e.g. , yellow) ignoring image contents. Existing approaches tackle this problem by creating delicate models or introducing additional visual annotations to reduce question dependency while strengthening image dependency. However, they are still subject to the language prior problem since the data biases have not been even alleviated. In this paper, we introduce a self-supervised learning framework to solve this problem. Concretely, we first automatically generate labeled data to balance the biased data, and propose a self-supervised auxiliary task to utilize the balanced data to assist the base VQA model to overcome language priors. Our method can compensate for the data biases by generating balanced data without introducing external annotations. Experimental results show that our method can significantly outperform the state-of-the-art, improving the overall accuracy from 49.50% to 57.59% on the most commonly used benchmark VQA-CP v2. In other words, we can increase the performance of annotation-based methods by 16% without using external annotations.

الرؤية الحاسوبية وتمييز الأنماط الوسائط المتعددة

Pathological Visual Question Answering

111 - Xuehai He , Zhuo Cai , Wenlan Wei 2020

Is it possible to develop an AI Pathologist to pass the board-certified examination of the American Board of Pathology (ABP)? To build such a system, three challenges need to be addressed. First, we need to create a visual question answering (VQA) da taset where the AI agent is presented with a pathology image together with a question and is asked to give the correct answer. Due to privacy concerns, pathology images are usually not publicly available. Besides, only well-trained pathologists can understand pathology images, but they barely have time to help create datasets for AI research. The second challenge is: since it is difficult to hire highly experienced pathologists to create pathology visual questions and answers, the resulting pathology VQA dataset may contain errors. Training pathology VQA models using these noisy or even erroneous data will lead to problematic models that cannot generalize well on unseen images. The third challenge is: the medical concepts and knowledge covered in pathology question-answer (QA) pairs are very diverse while the number of QA pairs available for modeling training is limited. How to learn effective representations of diverse medical concepts based on limited data is technically demanding. In this paper, we aim to address these three challenges. To our best knowledge, our work represents the first one addressing the pathology VQA problem. To deal with the issue that a publicly available pathology VQA dataset is lacking, we create PathVQA dataset. To address the second challenge, we propose a learning-by-ignoring approach. To address the third challenge, we propose to use cross-modal self-supervised learning. We perform experiments on our created PathVQA dataset and the results demonstrate the effectiveness of our proposed learning-by-ignoring method and cross-modal self-supervised learning methods.

الرؤية الحاسوبية وتمييز الأنماط

Simple Baseline for Visual Question Answering

234 - Bolei Zhou , Yuandong Tian , Sainbayar Sukhbaatar 2015

We describe a very simple bag-of-words baseline for visual question answering. This baseline concatenates the word features from the question and CNN features from the image to predict the answer. When evaluated on the challenging VQA dataset [2], it shows comparable performance to many recent approaches using recurrent neural networks. To explore the strength and weakness of the trained model, we also provide an interactive web demo and open-source code. .

الرؤية الحاسوبية وتمييز الأنماط الحساب واللغة