بحث متقدم مدعوم من الذكاء الصنعي

مساحة جديدة

اشترك بالحزمة الذهبية واحصل على وصول غير محدود شمرا أكاديميا

تسجيل مستخدم جديد

Simple Baseline for Visual Question Answering

235 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Bolei Zhou

تاريخ النشر 2015

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Bolei Zhou - Yuandong Tian - Sainbayar Sukhbaatar

الرؤية الحاسوبية وتمييز الأنماط الحساب واللغة

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We describe a very simple bag-of-words baseline for visual question answering. This baseline concatenates the word features from the question and CNN features from the image to predict the answer. When evaluated on the challenging VQA dataset [2], it shows comparable performance to many recent approaches using recurrent neural networks. To explore the strength and weakness of the trained model, we also provide an interactive web demo and open-source code. .

قيم البحث

423 - Sasha Sheng , Amanpreet Singh , Vedanuj Goswami 2021

Performance on the most commonly used Visual Question Answering dataset (VQA v2) is starting to approach human accuracy. However, in interacting with state-of-the-art VQA models, it is clear that the problem is far from being solved. In order to stre ss test VQA models, we benchmark them against human-adversarial examples. Human subjects interact with a state-of-the-art VQA model, and for each image in the dataset, attempt to find a question where the models predicted answer is incorrect. We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples. We conduct an extensive analysis of the collected adversarial examples and provide guidance on future research directions. We hope that this Adversarial VQA (AdVQA) benchmark can help drive progress in the field and advance the state of the art.

الرؤية الحاسوبية وتمييز الأنماط الحساب واللغة

Cross-Modal Generative Augmentation for Visual Question Answering

111 - Zixu Wang , Yishu Miao , Lucia Specia 2021

Data augmentation is an approach that can effectively improve the performance of multimodal machine learning. This paper introduces a generative model for data augmentation by leveraging the correlations among multiple modalities. Different from conv entional data augmentation approaches that apply low level operations with deterministic heuristics, our method proposes to learn an augmentation sampler that generates samples of the target modality conditioned on observed modalities in the variational auto-encoder framework. Additionally, the proposed model is able to quantify the confidence of augmented data by its generative probability, and can be jointly updated with a downstream pipeline. Experiments on Visual Question Answering tasks demonstrate the effectiveness of the proposed generative model, which is able to boost the strong UpDn-based models to the state-of-the-art performance.

الرؤية الحاسوبية وتمييز الأنماط الحساب واللغة

Assessing the Robustness of Visual Question Answering

120 - Jia-Hong Huang , Modar Alfadly , Bernard Ghanem 2019

Deep neural networks have been playing an essential role in the task of Visual Question Answering (VQA). Until recently, their accuracy has been the main focus of research. Now there is a trend toward assessing the robustness of these models against adversarial attacks by evaluating the accuracy of these models under increasing levels of noisiness in the inputs of VQA models. In VQA, the attack can target the image and/or the proposed query question, dubbed main question, and yet there is a lack of proper analysis of this aspect of VQA. In this work, we propose a new method that uses semantically related questions, dubbed basic questions, acting as noise to evaluate the robustness of VQA models. We hypothesize that as the similarity of a basic question to the main question decreases, the level of noise increases. To generate a reasonable noise level for a given main question, we rank a pool of basic questions based on their similarity with this main question. We cast this ranking problem as a LASSO optimization problem. We also propose a novel robustness measure Rscore and two large-scale basic question datasets in order to standardize robustness analysis of VQA models. The experimental results demonstrate that the proposed evaluation method is able to effectively analyze the robustness of VQA models. To foster the VQA research, we will publish our proposed datasets.

الرؤية الحاسوبية وتمييز الأنماط الحساب واللغة

Knowledge-Based Visual Question Answering in Videos

106 - Noa Garcia , Mayu Otani , Chenhui Chu 2020

We propose a novel video understanding task by fusing knowledge-based and video question answering. First, we introduce KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which need of the experience obtained from the viewing of the series to be answered. Second, we propose a video understanding model by combining the visual and textual video content with specific knowledge about the show. Our main findings are: (i) the incorporation of knowledge produces outstanding improvements for VQA in video, and (ii) the performance on KnowIT VQA still lags well behind human accuracy, indicating its usefulness for studying current video modelling limitations.

الرؤية الحاسوبية وتمييز الأنماط الحساب واللغة

VQABQ: Visual Question Answering by Basic Questions

122 - Jia-Hong Huang , Modar Alfadly , Bernard Ghanem 2017

Taking an image and question as the input of our method, it can output the text-based answer of the query question about the given image, so called Visual Question Answering (VQA). There are two main modules in our algorithm. Given a natural language question about an image, the first module takes the question as input and then outputs the basic questions of the main given question. The second module takes the main question, image and these basic questions as input and then outputs the text-based answer of the main question. We formulate the basic questions generation problem as a LASSO optimization problem, and also propose a criterion about how to exploit these basic questions to help answer main question. Our method is evaluated on the challenging VQA dataset and yields state-of-the-art accuracy, 60.34% in open-ended task.

الرؤية الحاسوبية وتمييز الأنماط الحساب واللغة

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

جامعة أسيوط

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Simple Baseline for Visual Question Answering

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً