Are we asking the right questions in MovieQA?

Published by Bhavan Jasani

Publication date: 2019

Research field: Informatics Engineering

Language: English

Authors: Bhavan Jasani, Rohit Girdhar, Deva Ramanan

Computer Vision and Pattern Recognition; Computation and Language

Joint vision and language tasks like visual question answering are fascinating because they explore high-level understanding, but at the same time, they can be more prone to language biases. In this paper, we explore the biases in the MovieQA dataset and propose a strikingly simple model which can exploit them. We find that using the right word embedding is of utmost importance. By using an appropriately trained word embedding, about half the Question-Answers (QAs) can be answered by looking at the questions and answers alone, completely ignoring narrative context from video clips, subtitles, and movie scripts. Compared to the best published papers on the leaderboard, our simple question + answer only model improves accuracy by 5% for the video + subtitle category, 5% for subtitle, 15% for DVS, and 6% for scripts.
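To make the idea of answering from the question and answers alone more concrete, here is a minimal sketch in Python. It is not the authors' model: it assumes each text is represented by the average of its word vectors and that the candidate answer closest to the question by cosine similarity is chosen. The names `embed`, `cosine`, `pick_answer`, and the tiny `TOY_EMBEDDINGS` table are hypothetical and exist only for illustration; a real system would load a trained word embedding instead.

```python
import math

# Hypothetical 2-dimensional toy vectors, for illustration only.
TOY_EMBEDDINGS = {
    "wizard": [1.0, 0.0],
    "magic": [0.9, 0.1],
    "wand": [0.8, 0.2],
    "car": [0.0, 1.0],
    "engine": [0.1, 0.9],
}


def embed(text, table):
    """Average the vectors of the words in `text` that appear in `table`."""
    vectors = [table[w] for w in text.lower().split() if w in table]
    if not vectors:
        # No known words: fall back to a zero vector of the right size.
        dim = len(next(iter(table.values())))
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_answer(question, answers, table):
    """Return the index of the answer most similar to the question.

    No video, subtitle, or script context is used: only the text of
    the question and of the candidate answers.
    """
    q = embed(question, table)
    scores = [cosine(q, embed(a, table)) for a in answers]
    return max(range(len(answers)), key=scores.__getitem__)


question = "what does the wizard use for magic"
answers = ["a car engine", "a wand"]
best = pick_answer(question, answers, TOY_EMBEDDINGS)
print(answers[best])
```

With these toy vectors the second candidate is selected, because its averaged vector points in nearly the same direction as the question's. The sketch only illustrates how a context-free baseline can be set up; the paper's reported accuracies depend on its own model and embedding choice.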

Amanpreet Singh, Vedanuj Goswami, Devi Parikh (2020)

Numerous recent works have proposed pretraining generic visio-linguistic representations and then finetuning them for downstream vision and language tasks. While architecture and objective function design choices have received attention, the choice of pretraining datasets has received little attention. In this work, we question some of the default choices made in literature. For instance, we systematically study how varying similarity between the pretraining dataset domain (textual and visual) and the downstream domain affects performance. Surprisingly, we show that automatically generated data in a domain closer to the downstream task (e.g., VQA v2) is a better choice for pretraining than natural data but of a slightly different domain (e.g., Conceptual Captions). On the other hand, some seemingly reasonable choices of pretraining datasets were found to be entirely ineffective for some downstream tasks. This suggests that despite the numerous recent efforts, vision & language pretraining does not quite work out of the box yet. Overall, as a by-product of our study, we find that simple design choices in pretraining can help us achieve close to state-of-the-art results on downstream tasks without any architectural changes.

Computer Vision and Pattern Recognition; Computation and Language

MovieQA: Understanding Stories in Movies through Question-Answering

Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen (2015)

We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 14,944 questions about 408 movies with high semantic diversity. The questions range from simpler Who did What to Whom, to Why and How certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information -- video clips, plots, subtitles, scripts, and DVS. We analyze our data through various statistics and methods. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We make this data set public along with an evaluation benchmark to encourage inspiring work in this challenging domain.

Computer Vision and Pattern Recognition; Computation and Language

Interactive Classification by Asking Informative Questions

Lili Yu, Howard Chen, Sida Wang (2019)

We study the potential for interaction in natural language classification. We add a limited form of interaction for intent classification, where users provide an initial query using natural language, and the system asks for additional information using binary or multi-choice questions. At each turn, our system decides between asking the most informative question or making the final classification prediction. The simplicity of the model allows for bootstrapping of the system without interaction data, instead relying on simple crowdsourcing tasks. We evaluate our approach on two domains, showing the benefit of interaction and the advantage of learning to balance between asking additional questions and making the final prediction.

Computation and Language; Human-Computer Interaction; Information Retrieval

Are we done with ImageNet?

Lucas Beyer, Olivier J. Henaff, Alexander Kolesnikov and Xiaohua Zhai (2020)

Yes, and no. We ask whether recent progress on the ImageNet classification benchmark continues to represent meaningful generalization, or whether the community has started to overfit to the idiosyncrasies of its labeling procedure. We therefore develop a significantly more robust procedure for collecting human annotations of the ImageNet validation set. Using these new labels, we reassess the accuracy of recently proposed ImageNet classifiers, and find their gains to be substantially smaller than those reported on the original labels. Furthermore, we find the original ImageNet labels to no longer be the best predictors of this independently-collected set, indicating that their usefulness in evaluating vision models may be nearing an end. Nevertheless, we find our annotation procedure to have largely remedied the errors in the original labels, reinforcing ImageNet as a powerful benchmark for future research in visual recognition.

Computer Vision and Pattern Recognition; Machine Learning

QACE: Asking Questions to Evaluate an Image Caption

Hwanhee Lee, Thomas Scialom, Seunghyun Yoon (2021)

In this paper, we propose QACE, a new metric based on Question Answering for Caption Evaluation. QACE generates questions on the evaluated caption and checks its content by asking the questions on either the reference caption or the source image. We first develop QACE-Ref that compares the answers of the evaluated caption to its reference, and report competitive results with the state-of-the-art metrics. To go further, we propose QACE-Img, which asks the questions directly on the image, instead of reference. A Visual-QA system is necessary for QACE-Img. Unfortunately, the standard VQA models are framed as a classification among only a few thousand categories. Instead, we propose Visual-T5, an abstractive VQA system. The resulting metric, QACE-Img is multi-modal, reference-less, and explainable. Our experiments show that QACE-Img compares favorably w.r.t. other reference-less metrics. We will release the pre-trained models to compute QACE.

Computation and Language; Computer Vision and Pattern Recognition
