Reinforcing an Image Caption Generator Using Off-Line Human Feedback

Published by: Paul Hongsuck Seo

Publication date: 2019

Research field: Informatics Engineering

Language: English

Authors: Paul Hongsuck Seo, Piyush Sharma, Tomer Levinboim

Computer Vision and Pattern Recognition, Computation and Language

Abstract

Human ratings are currently the most accurate way to assess the quality of an image captioning model, yet the only outcome typically used from an expensive human rating evaluation is a few overall statistics over the evaluation dataset. In this paper, we show that the signal from instance-level human caption ratings can be leveraged to improve captioning models, even when the number of caption ratings is several orders of magnitude smaller than the amount of caption training data. We employ a policy gradient method to maximize the human ratings as rewards in an off-policy reinforcement learning setting, where policy gradients are estimated by samples from a distribution that focuses on the captions in a caption ratings dataset. Our empirical evidence indicates that the proposed method learns to generalize the human raters' judgments to a previously unseen set of images, as judged by a different set of human judges, and additionally on a different, multi-dimensional side-by-side human evaluation procedure.
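As a rough illustration of the idea in the abstract, the toy sketch below treats human ratings as rewards and estimates policy gradients only from captions that appear in a ratings dataset. Everything in it is assumed for illustration: the softmax "policy" over four fixed candidate captions, the ratings themselves, the mean-rating baseline, and the omission of importance weights. It is not the authors' model or training procedure.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_on_ratings(logits, rated, steps=500, lr=0.1, seed=0):
    """REINFORCE-style updates using only rated captions as samples.

    `rated` is a list of (caption_index, rating) pairs. Samples are drawn
    from the ratings dataset rather than from the policy itself, which is
    what makes this toy setting off-policy. No importance weights are
    applied (a simplifying assumption of this sketch).
    """
    rng = random.Random(seed)
    logits = list(logits)
    # Mean rating as a baseline, so below-average captions are pushed down.
    baseline = sum(r for _, r in rated) / len(rated)
    for _ in range(steps):
        idx, rating = rng.choice(rated)
        probs = softmax(logits)
        advantage = rating - baseline
        for k in range(len(logits)):
            # Gradient of log softmax(logits)[idx] with respect to logits[k].
            grad_logp = (1.0 if k == idx else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_logp
    return softmax(logits)

# Hypothetical candidate captions for one image; the last one is unrated.
captions = ["a dog on grass", "a cat", "an animal outside", "a blurry photo"]
rated = [(0, 1.0), (1, 0.0), (2, 0.5)]  # hypothetical human ratings in [0, 1]

probs = train_on_ratings([0.0] * len(captions), rated)
best = captions[probs.index(max(probs))]
```

In this toy run the policy shifts probability mass toward the highest-rated caption and away from the lowest-rated one, while the unrated caption only changes through the softmax normalization.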

Marc Tanti, 2017

When a recurrent neural network (RNN) language model is used for caption generation, the image information can be fed to the neural network either by directly incorporating it in the RNN (conditioning the language model by 'injecting' image features) or in a layer following the RNN (conditioning the language model by 'merging' image features). While both options are attested in the literature, there is as yet no systematic comparison between the two. In this paper we empirically show that the choice of architecture has no especially detrimental effect on performance. The merge architecture does have practical advantages, as conditioning by merging allows the RNN's hidden state vector to shrink in size by up to four times. Our results suggest that the visual and linguistic modalities for caption generation need not be jointly encoded by the RNN, as that yields large, memory-intensive models with few tangible advantages in performance; rather, the multimodal integration should be delayed to a subsequent stage.

Neural and Evolutionary Computing, Computation and Language, Computer Vision and Pattern Recognition

Towards Diverse and Accurate Image Captions via Reinforcing Determinantal Point Process

Qingzhong Wang, Antoni B. Chan, 2019

Although significant progress has been made in the field of automatic image captioning, it is still a challenging task. Previous works normally pay much attention to improving the quality of the generated captions but ignore the diversity of captions. In this paper, we combine determinantal point process (DPP) and reinforcement learning (RL) and propose a novel reinforcing DPP (R-DPP) approach to generate a set of captions with high quality and diversity for an image. We show that R-DPP performs better on accuracy and diversity than using noise as a control signal (GANs, VAEs). Moreover, R-DPP is able to preserve the modes of the learned distribution. Hence, the beam search algorithm can be applied to generate a single accurate caption, which performs better than other RL-based models.

Computer Vision and Pattern Recognition, Computation and Language

What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?

Marc Tanti, Albert Gatt, Kenneth P. Camilleri, 2017

In neural image captioning systems, a recurrent neural network (RNN) is typically viewed as the primary 'generation' component. This view suggests that the image features should be 'injected' into the RNN. This is in fact the dominant view in the literature. Alternatively, the RNN can instead be viewed as only encoding the previously generated words. This view suggests that the RNN should only be used to encode linguistic features and that only the final representation should be 'merged' with the image features at a later stage. This paper compares these two architectures. We find that, in general, late merging outperforms injection, suggesting that RNNs are better viewed as encoders, rather than generators.
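The difference between the two architectures described above can be sketched structurally. The code below is a minimal, untrained toy under stated assumptions: the Elman-style RNN step, the random weights, the vector sizes, and the function names `inject_encode` and `merge_encode` are all hypothetical, and injection is shown in only one of its possible variants (image features concatenated to every word input). It is not the authors' implementation.

```python
import math
import random

def rand_matrix(rows, cols, rng):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def rnn_step(h, x, W_h, W_x):
    """One Elman-style RNN step: h' = tanh(W_h h + W_x x)."""
    return [
        math.tanh(sum(a * b for a, b in zip(W_h[i], h)) +
                  sum(a * b for a, b in zip(W_x[i], x)))
        for i in range(len(h))
    ]

def inject_encode(words, image, hidden, seed=0):
    """Inject: image features enter the RNN together with each word."""
    rng = random.Random(seed)
    W_h = rand_matrix(hidden, hidden, rng)
    W_x = rand_matrix(hidden, len(words[0]) + len(image), rng)
    h = [0.0] * hidden
    for w in words:
        h = rnn_step(h, w + image, W_h, W_x)  # word and image concatenated
    return h  # the RNN state itself is multimodal

def merge_encode(words, image, hidden, seed=0):
    """Merge: the RNN encodes words only; the image is joined afterwards."""
    rng = random.Random(seed)
    W_h = rand_matrix(hidden, hidden, rng)
    W_x = rand_matrix(hidden, len(words[0]), rng)
    h = [0.0] * hidden
    for w in words:
        h = rnn_step(h, w, W_h, W_x)
    return h + image  # linguistic state concatenated with image features

words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical word vectors
image_a = [0.2, 0.9]                        # hypothetical image features
image_b = [0.7, 0.1]

inject_a = inject_encode(words, image_a, hidden=4)
inject_b = inject_encode(words, image_b, hidden=4)
merge_a = merge_encode(words, image_a, hidden=4)
merge_b = merge_encode(words, image_b, hidden=4)
```

In the merge variant the RNN part of the representation is identical for both images, because the RNN never sees the image; in the inject variant the RNN state changes with the image.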

Computation and Language, Computer Vision and Pattern Recognition, Neural and Evolutionary Computing

QACE: Asking Questions to Evaluate an Image Caption

Hwanhee Lee, Thomas Scialom, Seunghyun Yoon, 2021

In this paper, we propose QACE, a new metric based on Question Answering for Caption Evaluation. QACE generates questions on the evaluated caption and checks its content by asking the questions on either the reference caption or the source image. We first develop QACE-Ref, which compares the answers of the evaluated caption to those of its reference, and report competitive results with the state-of-the-art metrics. To go further, we propose QACE-Img, which asks the questions directly on the image instead of on the reference. A Visual-QA system is necessary for QACE-Img. Unfortunately, the standard VQA models are framed as a classification among only a few thousand categories. Instead, we propose Visual-T5, an abstractive VQA system. The resulting metric, QACE-Img, is multi-modal, reference-less, and explainable. Our experiments show that QACE-Img compares favorably with other reference-less metrics. We will release the pre-trained models to compute QACE.

Computation and Language, Computer Vision and Pattern Recognition

Real-Time Sign Language Detection using Human Pose Estimation

Amit Moryossef, Ioannis Tsochantaridis, Roee Aharoni, 2020

We propose a lightweight real-time sign language detection model, as we identify the need for such a case in videoconferencing. We extract optical flow features based on human pose estimation and, using a linear classifier, show these features are meaningful with an accuracy of 80%, evaluated on the DGS Corpus. Using a recurrent model directly on the input, we see improvements of up to 91% accuracy, while still running in under 4 ms. We describe a demo application for sign language detection in the browser in order to demonstrate its potential use in videoconferencing applications.
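One way to picture "optical flow features based on human pose estimation" fed to a linear classifier is sketched below. This is a hedged illustration only: the definition of flow as per-keypoint displacement scaled by frame rate, the two-keypoint poses, and the hand-picked weights and bias are all assumptions of this sketch, not values or code from the paper. A real system would take keypoints from a pose estimator and learn the classifier weights from data.

```python
import math

def pose_flow(prev_pose, next_pose, fps):
    """Per-keypoint motion magnitude between two frames, scaled by frame rate.

    Each pose is a list of (x, y) keypoints in normalized image coordinates.
    """
    return [
        math.hypot(x1 - x0, y1 - y0) * fps
        for (x0, y0), (x1, y1) in zip(prev_pose, next_pose)
    ]

def linear_score(features, weights, bias):
    """Score of a linear classifier: w . f + b."""
    return sum(w * f for w, f in zip(weights, features)) + bias

def is_signing(features, weights, bias):
    """Predict 'signing' when the linear score is positive."""
    return linear_score(features, weights, bias) > 0.0

# Hypothetical poses with two keypoints (for example, the two wrists).
frame_0 = [(0.40, 0.60), (0.60, 0.60)]
frame_1 = [(0.43, 0.56), (0.60, 0.60)]  # first keypoint has moved

flow = pose_flow(frame_0, frame_1, fps=25)
no_motion = pose_flow(frame_0, frame_0, fps=25)

# Placeholder parameters chosen by hand for this example (not learned).
weights = [1.0, 1.0]
bias = -0.5

moving_pred = is_signing(flow, weights, bias)
still_pred = is_signing(no_motion, weights, bias)
```

With these placeholder parameters, the frame pair with hand movement is classified as signing and the static pair is not.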

Computer Vision and Pattern Recognition, Computation and Language
