
Image Captioning

Generating textual descriptions for images

Publication date: 2018
Research language: Arabic
Created by: adam oudaimah





No English abstract


Artificial intelligence review:
Research summary
This paper develops a new attention-based model for generating textual descriptions of images. The model is trained with backpropagation by stochastically maximizing a variational lower bound, using the MS COCO dataset. The approach relies on convolutional neural networks (CNNs) to extract vector representations of the images and recurrent neural networks (RNNs) to generate the textual description. The paper emphasizes the role of attention in human vision and shows how the model can correct itself when it generates words that do not match the objects present in the image. The architectures of the convolutional and recurrent networks are explained in detail, along with how the model is trained using the TensorFlow library. Training results on 10,000 images from the MS COCO dataset are reported, with an accuracy of roughly 70%.
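
The summary names the building blocks (a TensorFlow implementation, a CNN encoder, an RNN decoder, soft attention) but does not reproduce the paper's code. The following is a minimal sketch of such a pipeline, assuming an InceptionV3 encoder, a GRU decoder, Bahdanau-style soft attention, and illustrative hyperparameters; it is not the authors' implementation.

```python
# Minimal encoder/decoder sketch of the pipeline described above
# (assumed hyperparameters and layer choices, not the authors' code).
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, UNITS = 10000, 256, 512   # illustrative values

# Encoder: a frozen InceptionV3 without its classifier head maps a 299x299
# image to an 8x8x2048 feature map, reshaped into 64 locations to attend over.
cnn = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")
cnn.trainable = False

def encode(images):                                  # (batch, 299, 299, 3)
    features = cnn(images)                           # (batch, 8, 8, 2048)
    return tf.reshape(features, [tf.shape(features)[0], -1, 2048])

class BahdanauAttention(tf.keras.layers.Layer):
    """Soft attention: score every image location against the decoder state."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        scores = self.V(tf.nn.tanh(
            self.W1(features) + self.W2(tf.expand_dims(hidden, 1))))
        weights = tf.nn.softmax(scores, axis=1)      # (batch, 64, 1)
        context = tf.reduce_sum(weights * features, axis=1)
        return context, weights

class Decoder(tf.keras.Model):
    """One step: previous word + attended image context -> next-word logits."""
    def __init__(self):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.gru = tf.keras.layers.GRU(UNITS, return_state=True)
        self.fc = tf.keras.layers.Dense(VOCAB_SIZE)
        self.attention = BahdanauAttention(UNITS)

    def call(self, word_ids, features, hidden):      # word_ids: (batch, 1)
        context, _ = self.attention(features, hidden)
        x = self.embedding(word_ids)                 # (batch, 1, EMBED_DIM)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x, initial_state=hidden)
        return self.fc(output), state                # logits over the vocabulary
```

At training time such a decoder is unrolled one word at a time with teacher forcing and a cross-entropy loss against the reference caption (backpropagation through the unrolled graph); the initial hidden state can be zeros or a projection of the mean image feature.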
Critical review
The paper is a valuable contribution to attention-based image caption generation. However, it could be improved by a deeper analysis of the model's performance across different datasets and a comparison with similar models. It would also benefit from more detail on how the model is adapted to handle highly complex images, as well as a deeper analysis of the errors the model makes and how they can be corrected.
Questions related to the research
  1. What technique is used to train the model in the paper?

    The model is trained using backpropagation, by stochastically maximizing a variational lower bound.

  2. Which dataset is used to train the model?

    The MS COCO dataset was used to train the model.

  3. What accuracy did the model achieve when trained on the MS COCO dataset?

    The accuracy was roughly 70%.

  4. Which neural networks does the model use to generate the textual description?

    Convolutional neural networks (CNNs) were used to extract vector representations of the images, and recurrent neural networks (RNNs) were used to generate the textual description.


References used
Kelvin Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. 2016.
Zachary C. Lipton, John Berkowitz, Charles Elkan. A Critical Review of Recurrent Neural Networks for Sequence Learning. June 5, 2015.
CS231n: Convolutional Neural Networks for Visual Recognition (Stanford course notes).

Related research

Modern web content - news articles, blog posts, educational resources, marketing brochures - is predominantly multimodal. A notable trait is the inclusion of media such as images placed at meaningful locations within a textual narrative. Most often, such images are accompanied by captions - either factual or stylistic (humorous, metaphorical, etc.) - making the narrative more engaging to the reader. While standalone image captioning has been extensively studied, captioning an image based on external knowledge such as its surrounding text remains under-explored. In this paper, we study this new task: given an image and an associated unstructured knowledge snippet, the goal is to generate a contextual caption for the image.
The task of news article image captioning aims to generate descriptive and informative captions for news article images. Unlike conventional image captions that simply describe the content of the image in general terms, news image captions follow journalistic guidelines and rely heavily on named entities to describe the image content, often drawing context from the whole article they are associated with. In this work, we propose a new approach to this task, motivated by caption guidelines that journalists follow. Our approach, Journalistic Guidelines Aware News Image Captioning (JoGANIC), leverages the structure of captions to improve the generation quality and guide our representation design. Experimental results, including detailed ablation studies, on two large-scale publicly available datasets show that JoGANIC substantially outperforms state-of-the-art methods both on caption generation and named entity related metrics.
This research presents a robust method for hiding a text file inside an image using the least-significant-bit (LSB) algorithm combined with encryption of the text, which allows English and Arabic texts of various sizes to be stored and ensures that the text file is delivered correctly and secretly.
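
The abstract above names the least-significant-bit technique but gives no implementation details. Below is a generic Python sketch of LSB embedding and extraction (NumPy and Pillow, hypothetical file names); it is not the paper's implementation, and the encryption step the abstract mentions would be applied to the payload before embedding.

```python
# Generic LSB sketch: store each payload bit in the least significant bit of
# successive pixel bytes of a lossless (e.g. PNG) cover image.
import numpy as np
from PIL import Image

def embed(cover_path: str, payload: bytes, out_path: str) -> None:
    pixels = np.array(Image.open(cover_path).convert("RGB"))
    flat = pixels.flatten()
    # 32-bit big-endian length header followed by the payload, as a bit stream
    data = len(payload).to_bytes(4, "big") + payload
    bits = np.unpackbits(np.frombuffer(data, dtype=np.uint8))
    if bits.size > flat.size:
        raise ValueError("payload too large for cover image")
    flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits   # overwrite LSBs
    Image.fromarray(flat.reshape(pixels.shape)).save(out_path)  # keep lossless

def extract(stego_path: str) -> bytes:
    flat = np.array(Image.open(stego_path).convert("RGB")).flatten()
    length = int.from_bytes(np.packbits(flat[:32] & 1).tobytes(), "big")
    bits = flat[32:32 + 8 * length] & 1
    return np.packbits(bits).tobytes()

# usage (hypothetical files):
# embed("cover.png", "نص سري secret text".encode("utf-8"), "stego.png")
# print(extract("stego.png").decode("utf-8"))
```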
We propose Visual News Captioner, an entity-aware model for the task of news image captioning. We also introduce Visual News, a large-scale benchmark consisting of more than one million news images along with associated news articles, image captions, author information, and other metadata. Unlike the standard image captioning task, news images depict situations where people, locations, and events are of paramount importance. Our proposed method can effectively combine visual and textual features to generate captions with richer information such as events and entities. More specifically, built upon the Transformer architecture, our model is further equipped with novel multi-modal feature fusion techniques and attention mechanisms, which are designed to generate named entities more accurately. Our method utilizes far fewer parameters while achieving slightly better prediction results than competing methods. Our larger and more diverse Visual News dataset further highlights the remaining challenges in captioning news images.
Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker in comparison to reference-based metrics, e.g., news captions that require richer contextual knowledge.
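
As a rough illustration of the reference-free scoring idea, the snippet below computes an image-caption compatibility score with an off-the-shelf CLIP checkpoint via the Hugging Face port (an assumption; the authors use the original CLIP release). The ViT-B/32 checkpoint and the 2.5 rescaling of the clipped cosine similarity follow the CLIPScore formulation; the file name in the usage comment is hypothetical.

```python
# Sketch of a reference-free image-caption score in the spirit of CLIPScore:
# cosine similarity between CLIP image and text embeddings, clipped at zero
# and rescaled by 2.5.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return 2.5 * max(cos, 0.0)

# usage (hypothetical image file):
# print(clip_score(Image.open("photo.jpg"), "a dog playing in the park"))
```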
