Do you want to publish a course? Click here

Retrieval, Analogy, and Composition: A framework for Compositional Generalization in Image Captioning

استرجاع، تشبيه، وتكوين: إطار للتعميم التركيبي في تقسيم الصور

432   0   0   0.0 ( 0 )
 Publication date 2021
and research's language is English
 Created by Shamra Editor




Ask ChatGPT about the research

Image captioning systems are expected to have the ability to combine individual concepts when describing scenes with concept combinations that are not observed during training. In spite of significant progress in image captioning with the help of the autoregressive generation framework, current approaches fail to generalize well to novel concept combinations. We propose a new framework that revolves around probing several similar image caption training instances (retrieval), performing analogical reasoning over relevant entities in retrieved prototypes (analogy), and enhancing the generation process with reasoning outcomes (composition). Our method augments the generation model by referring to the neighboring instances in the training set to produce novel concept combinations in generated captions. We perform experiments on the widely used image captioning benchmarks. The proposed models achieve substantial improvement over the compared baselines on both composition-related evaluation metrics and conventional image captioning metrics.



References used
https://aclanthology.org/
rate research

Read More

Modern web content - news articles, blog posts, educational resources, marketing brochures - is predominantly multimodal. A notable trait is the inclusion of media such as images placed at meaningful locations within a textual narrative. Most often, such images are accompanied by captions - either factual or stylistic (humorous, metaphorical, etc.) - making the narrative more engaging to the reader. While standalone image captioning has been extensively studied, captioning an image based on external knowledge such as its surrounding text remains under-explored. In this paper, we study this new task: given an image and an associated unstructured knowledge snippet, the goal is to generate a contextual caption for the image.
هدفنا من خلال هذه الدراسة في إطار المشروع الفصلي للسنة الرابعة إلى إلقاء الضوء على استرجاع الصور من مجموعة كبيرة بالاعتماد على محتوى صورة هدف , و قمنا بتدعيم هذه الدراسة بتطبيق ضمن بيئة الماتلاب لبرنامج بحث عن الصور المشابهة لصورة مدخلة . و قد تركز بحثنا على ميزتين هامتين يكاد لا يخلو منها أي نظام بحث عن الصور بالاعتماد على المحتوى و هما ميزتي الهيستوغرام اللوني و بنية الصورة texture , ووضحنا الخطوات التي يتم في ضوئها عملية الاسترجاع بدءاً من تحليل الصورة و استخلاص شعاع الواصفات الخاص فيها , و مطابقته مع أشعة الميزات الخاصة بالصور الموجودة في قاعدة البيانات ليتم ترتيب الصور بحسب مدى تشابهها من الصورة الهدف . و تطرقت الدراسة إلى استخدام الفضاء اللوني HMMD كبديل للفضاء اللوني RGB لاستخراج واصفات البنية اللونية على اعتبار أنه نموذج لوني موجه بالمستخدم user oriented و بالتالي نضمن أن نحصل على نتائج أفضل ترضي المستخدم . وقمنا بتدعيم الدراسة بعدد من الأشكال و الأمثلة و المخططات التي توضح محتوى الدراسة النظرية و ما قمنا بعمله في التطبيق ضمن بيئة الماتلاب .
Although neural sequence-to-sequence models have been successfully applied to semantic parsing, they fail at compositional generalization, i.e., they are unable to systematically generalize to unseen compositions of seen components. Motivated by trad itional semantic parsing where compositionality is explicitly accounted for by symbolic grammars, we propose a new decoding framework that preserves the expressivity and generality of sequence-to-sequence models while featuring lexicon-style alignments and disentangled information processing. Specifically, we decompose decoding into two phases where an input utterance is first tagged with semantic symbols representing the meaning of individual words, and then a sequence-to-sequence model is used to predict the final meaning representation conditioning on the utterance and the predicted tag sequence. Experimental results on three semantic parsing datasets show that the proposed approach consistently improves compositional generalization across model architectures, domains, and semantic formalisms.
Content Based Medical Image Retrieval (CBMIR) systems are a new technique which researchers aim to integrate with Computer Aided Diagnosis systems. These systems usually find and retrieve images from a large image-database which have a similar conten t to a query image. Retrieval is done by extracting the visual features from the query image, formulating them in a features vector, comparing features vector components with those of the images in the database, and then, similarity measures are computed. Based on the similarity measures, images which have a similar content to the query image are retrieved. The introduced analysis study surveys and analyzes the current status of the CBMIR systems, evaluates our findings from this survey, and concludes some specific research directions in this field.
We describe a span-level supervised attention loss that improves compositional generalization in semantic parsers. Our approach builds on existing losses that encourage attention maps in neural sequence-to-sequence models to imitate the output of cla ssical word alignment algorithms. Where past work has used word-level alignments, we focus on spans; borrowing ideas from phrase-based machine translation, we align subtrees in semantic parses to spans of input sentences, and encourage neural attention mechanisms to mimic these alignments. This method improves the performance of transformers, RNNs, and structured decoders on three benchmarks of compositional generalization.

suggested questions

comments
Fetching comments Fetching comments
Sign in to be able to follow your search criteria
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا