Exploiting Image--Text Synergy for Contextual Image Captioning

Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor

Abstract

Modern web content - news articles, blog posts, educational resources, marketing brochures - is predominantly multimodal. A notable trait is the inclusion of media such as images placed at meaningful locations within a textual narrative. Most often, such images are accompanied by captions - either factual or stylistic (humorous, metaphorical, etc.) - making the narrative more engaging to the reader. While standalone image captioning has been extensively studied, captioning an image based on external knowledge such as its surrounding text remains under-explored. In this paper, we study this new task: given an image and an associated unstructured knowledge snippet, the goal is to generate a contextual caption for the image.

References used

https://aclanthology.org/

Retrieval, Analogy, and Composition: A framework for Compositional Generalization in Image Captioning

Association for Computational Linguistics, 2021 (article)

Image captioning systems are expected to have the ability to combine individual concepts when describing scenes with concept combinations that are not observed during training. In spite of significant progress in image captioning with the help of the autoregressive generation framework, current approaches fail to generalize well to novel concept combinations. We propose a new framework that revolves around probing several similar image caption training instances (retrieval), performing analogical reasoning over relevant entities in retrieved prototypes (analogy), and enhancing the generation process with reasoning outcomes (composition). Our method augments the generation model by referring to the neighboring instances in the training set to produce novel concept combinations in generated captions. We perform experiments on the widely used image captioning benchmarks. The proposed models achieve substantial improvement over the compared baselines on both composition-related evaluation metrics and conventional image captioning metrics.

Keywords: text corpus; image captioning systems

Journalistic Guidelines Aware News Image Captioning

Association for Computational Linguistics, 2021 (article)

The task of news article image captioning aims to generate descriptive and informative captions for news article images. Unlike conventional image captions that simply describe the content of the image in general terms, news image captions follow journalistic guidelines and rely heavily on named entities to describe the image content, often drawing context from the whole article they are associated with. In this work, we propose a new approach to this task, motivated by caption guidelines that journalists follow. Our approach, Journalistic Guidelines Aware News Image Captioning (JoGANIC), leverages the structure of captions to improve the generation quality and guide our representation design. Experimental results, including detailed ablation studies, on two large-scale publicly available datasets show that JoGANIC substantially outperforms state-of-the-art methods both on caption generation and named entity related metrics.

Keywords: journalistic guidelines aware; article image captioning

LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

Association for Computational Linguistics, 2021 (article)

Multimodal pre-training has propelled great advancement in vision-and-language research. These large-scale pre-trained models, although successful, suffer from slow inference due to the enormous computational cost that comes mainly from cross-modal attention in the Transformer architecture. In real-life applications, such latency and computation demands severely deter the practical use of pre-trained models. In this paper, we study image-text retrieval (ITR), the most mature scenario of V+L application, which was widely studied even before the emergence of recent pre-trained models. We propose a simple yet highly effective approach, LightningDOT, that accelerates the inference time of ITR by thousands of times without sacrificing accuracy. LightningDOT removes the time-consuming cross-modal attention by extracting pre-cached feature indexes offline and employing instant dot-product matching online, which significantly speeds up the retrieval process. In fact, LightningDOT achieves superior performance across mainstream ITR benchmarks such as the Flickr30k and COCO datasets, outperforming existing pre-trained models that consume 1000 times the computational hours using the same features.
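The offline-cache and online dot-product idea described in this abstract can be illustrated with a minimal sketch. This is not the LightningDOT implementation: the function names (`build_index`, `retrieve`), the three-dimensional vectors, and the image file names are all hypothetical, and the embeddings are hand-written toy values rather than outputs of any trained encoder. The sketch only shows why retrieval becomes cheap once image features are cached: each query costs one dot product per cached image, with no cross-modal attention at query time.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_index(image_embeddings: Dict[str, Vector]) -> List[Tuple[str, Vector]]:
    """Offline step: store (image_id, embedding) pairs so they can be reused.

    In a real system the embeddings would come from an image encoder;
    here they are supplied directly (assumption for illustration).
    """
    return list(image_embeddings.items())


def retrieve(query: Vector, index: List[Tuple[str, Vector]], top_k: int = 2) -> List[Tuple[str, float]]:
    """Online step: score every cached image by dot product and keep the best top_k."""
    scored = [(image_id, dot(query, vec)) for image_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy, hand-made embeddings (hypothetical values, not real model outputs).
index = build_index({
    "dog.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.1, 0.8, 0.3],
    "city.jpg": [0.0, 0.1, 0.9],
})

# Pretend this vector is the text embedding of a query such as "waves on a sandy shore".
query = [0.2, 0.9, 0.1]
results = retrieve(query, index, top_k=2)
for image_id, score in results:
    print(image_id, round(score, 2))
```

With these toy values the beach image ranks first, because its cached vector is the one most aligned with the query vector. At realistic scale the linear scan would typically be replaced by an approximate nearest-neighbor index, which is a design choice outside what the abstract states.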

Keywords: pre-training visual-semantic embeddings; visual-semantic embeddings; embeddings for real-time

Visual News: Benchmark and Challenges in News Image Captioning

Association for Computational Linguistics, 2021 (article)

We propose Visual News Captioner, an entity-aware model for the task of news image captioning. We also introduce Visual News, a large-scale benchmark consisting of more than one million news images along with associated news articles, image captions, author information, and other metadata. Unlike the standard image captioning task, news images depict situations where people, locations, and events are of paramount importance. Our proposed method can effectively combine visual and textual features to generate captions with richer information such as events and entities. More specifically, built upon the Transformer architecture, our model is further equipped with novel multi-modal feature fusion techniques and attention mechanisms, which are designed to generate named entities more accurately. Our method uses far fewer parameters while achieving slightly better prediction results than competing methods. Our larger and more diverse Visual News dataset further highlights the remaining challenges in captioning news images.

Keywords: interactive machine; image captioning task

Image Captioning

Tishreen University, 2018 (graduation project)

We build an intelligent system that recognizes the object classes present in an image and generates a textual description of those objects. We used Convolutional Neural Networks to extract the classes present in the image, then fed these classes into a Recurrent Neural Network to generate the textual description.

Keywords: deep learning; convolutional neural networks; recurrent neural networks; image captioning
