In this paper, we consider the image captioning task from a new sequence-to-sequence prediction perspective and propose the CaPtion TransformeR (CPTR), which takes sequentialized raw image patches as input to a Transformer. Compared to the CNN+Transformer design paradigm, our model captures global context at every encoder layer from the very beginning and is entirely convolution-free. Extensive experiments demonstrate the effectiveness of the proposed model, and we surpass conventional CNN+Transformer methods on the MSCOCO dataset. In addition, thanks to the full Transformer architecture, we provide detailed visualizations of the patch-to-patch self-attention in the encoder and the words-to-patches cross-attention in the decoder.
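The idea in this abstract lends itself to a short sketch: split the raw image into fixed-size patches, linearly project each patch into a token, run a standard Transformer encoder over the patch sequence, and decode the caption autoregressively with words-to-patches cross-attention. Below is a minimal PyTorch illustration of that idea, not the authors' implementation; the class name `PatchCaptioner` and all hyperparameters (patch size 16, d_model 512, layer counts, vocabulary size) are illustrative assumptions rather than the paper's actual configuration.

```python
# Minimal sketch of a convolution-free patch-to-caption Transformer.
# Hypothetical names and hyperparameters; not the paper's exact setup.
import torch
import torch.nn as nn

class PatchCaptioner(nn.Module):
    def __init__(self, img_size=384, patch=16, d_model=512, vocab=10000,
                 enc_layers=6, dec_layers=4, nhead=8, max_len=20):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch = patch
        # Sequentialize raw pixels: each patch becomes one token via a
        # linear projection (no convolutions anywhere in the model).
        self.to_tokens = nn.Linear(patch * patch * 3, d_model)
        self.pos_enc = nn.Parameter(torch.zeros(1, n_patches, d_model))
        self.word_emb = nn.Embedding(vocab, d_model)
        self.pos_dec = nn.Parameter(torch.zeros(1, max_len, d_model))
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=enc_layers, num_decoder_layers=dec_layers,
            batch_first=True)
        self.head = nn.Linear(d_model, vocab)

    def patchify(self, imgs):
        # (B, 3, H, W) -> (B, N, patch*patch*3): cut into non-overlapping
        # patches and flatten each one into a vector.
        B, C, H, W = imgs.shape
        p = self.patch
        x = imgs.unfold(2, p, p).unfold(3, p, p)   # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        return x

    def forward(self, imgs, caption_ids):
        # Encoder self-attention over patch tokens gives global context at
        # every layer; the decoder attends from word tokens to patch tokens.
        src = self.to_tokens(self.patchify(imgs)) + self.pos_enc
        tgt = self.word_emb(caption_ids) + self.pos_dec[:, :caption_ids.size(1)]
        causal = self.transformer.generate_square_subsequent_mask(
            caption_ids.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal)
        return self.head(out)                      # (B, T, vocab) logits

model = PatchCaptioner()
imgs = torch.randn(2, 3, 384, 384)
caps = torch.randint(0, 10000, (2, 20))
print(model(imgs, caps).shape)  # torch.Size([2, 20, 10000])
```

At inference time, decoding would proceed token by token (greedy or beam search), feeding the previously generated words back into the decoder; the training-style teacher-forced call shown above is just the simplest way to exercise the shapes.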
Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect in captioning is the notion of attention: how to decide what to describe and in which order. Inspired by the successes in …
Descriptive region features extracted by object detection networks have played an important role in the recent advancements of image captioning. However, they are still criticized for the lack of contextual information and fine-grained details, which …
State-of-the-art image captioning methods mostly focus on improving visual features, while less attention has been paid to utilizing the inherent properties of language to boost captioning performance. In this paper, we show that vocabulary coherence between …
Automatic transcription of scene understanding in images and videos is a step towards artificial general intelligence. Image captioning is the term for the task of describing the meaningful information in an image using computer vision techniques. Automated image …
Existing image captioning methods focus only on understanding the relationships between objects or instances within a single image, without exploring the contextual correlation that exists among related images. In this paper, we propose Dual Graph Convolutional …