Existing image captioning models are usually trained with cross-entropy (XE) loss and reinforcement learning (RL), which set ground-truth words as hard targets and force the captioning model to learn from them. However, these widely adopted training strategies suffer from misalignment in XE training and inappropriate reward assignment in RL training. To tackle these problems, we introduce a teacher model that serves as a bridge between the ground-truth captions and the caption model by generating easier-to-learn word proposals as soft targets. The teacher model is constructed by incorporating the ground-truth image attributes into the baseline caption model. To learn effectively from the teacher model, we propose Teacher-Critical Training Strategies (TCTS) for both XE and RL training to facilitate better learning processes for the caption model. Experimental evaluations of several widely adopted caption models on the benchmark MSCOCO dataset show that the proposed TCTS comprehensively improves most evaluation metrics, especially the Bleu and Rouge-L scores, in both training stages. TCTS achieves the best published single-model Bleu-4 and Rouge-L scores to date, 40.2% and 59.4%, on the MSCOCO Karpathy test split. Our code and pre-trained models will be open-sourced.
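As a rough illustration of the soft-target idea in the XE stage (not the paper's exact TCTS formulation), the objective can be sketched as the usual hard-target cross-entropy blended with a divergence term that pulls the caption model toward the teacher's word proposals. The function name, the interpolation weight `alpha`, and the tensor shapes below are assumptions made for this sketch only.

```python
import torch
import torch.nn.functional as F

def teacher_soft_xe_loss(student_logits, hard_targets, teacher_probs,
                         alpha=0.5, pad_id=0):
    """Hypothetical sketch: blend hard-target XE with a soft-target term.

    student_logits: (batch, seq_len, vocab) raw scores from the caption model
    hard_targets:   (batch, seq_len) ground-truth word indices
    teacher_probs:  (batch, seq_len, vocab) soft word proposals from the teacher
    alpha:          assumed interpolation weight between hard and soft targets
    """
    vocab = student_logits.size(-1)
    log_p = F.log_softmax(student_logits, dim=-1)

    # Standard hard-target cross-entropy on the ground-truth words.
    hard_loss = F.nll_loss(
        log_p.view(-1, vocab), hard_targets.view(-1),
        ignore_index=pad_id, reduction="mean",
    )

    # Soft-target term: per-token KL divergence from teacher proposals
    # to the student's predicted distribution, averaged over non-pad tokens.
    mask = (hard_targets != pad_id).float()
    soft_kl = F.kl_div(log_p, teacher_probs, reduction="none").sum(-1)
    soft_loss = (soft_kl * mask).sum() / mask.sum().clamp(min=1.0)

    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```

In this sketch the teacher's distribution would come from a model conditioned on ground-truth image attributes, so its proposals are easier for the caption model to match than the one-hot ground-truth words alone.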