Existing image-text matching approaches typically leverage a triplet loss with online hard negatives to train the model. For each image or text anchor in a training mini-batch, the model is trained to distinguish between a positive and the most confusing negative of the anchor mined from the mini-batch (i.e., the online hard negative). This strategy improves the model's capacity to discover fine-grained correspondences and non-correspondences between image and text inputs. However, this approach has the following drawbacks: (1) the negative selection strategy still provides limited chances for the model to learn from very hard-to-distinguish cases; (2) the trained model generalizes weakly from the training set to the testing set; (3) the penalty lacks hierarchy and adaptiveness for hard negatives of different hardness degrees. In this paper, we address these issues by sampling negatives offline from the whole training set, which yields harder negatives than online hard negatives for the model to distinguish. Based on these offline hard negatives, a quintuplet loss is proposed to improve the model's generalization capability in distinguishing positives and negatives. In addition, a novel loss function that combines the knowledge of positives, offline hard negatives and online hard negatives is created. It leverages offline hard negatives as an intermediary to adaptively penalize negatives according to their distance relations to the anchor. We evaluate the proposed training approach on three state-of-the-art image-text matching models on the MS-COCO and Flickr30K datasets. Significant performance improvements are observed for all the models, demonstrating the effectiveness and generality of our approach. Code is available at https://github.com/sunnychencool/AOQ
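To make the objective concrete, below is a minimal PyTorch-style sketch of how a hinge-based triplet loss with online hard negatives can be extended with pre-mined offline hard negatives into a quintuplet-style objective. The function name, tensor layout, and margin values are illustrative assumptions and not the released AOQ implementation; see the repository above for the authors' actual code.

```python
# Minimal sketch (hypothetical, not the released AOQ code): a hinge-based
# triplet loss with online hard negatives, extended by offline hard negatives
# into a quintuplet-style objective. Tensor names and margins are assumptions.
import torch


def quintuplet_loss(img, txt, off_img_neg, off_txt_neg, margin=0.2, off_margin=0.1):
    """img, txt: (B, D) L2-normalized embeddings of matched image-text pairs.
    off_img_neg, off_txt_neg: (B, D) embeddings of offline hard negatives
    mined from the whole training set for each anchor."""
    scores = img @ txt.t()                       # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)              # positive-pair scores

    # Online hard negatives: hardest non-matching item within the mini-batch.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    online_i2t = scores.masked_fill(mask, -1e9).max(dim=1, keepdim=True)[0]
    online_t2i = scores.masked_fill(mask, -1e9).max(dim=0, keepdim=True)[0].t()

    # Offline hard negatives: pre-mined against the full training set.
    offline_i2t = (img * off_txt_neg).sum(dim=1, keepdim=True)
    offline_t2i = (txt * off_img_neg).sum(dim=1, keepdim=True)

    # Standard triplet terms: positive above the online hard negative.
    loss = (margin + online_i2t - pos).clamp(min=0).mean() \
         + (margin + online_t2i - pos).clamp(min=0).mean()
    # Quintuplet terms: positive above the offline hard negative, with the
    # offline negative acting as an intermediary relative to the online one.
    loss += (margin + offline_i2t - pos).clamp(min=0).mean() \
          + (margin + offline_t2i - pos).clamp(min=0).mean()
    loss += (off_margin + online_i2t - offline_i2t).clamp(min=0).mean() \
          + (off_margin + online_t2i - offline_t2i).clamp(min=0).mean()
    return loss
```

In this sketch the offline negatives, being mined from the entire training set, are treated as harder than the in-batch negatives, so the extra hinge terms enforce an ordering between the positive, the offline hard negative, and the online hard negative rather than a single fixed margin.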