Exploring fine-grained relationships between entities (e.g., objects in an image or words in a sentence) contributes greatly to the precise understanding of multimedia content. Previous attention mechanisms employed in image-text matching either take multiple self-attention steps to gather correspondences or use image objects (or words) as context to infer image-text similarity. However, they exploit only semantic information and ignore the fact that the relative positions of objects also contribute to image understanding. To this end, we introduce a novel position-aware relation module that models semantic and spatial relationships simultaneously for image-text matching. Given an image, our method exploits the locations of different objects to capture their spatial relationships. Combining semantic and spatial relationships makes it easier to understand the content of both modalities (images and sentences) and to capture the fine-grained latent correspondences of image-text pairs. In addition, we employ a two-step aggregated relation module to capture interpretable alignments between image-text pairs. The first step, the intra-modal relation mechanism, computes responses between different objects in an image or between different words in a sentence separately; the second step, the inter-modal relation mechanism, uses the query as textual context to refine the relationships among object proposals in an image. In this way, our position-aware aggregated relation network (ParNet) not only identifies which entities are relevant by attending to different objects (words) adaptively, but also adjusts the inter-modal correspondence according to the latent alignments induced by the query's content. Our approach achieves state-of-the-art results on the MS-COCO dataset.
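As a rough illustration of the two-step design, the sketch below shows one way the position-aware intra-modal step and the inter-modal step could be wired together in PyTorch. All names (pairwise_box_geometry, PositionAwareIntraRelation, InterModalRelation), the feature dimensions, and the particular way the pairwise box geometry biases the attention scores are illustrative assumptions rather than the paper's implementation; the inter-modal step is likewise reduced to a plain word-to-region cross-attention instead of the query-conditioned refinement of region relations described above.

```python
# Minimal sketch (assumptions, not the authors' code): region features come from a
# detector, word features from a text encoder, and boxes are (x1, y1, x2, y2).
import torch
import torch.nn as nn
import torch.nn.functional as F


def pairwise_box_geometry(boxes):
    """Relative geometry (dx, dy, dw, dh) between every pair of boxes: (N, 4) -> (N, N, 4)."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = (boxes[:, 2] - boxes[:, 0]).clamp(min=1e-3)
    h = (boxes[:, 3] - boxes[:, 1]).clamp(min=1e-3)
    dx = (cx[None, :] - cx[:, None]) / w[:, None]
    dy = (cy[None, :] - cy[:, None]) / h[:, None]
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)


class PositionAwareIntraRelation(nn.Module):
    """Intra-modal relation: semantic attention among regions, biased by pairwise box geometry."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.spatial = nn.Sequential(nn.Linear(4, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.scale = dim ** -0.5

    def forward(self, regions, boxes):
        sem = (self.q(regions) @ self.k(regions).t()) * self.scale    # (N, N) semantic scores
        spa = self.spatial(pairwise_box_geometry(boxes)).squeeze(-1)  # (N, N) spatial scores
        attn = F.softmax(sem + spa, dim=-1)
        return regions + attn @ self.v(regions)                       # residual refinement


class InterModalRelation(nn.Module):
    """Inter-modal relation (simplified): words attend over the refined regions."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, words, regions):
        scores = (self.q(words) @ self.k(regions).t()) * self.scale  # (M, N) word-region affinities
        attended = F.softmax(scores, dim=-1) @ regions               # (M, dim) region context per word
        # Image-text similarity as the mean cosine similarity of the latent word-region alignments.
        return F.cosine_similarity(words, attended, dim=-1).mean()


if __name__ == "__main__":
    dim, n_regions, n_words = 256, 36, 12
    regions = torch.randn(n_regions, dim)  # stand-in region features
    words = torch.randn(n_words, dim)      # stand-in word features
    xy1 = torch.rand(n_regions, 2)
    boxes = torch.cat([xy1, xy1 + torch.rand(n_regions, 2) * 0.5 + 0.05], dim=-1)

    refined = PositionAwareIntraRelation(dim)(regions, boxes)
    print("image-text similarity:", InterModalRelation(dim)(words, refined).item())
```

In this sketch the spatial bias is injected additively before the softmax, so semantic similarity and relative geometry jointly determine how strongly two regions attend to each other; this is one plausible reading of "modeling semantic and spatial relationships simultaneously", not a claim about the exact formulation used in the paper.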