Relations among entities play a central role in image understanding. Because modeling (subject, predicate, object) relation triplets is complex, it is crucial to develop a method that can not only recognize seen relations but also generalize to unseen cases. Inspired by a previously proposed visual translation embedding model, VTransE, we propose a context-augmented translation embedding model that can capture both common and rare relations. The original VTransE model maps entities and predicates into a low-dimensional embedding vector space in which the predicate is interpreted as a translation vector between the embedded features of the subject and object bounding box regions. Our model additionally incorporates the contextual information captured by the bounding box of the union of the subject and the object, and learns the embeddings guided by the constraint predicate $\approx$ union(subject, object) $-$ subject $-$ object. In a comprehensive evaluation on multiple challenging benchmarks, our approach outperforms previous translation-based models and comes close to or exceeds the state of the art across a range of settings, from small-scale to large-scale datasets and from common to previously unseen relations. It also achieves promising results on the recently introduced task of scene graph generation.
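To make the constraint concrete, the following is a minimal sketch of how a union-guided translation embedding could be wired up. The module name (UnionTransE), the layer sizes, and the use of plain linear projections over pre-extracted ROI features are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class UnionTransE(nn.Module):
    """Sketch of a context-augmented translation embedding.

    Projects subject, object, and union-box appearance features into a
    shared low-dimensional space and scores predicates from the
    translation vector union - subject - object. All dimensions and
    layer choices here are illustrative assumptions.
    """

    def __init__(self, feat_dim=4096, embed_dim=256, num_predicates=70):
        super().__init__()
        # Separate projections for subject, object, and union features.
        self.proj_s = nn.Linear(feat_dim, embed_dim)
        self.proj_o = nn.Linear(feat_dim, embed_dim)
        self.proj_u = nn.Linear(feat_dim, embed_dim)
        # Classifier applied to the translation vector.
        self.predicate_head = nn.Linear(embed_dim, num_predicates)

    def forward(self, feat_s, feat_o, feat_u):
        s = self.proj_s(feat_s)
        o = self.proj_o(feat_o)
        u = self.proj_u(feat_u)
        # Constraint: predicate ~ union(subject, object) - subject - object
        p = u - s - o
        return self.predicate_head(p)

# Toy usage with random ROI-pooled features for a batch of 8 candidate pairs.
model = UnionTransE()
feat_s, feat_o, feat_u = (torch.randn(8, 4096) for _ in range(3))
logits = model(feat_s, feat_o, feat_u)  # shape: (8, num_predicates)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 70, (8,)))
```

Training then reduces to standard predicate classification over the translation vector, so the embedding space is shaped by the constraint without requiring explicit triplet-ranking losses.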