Despite progress in visual perception tasks such as image classification and detection, computers still struggle to understand the interdependency of objects in the scene as a whole, e.g., relations between objects or their attributes. Existing methods often ignore global context cues capturing the interactions among different object instances, and can only recognize a handful of types by exhaustively training individual detectors for all possible relationships. To capture such global interdependency, we propose a deep Variation-structured Reinforcement Learning (VRL) framework to sequentially discover object relationships and attributes in the whole image. First, a directed semantic action graph is built using language priors to provide a rich and compact representation of semantic correlations between object categories, predicates, and attributes. Next, we use a variation-structured traversal over the action graph to construct a small, adaptive action set for each step based on the current state and historical actions. In particular, an ambiguity-aware object mining scheme is used to resolve semantic ambiguity among object categories that the object detector fails to distinguish. We then make sequential predictions using a deep RL framework, incorporating global context cues and semantic embeddings of previously extracted phrases in the state vector. Our experiments on the Visual Relationship Detection (VRD) dataset and the large-scale Visual Genome dataset validate the superiority of VRL, which can achieve significantly better detection results on datasets involving thousands of relationship and attribute types. We also demonstrate that VRL is able to predict unseen types embedded in our action graph by learning correlations on shared graph nodes.
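The abstract above describes a concrete decision loop: at each step, the agent traverses the semantic action graph to assemble a small, state-dependent action set instead of scoring the full predicate vocabulary. The following is a minimal sketch of that idea, not the authors' code; the toy graph, the history-based filtering, and the placeholder action scorer are all illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of one VRL decision step: traverse
# the semantic action graph from the current object to build a small,
# state-dependent action set, then pick the next relationship phrase.
import random

# Directed semantic action graph: object category -> predicate -> target categories.
ACTION_GRAPH = {
    "person": {"ride": ["horse", "bike"], "wear": ["hat", "shirt"]},
    "horse": {"on": ["grass", "street"]},
}

def variation_structured_actions(current_obj, history, k=3):
    """Keep only predicates reachable from `current_obj` in the graph,
    dropping phrases already extracted on earlier steps."""
    candidates = [
        (pred, tgt)
        for pred, targets in ACTION_GRAPH.get(current_obj, {}).items()
        for tgt in targets
        if (current_obj, pred, tgt) not in history
    ]
    random.shuffle(candidates)
    return candidates[:k]  # small adaptive subset, not the full predicate vocabulary

history = set()
actions = variation_structured_actions("person", history)
if actions:
    # A learned Q-function over the state vector (global context features plus
    # embeddings of previously extracted phrases) would score `actions` here;
    # taking the first candidate stands in for argmax_a Q(s, a).
    pred, tgt = actions[0]
    history.add(("person", pred, tgt))
    print("extracted phrase:", ("person", pred, tgt))
```

Restricting each step to graph-reachable predicates is what keeps the action space tractable when the dataset contains thousands of relationship and attribute types.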
We propose a hierarchically structured reinforcement learning approach to address the challenges of planning for generating coherent multi-sentence stories for the visual storytelling task. Within our framework, the task of generating a story given a sequence of images is divided across a two-level hierarchical decoder.
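Assuming the two-level decomposition described above (a high-level policy that plans a topic per image and a low-level policy that realizes each topic as a sentence), the control flow looks roughly like the toy sketch below; every name and the canned outputs are hypothetical placeholders for learned models.

```python
# A toy sketch of the assumed two-level hierarchy: a high-level policy plans a
# topic per image, and a low-level policy realizes each topic as one sentence.
from typing import List

def high_level_policy(image_features: List[List[float]]) -> List[str]:
    """Plan one semantic topic per image; in a real system this is a learned policy."""
    topics = ["arrival", "activity", "farewell"]
    return [topics[i % len(topics)] for i in range(len(image_features))]

def low_level_policy(topic: str, image_feature: List[float]) -> str:
    """Generate a sentence conditioned on the planned topic; placeholder for a trained decoder."""
    return f"A sentence about the {topic} in this image."

def generate_story(image_features: List[List[float]]) -> str:
    plan = high_level_policy(image_features)
    return " ".join(
        low_level_policy(topic, feat) for topic, feat in zip(plan, image_features)
    )

print(generate_story([[0.1], [0.2], [0.3]]))  # a three-image album yields a three-sentence story
```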
Visual relationship detection aims to reason over relationships among salient objects in images, and has drawn increasing attention over the past few years. Inspired by human reasoning mechanisms, it is believed that external visual commonsense knowledge is beneficial for reasoning about the visual relationships of objects in images.
Dialogue policy plays an important role in task-oriented spoken dialogue systems: it determines how the system responds to users. Recently proposed deep reinforcement learning (DRL) approaches have been used for policy optimization. However, these deep models are still challenging for two reasons: many DRL-based policies are not sample-efficient, and most models lack the capability of policy transfer between different domains.
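For concreteness, a DRL dialogue policy of the kind referenced above is typically a Q-network mapping a belief-state feature vector to scores over system dialogue acts. The sketch below assumes PyTorch; the action inventory and the 10-dimensional state featurization are illustrative assumptions, not from the paper.

```python
# A sketch of a DRL dialogue policy as a Q-network over belief-state features.
import torch
import torch.nn as nn

ACTIONS = ["request(area)", "inform(name)", "confirm(food)", "bye()"]

class DialoguePolicy(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # one Q-value per system dialogue act

policy = DialoguePolicy(state_dim=10, n_actions=len(ACTIONS))
state = torch.randn(1, 10)               # belief-state features for the current turn
action_id = policy(state).argmax(dim=1)  # greedy action selection at test time
print("system action:", ACTIONS[action_id.item()])
```

During training, the Q-values would be fit with a standard temporal-difference objective against dialogue-level rewards such as task success and turn penalties.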
Deep Bregman divergence measures the divergence between data points using neural networks, going beyond Euclidean distance and capturing divergence over distributions. In this paper, we propose deep Bregman divergences for contrastive learning of visual representations.
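The underlying quantity is the functional Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>, with the generator phi parameterized by a neural network. Below is a minimal PyTorch sketch of that formula, not the authors' implementation; the small MLP stands in for phi, which would need to be convex (e.g., an input-convex network) for a true divergence.

```python
import torch
import torch.nn as nn

# phi should be convex for a proper Bregman divergence; this plain MLP is a
# purely illustrative stand-in.
phi = nn.Sequential(nn.Linear(4, 16), nn.Softplus(), nn.Linear(16, 1))

def bregman_divergence(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>, computed with autograd."""
    y = y.detach().requires_grad_(True)
    (grad_y,) = torch.autograd.grad(phi(y).sum(), y, create_graph=True)
    return phi(x).squeeze(-1) - phi(y).squeeze(-1) - ((x - y) * grad_y).sum(-1)

x, y = torch.randn(8, 4), torch.randn(8, 4)
print(bregman_divergence(x, y))  # one divergence value per pair
```

Such a learned divergence can then replace the Euclidean distance inside a standard contrastive loss.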
Visual 2.5D perception involves understanding the semantics and geometry of a scene through reasoning about object relationships with respect to the viewer in an environment. However, existing works in visual recognition primarily focus on the semantics.