
MAVEN: Multi-Agent Variational Exploration

Published by Anuj Mahajan
Publication date: 2019
Paper language: English





Centralised training with decentralised execution is an important setting for cooperative deep multi-agent reinforcement learning due to communication constraints during execution and computational tractability in training. In this paper, we analyse value-based methods that are known to have superior performance in complex environments [43]. We specifically focus on QMIX [40], the current state-of-the-art in this domain. We show that the representational constraints on the joint action-values introduced by QMIX and similar methods lead to provably poor exploration and suboptimality. Furthermore, we propose a novel approach called MAVEN that hybridises value and policy-based methods by introducing a latent space for hierarchical control. The value-based agents condition their behaviour on the shared latent variable controlled by a hierarchical policy. This allows MAVEN to achieve committed, temporally extended exploration, which is key to solving complex multi-agent tasks. Our experimental results show that MAVEN achieves significant performance improvements on the challenging SMAC domain [43].
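The abstract describes the mechanism only at a high level. Below is a minimal, hypothetical sketch of the general idea: value-based agents condition on a shared latent variable sampled once per episode by a hierarchical policy, and their utilities are combined by a QMIX-style monotonic mixer. The class names, dimensions, and the simplified linear mixer are illustrative assumptions, not the authors' implementation.

    # Hypothetical sketch of latent-conditioned value agents; not the authors' code.
    import torch
    import torch.nn as nn

    N_AGENTS, N_ACTIONS, OBS_DIM, STATE_DIM, N_LATENT = 3, 5, 16, 32, 4

    class HierarchicalPolicy(nn.Module):
        """Samples a discrete latent z from the initial state; z is fixed for the episode."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_LATENT))
        def forward(self, s0):
            return torch.distributions.Categorical(logits=self.net(s0))

    class LatentConditionedAgent(nn.Module):
        """Per-agent utility network whose behaviour depends on the shared latent z."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(OBS_DIM + N_LATENT, 64), nn.ReLU(),
                                     nn.Linear(64, N_ACTIONS))
        def forward(self, obs, z_onehot):
            return self.net(torch.cat([obs, z_onehot], dim=-1))

    class MonotonicMixer(nn.Module):
        """Simplified QMIX-style mixer: non-negative weights keep Q_tot monotonic in each utility."""
        def __init__(self):
            super().__init__()
            self.hyper_w = nn.Linear(STATE_DIM, N_AGENTS)
            self.hyper_b = nn.Linear(STATE_DIM, 1)
        def forward(self, agent_qs, state):
            w = torch.abs(self.hyper_w(state))          # (batch, n_agents), >= 0
            b = self.hyper_b(state)
            return (w * agent_qs).sum(dim=-1, keepdim=True) + b

    # One decision step under a sampled latent: exploration stays committed to z all episode.
    policy, agent, mixer = HierarchicalPolicy(), LatentConditionedAgent(), MonotonicMixer()
    s0 = torch.randn(1, STATE_DIM)
    z = policy(s0).sample()
    z_onehot = torch.nn.functional.one_hot(z, N_LATENT).float().expand(N_AGENTS, -1)
    obs = torch.randn(N_AGENTS, OBS_DIM)
    utilities = agent(obs, z_onehot)                    # (n_agents, n_actions)
    chosen = utilities.max(dim=-1).values.unsqueeze(0)  # greedy utility per agent
    q_tot = mixer(chosen, s0)                           # joint value for the chosen actions

Because z is drawn once and held fixed, the joint behaviour it induces is committed and temporally extended rather than per-step dithering, which is the property the abstract emphasises.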




Read also

Training a multi-agent reinforcement learning (MARL) model is generally difficult because there are numerous combinations of complex interactions among agents that induce certain reward signals. Especially when there is a sparse reward signal, the training becomes more difficult. Previous studies have tried to resolve this issue by employing an intrinsic reward, which is a signal specifically designed for inducing the interactions among agents, to boost the MARL model training. However, this approach requires extensive prior knowledge to design an intrinsic reward. To optimize the training of an MARL model, we propose a learning-based exploration strategy to generate the initial states of a game. The proposed method adopts a variational graph autoencoder to represent a state of a game such that (1) the state can be compactly encoded to the latent representation by considering the relationship among agents, and (2) the latent representation can be used as an effective input to the surrogate model predicting the exploration score. The proposed method determines the latent representations that maximize the surrogate model and decodes these representations to generate the initial states from which the MARL model starts training. Empirically, we demonstrate that the generated states improve the training and performance of MARL more than the existing exploration methods.
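As a rough illustration of the pipeline this abstract describes (encode a state graph, score its latent with a surrogate, then search the latent space for high-scoring states), here is a hypothetical sketch. The single-layer graph encoder, the surrogate network, and the gradient-ascent search are assumptions for illustration only, and the decoder that maps the optimised latent back to a concrete initial state is omitted.

    # Hypothetical sketch of a VGAE-style encoder plus a surrogate exploration scorer.
    import torch
    import torch.nn as nn

    N_AGENTS, FEAT_DIM, LATENT_DIM = 4, 8, 6

    class VGAEEncoder(nn.Module):
        """One graph-convolution layer (adj @ X W) followed by per-node mu / logvar heads."""
        def __init__(self):
            super().__init__()
            self.lin = nn.Linear(FEAT_DIM, 16)
            self.mu = nn.Linear(16, LATENT_DIM)
            self.logvar = nn.Linear(16, LATENT_DIM)
        def forward(self, x, adj):
            h = torch.relu(adj @ self.lin(x))   # propagate features along agent relations
            return self.mu(h), self.logvar(h)

    def reparameterize(mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    surrogate = nn.Sequential(nn.Linear(N_AGENTS * LATENT_DIM, 32), nn.ReLU(), nn.Linear(32, 1))

    # Encode a state graph, then search the latent space for a high predicted exploration score.
    enc = VGAEEncoder()
    x = torch.randn(N_AGENTS, FEAT_DIM)           # per-agent features
    adj = torch.eye(N_AGENTS)                      # normalised adjacency (placeholder)
    mu, logvar = enc(x, adj)
    z = reparameterize(mu, logvar).detach().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=0.05)
    for _ in range(100):                           # gradient ascent on the surrogate score
        opt.zero_grad()
        loss = -surrogate(z.reshape(1, -1)).squeeze()
        loss.backward()
        opt.step()
    # A decoder (not shown) would map the optimised z back to an initial game state.

The sketch shows only the encode-score-search loop; the paper's model is additionally trained with the usual variational reconstruction objective over the agent graph.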
High sample complexity remains a barrier to the application of reinforcement learning (RL), particularly in multi-agent systems. A large body of work has demonstrated that exploration mechanisms based on the principle of optimism under uncertainty can significantly improve the sample efficiency of RL in single agent tasks. This work seeks to understand the role of optimistic exploration in non-cooperative multi-agent settings. We will show that, in zero-sum games, optimistic exploration can cause the learner to waste time sampling parts of the state space that are irrelevant to strategic play, as they can only be reached through cooperation between both players. To address this issue, we introduce a formal notion of strategically efficient exploration in Markov games, and use this to develop two strategically efficient learning algorithms for finite Markov games. We demonstrate that these methods can be significantly more sample efficient than their optimistic counterparts.
VDN and QMIX are two popular value-based algorithms for cooperative MARL that learn a centralized action value function as a monotonic mixing of per-agent utilities. While this enables easy decentralization of the learned policy, the restricted joint action value function can prevent them from solving tasks that require significant coordination between agents at a given timestep. We show that this problem can be overcome by improving the joint exploration of all agents during training. Specifically, we propose a novel MARL approach called Universal Value Exploration (UneVEn) that learns a set of related tasks simultaneously with a linear decomposition of universal successor features. With the policies of already solved related tasks, the joint exploration process of all agents can be improved to help them achieve better coordination. Empirical results on a set of exploration games, challenging cooperative predator-prey tasks requiring significant coordination among agents, and StarCraft II micromanagement benchmarks show that UneVEn can solve tasks where other state-of-the-art MARL methods fail.
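The phrase "linear decomposition of universal successor features" refers to representing task values as Q_w(s, a) = psi(s, a) . w, so that policies for already-solved related tasks can be evaluated and reused on a new task. Below is a hypothetical single-agent sketch of that decomposition with GPI-style action selection; it illustrates the successor-feature idea only, not the paper's multi-agent algorithm, and all names and dimensions are assumptions.

    # Illustrative sketch of successor features: Q_w(s, a) = psi(s, a) . w.
    import torch
    import torch.nn as nn

    OBS_DIM, N_ACTIONS, PHI_DIM = 16, 5, 8

    class SuccessorFeatureNet(nn.Module):
        """Predicts expected discounted feature sums psi(s, a) for every action."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                                     nn.Linear(64, N_ACTIONS * PHI_DIM))
        def forward(self, obs):
            return self.net(obs).view(-1, N_ACTIONS, PHI_DIM)

    psi_net = SuccessorFeatureNet()
    obs = torch.randn(1, OBS_DIM)
    psi = psi_net(obs)                                       # (1, n_actions, phi_dim)

    # Task weights: the target task plus a few already-solved related tasks.
    task_weights = torch.randn(3, PHI_DIM)

    # Q-values of every action under every task, then GPI-style action selection:
    # act greedily w.r.t. the best value achievable by any of the related policies.
    q_all = torch.einsum('bad,td->bta', psi, task_weights)   # (1, n_tasks, n_actions)
    action = q_all.max(dim=1).values.argmax(dim=-1)          # max over tasks, argmax over actions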
Shariq Iqbal, Fei Sha (2019)
Solving tasks with sparse rewards is one of the most important challenges in reinforcement learning. In the single-agent setting, this challenge is addressed by introducing intrinsic rewards that motivate agents to explore unseen regions of their state spaces; however, applying these techniques naively to the multi-agent setting results in agents exploring independently, without any coordination among themselves. Exploration in cooperative multi-agent settings can be accelerated and improved if agents coordinate their exploration. In this paper we introduce a framework for designing intrinsic rewards which consider what other agents have explored such that the agents can coordinate. Then, we develop an approach for learning how to dynamically select between several exploration modalities to maximize extrinsic rewards. Concretely, we formulate the approach as a hierarchical policy where a high-level controller selects among sets of policies trained on diverse intrinsic rewards and the low-level controllers learn the action policies of all agents under these specific rewards. We demonstrate the effectiveness of the proposed approach in cooperative domains with sparse rewards where state-of-the-art methods fail and challenging multi-stage tasks that necessitate changing modes of coordination.
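A hypothetical sketch of the hierarchical structure described here: several low-level policies, each standing in for a policy trained under a different intrinsic-reward modality, and a high-level controller whose selection among them is updated from extrinsic return with a REINFORCE-style gradient. The names, the toy rollout, and the single-agent simplification are assumptions, not the authors' implementation.

    # Hypothetical sketch: high-level selection over exploration modalities.
    import torch
    import torch.nn as nn

    N_MODES, OBS_DIM, N_ACTIONS = 4, 16, 6

    # One low-level policy per exploration modality (e.g. per intrinsic reward).
    low_level = nn.ModuleList([
        nn.Sequential(nn.Linear(OBS_DIM, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
        for _ in range(N_MODES)])

    # High-level controller: a learned preference over modes, trained on extrinsic return only.
    mode_logits = torch.zeros(N_MODES, requires_grad=True)
    opt = torch.optim.Adam([mode_logits], lr=0.1)

    def run_episode(policy):
        """Placeholder environment rollout; returns a fake extrinsic return."""
        return torch.randn(()).item()

    for episode in range(20):
        dist = torch.distributions.Categorical(logits=mode_logits)
        mode = dist.sample()
        ret = run_episode(low_level[int(mode)])
        # REINFORCE-style update of the mode-selection policy from extrinsic return.
        loss = -dist.log_prob(mode) * ret
        opt.zero_grad()
        loss.backward()
        opt.step()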
Multi-task learning (MTL) is an important subject in machine learning and artificial intelligence. Its applications to computer vision, signal processing, and speech recognition are ubiquitous. Although this subject has attracted considerable attenti on recently, the performance and robustness of the existing models to different tasks have not been well balanced. This article proposes an MTL model based on the architecture of the variational information bottleneck (VIB), which can provide a more effective latent representation of the input features for the downstream tasks. Extensive observations on three public data sets under adversarial attacks show that the proposed model is competitive to the state-of-the-art algorithms concerning the prediction accuracy. Experimental results suggest that combining the VIB and the task-dependent uncertainties is a very effective way to abstract valid information from the input features for accomplishing multiple tasks.
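For concreteness, a hypothetical sketch of a VIB-style multi-task objective: a stochastic bottleneck z regularised by a KL term, with each task loss scaled by a learned task-dependent uncertainty. The architecture, the uncertainty weighting form, and the beta value are illustrative assumptions, not the article's exact model.

    # Hypothetical sketch of a VIB multi-task loss with learned task uncertainties.
    import torch
    import torch.nn as nn

    IN_DIM, Z_DIM, N_TASKS, BETA = 32, 8, 2, 1e-3

    class VIBEncoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(IN_DIM, 64), nn.ReLU())
            self.mu, self.logvar = nn.Linear(64, Z_DIM), nn.Linear(64, Z_DIM)
        def forward(self, x):
            h = self.body(x)
            return self.mu(h), self.logvar(h)

    enc = VIBEncoder()
    heads = nn.ModuleList([nn.Linear(Z_DIM, 1) for _ in range(N_TASKS)])
    log_sigma = nn.Parameter(torch.zeros(N_TASKS))       # task-dependent uncertainties

    x = torch.randn(16, IN_DIM)
    targets = [torch.randn(16, 1) for _ in range(N_TASKS)]

    mu, logvar = enc(x)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterised bottleneck

    # KL(q(z|x) || N(0, I)) keeps the representation compressed (the bottleneck term).
    kl = 0.5 * (torch.exp(logvar) + mu**2 - 1.0 - logvar).sum(dim=-1).mean()

    # Each task loss is scaled by a learned uncertainty, so noisier tasks are down-weighted.
    task_losses = [nn.functional.mse_loss(heads[t](z), targets[t]) for t in range(N_TASKS)]
    weighted = sum(torch.exp(-2 * log_sigma[t]) * task_losses[t] + log_sigma[t]
                   for t in range(N_TASKS))
    loss = weighted + BETA * kl
    loss.backward()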

