Learning Implicit Credit Assignment for Cooperative Multi-Agent Reinforcement Learning

224 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Ziyu Liu

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Meng Zhou - Ziyu Liu - Pengwei Sui

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We present a multi-agent actor-critic method that aims to implicitly address the credit assignment problem under fully cooperative settings. Our key motivation is that credit assignment among agents may not require an explicit formulation as long as (1) the policy gradients derived from a centralized critic carry sufficient information for the decentralized agents to maximize their joint action value through optimal cooperation and (2) a sustained level of exploration is enforced throughout training. Under the centralized training with decentralized execution (CTDE) paradigm, we achieve the former by formulating the centralized critic as a hypernetwork such that a latent state representation is integrated into the policy gradients through its multiplicative association with the stochastic policies; to achieve the latter, we derive a simple technique called adaptive entropy regularization where magnitudes of the entropy gradients are dynamically rescaled based on the current policy stochasticity to encourage consistent levels of exploration. Our algorithm, referred to as LICA, is evaluated on several benchmarks including the multi-agent particle environments and a set of challenging StarCraft II micromanagement tasks, and we show that LICA significantly outperforms previous methods.

قيم البحث

446 - Baiming Chen , Mengdi Xu , Zuxin Liu 2020

Action and observation delays exist prevalently in the real-world cyber-physical systems which may pose challenges in reinforcement learning design. It is particularly an arduous task when handling multi-agent systems where the delay of one agent cou ld spread to other agents. To resolve this problem, this paper proposes a novel framework to deal with delays as well as the non-stationary training issue of multi-agent tasks with model-free deep reinforcement learning. We formally define the Delay-Aware Markov Game that incorporates the delays of all agents in the environment. To solve Delay-Aware Markov Games, we apply centralized training and decentralized execution that allows agents to use extra information to ease the non-stationarity issue of the multi-agent systems during training, without the need of a centralized controller during execution. Experiments are conducted in multi-agent particle environments including cooperative communication, cooperative navigation, and competitive experiments. We also test the proposed algorithm in traffic scenarios that require coordination of all autonomous vehicles to show the practical value of delay-awareness. Results show that the proposed delay-aware multi-agent reinforcement learning algorithm greatly alleviates the performance degradation introduced by delay. Codes and demo videos are available at: https://github.com/baimingc/delay-aware-MARL.

التعلم الآلي أنظمة متعددة العملاء التعلم الالي

Provably Efficient Cooperative Multi-Agent Reinforcement Learning with Function Approximation

105 - Abhimanyu Dubey , Alex Pentland 2021

Reinforcement learning in cooperative multi-agent settings has recently advanced significantly in its scope, with applications in cooperative estimation for advertising, dynamic treatment regimes, distributed control, and federated learning. In this paper, we discuss the problem of cooperative multi-agent RL with function approximation, where a group of agents communicates with each other to jointly solve an episodic MDP. We demonstrate that via careful message-passing and cooperative value iteration, it is possible to achieve near-optimal no-regret learning even with a fixed constant communication budget. Next, we demonstrate that even in heterogeneous cooperative settings, it is possible to achieve Pareto-optimal no-regret learning with limited communication. Our work generalizes several ideas from the multi-agent contextual and multi-armed bandit literature to MDPs and reinforcement learning.

التعلم الآلي أنظمة متعددة العملاء التعلم الالي

A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning

117 - Nicolas Carion , Gabriel Synnaeve , Alessandro Lazaric 2019

Effective coordination is crucial to solve multi-agent collaborative (MAC) problems. While centralized reinforcement learning methods can optimally solve small MAC instances, they do not scale to large problems and they fail to generalize to scenario s different from those seen during training. In this paper, we consider MAC problems with some intrinsic notion of locality (e.g., geographic proximity) such that interactions between agents and tasks are locally limited. By leveraging this property, we introduce a novel structured prediction approach to assign agents to tasks. At each step, the assignment is obtained by solving a centralized optimization problem (the inference procedure) whose objective function is parameterized by a learned scoring model. We propose different combinations of inference procedures and scoring models able to represent coordination patterns of increasing complexity. The resulting assignment policy can be efficiently learned on small problem instances and readily reused in problems with more agents and tasks (i.e., zero-shot generalization). We report experimental results on a toy search and rescue problem and on several target selection scenarios in StarCraft: Brood War, in which our model significantly outperforms strong rule-based baselines on instances with 5 times more agents and tasks than those seen during training.

التعلم الآلي أنظمة متعددة العملاء التعلم الالي

Credit Assignment with Meta-Policy Gradient for Multi-Agent Reinforcement Learning

160 - Jianzhun Shao , Hongchang Zhang , Yuhang Jiang 2021

Reward decomposition is a critical problem in centralized training with decentralized execution~(CTDE) paradigm for multi-agent reinforcement learning. To take full advantage of global information, which exploits the states from all agents and the re lated environment for decomposing Q values into individual credits, we propose a general meta-learning-based Mixing Network with Meta Policy Gradient~(MNMPG) framework to distill the global hierarchy for delicate reward decomposition. The excitation signal for learning global hierarchy is deduced from the episode reward difference between before and after exercise updates through the utility network. Our method is generally applicable to the CTDE method using a monotonic mixing network. Experiments on the StarCraft II micromanagement benchmark demonstrate that our method just with a simple utility network is able to outperform the current state-of-the-art MARL algorithms on 4 of 5 super hard scenarios. Better performance can be further achieved when combined with a role-based utility network.

التعلم الآلي الذكاء الاصطناعي أنظمة متعددة العملاء

Cooperative Multi-Agent Transfer Learning with Level-Adaptive Credit Assignment

327 - Tianze Zhou , Fubiao Zhang , Kun Shao 2021

Extending transfer learning to cooperative multi-agent reinforcement learning (MARL) has recently received much attention. In contrast to the single-agent setting, the coordination indispensable in cooperative MARL constrains each agents policy. Howe ver, existing transfer methods focus exclusively on agent policy and ignores coordination knowledge. We propose a new architecture that realizes robust coordination knowledge transfer through appropriate decomposition of the overall coordination into several coordination patterns. We use a novel mixing network named level-adaptive QTransformer (LA-QTransformer) to realize agent coordination that considers credit assignment, with appropriate coordination patterns for different agents realized by a novel level-adaptive Transformer (LA-Transformer) dedicated to the transfer of coordination knowledge. In addition, we use a novel agent network named Population Invariant agent with Transformer (PIT) to realize the coordination transfer in more varieties of scenarios. Extensive experiments in StarCraft II micro-management show that LA-QTransformer together with PIT achieves superior performance compared with state-of-the-art baselines.

الذكاء الاصطناعي