
Policy Perturbation via Noisy Advantage Values for Cooperative Multi-agent Actor-Critic methods

Posted by: Jian Hu
Publication date: 2021
Research field: Informatics Engineering
Paper language: English

Recent works have applied Proximal Policy Optimization (PPO) to multi-agent cooperative tasks, such as Independent PPO (IPPO) and vanilla Multi-Agent PPO (MAPPO), which uses a centralized value function. However, previous literature shows that MAPPO may not perform as well as IPPO and the Fine-tuned QMIX on the StarCraft Multi-Agent Challenge (SMAC). MAPPO-Feature-Pruned (MAPPO-FP) improves the performance of MAPPO through carefully designed agent-specific features, which is not friendly to algorithmic utility. By contrast, we find that MAPPO suffers from The Policies Overfitting in Multi-agent Cooperation (POMAC) problem, as the agents learn their policies from the sampled shared advantage values. POMAC may then push the multi-agent policy updates in a suboptimal direction and prevent the agents from exploring better trajectories. In this paper, to solve the POMAC problem, we propose two novel policy perturbation methods, i.e., Noisy-Value MAPPO (NV-MAPPO) and Noisy-Advantage MAPPO (NA-MAPPO), which disturb the advantage values via random Gaussian noise. The experimental results show that our methods outperform Fine-tuned QMIX and MAPPO-FP, and achieve state-of-the-art performance on SMAC without agent-specific features. We open-source the code at https://github.com/hijkzzz/noisy-mappo.
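The core mechanism described above, perturbing the shared advantage estimates with Gaussian noise before each policy update, can be illustrated with a short sketch. The function below is our illustration, not the authors' API; the noise scale and the per-agent option are assumptions (see the repository above for the actual implementation).

```python
import numpy as np

def perturb_advantages(advantages, sigma=0.5, per_agent=True, seed=None):
    """Hedged sketch in the spirit of NA-MAPPO: add Gaussian noise to the
    shared advantage values each agent's policy is trained on, so that the
    agents' policy updates are decorrelated (the POMAC remedy).

    advantages: array of shape (n_agents, batch_size).
    """
    rng = np.random.default_rng(seed)
    if per_agent:
        # Independent noise for every agent and every sample.
        noise = rng.normal(0.0, sigma, size=advantages.shape)
    else:
        # One shared noise vector broadcast across agents.
        noise = rng.normal(0.0, sigma, size=(1, advantages.shape[1]))
    return advantages + noise

# Usage: the noisy advantages simply replace the originals in the
# PPO clipped surrogate loss, e.g.
# noisy_adv = perturb_advantages(shared_adv, sigma=0.5)
```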

Read also

Coordination is one of the essential problems in multi-agent systems. Typically, multi-agent reinforcement learning (MARL) methods treat agents equally, and the goal is to solve the Markov game to an arbitrary Nash equilibrium (NE) when multiple equilibria exist, thus lacking a solution for NE selection. In this paper, we treat agents unequally and consider the Stackelberg equilibrium as a potentially better convergence point than the Nash equilibrium in terms of Pareto superiority, especially in cooperative environments. Under Markov games, we formally define the bi-level reinforcement learning problem of finding a Stackelberg equilibrium. We propose a novel bi-level actor-critic learning method that allows agents to have different knowledge bases (and thus different levels of intelligence), while their actions can still be executed simultaneously and in a distributed manner. A convergence proof is given, and the resulting learning algorithm is tested against the state of the art. We find that the proposed bi-level actor-critic algorithm successfully converges to the Stackelberg equilibria in matrix games and finds an asymmetric solution in a highway merge environment.
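To make the Stackelberg view concrete, the toy solver below enumerates a bimatrix game: the follower best-responds to the leader's action, and the leader picks the action whose best response maximizes its own payoff. This is only an illustrative baseline under assumed payoff tables, not the paper's bi-level actor-critic, which learns the same leader-follower structure with function approximation.

```python
import numpy as np

def stackelberg_matrix_game(R_leader, R_follower):
    """Enumerate a Stackelberg solution of a bimatrix game: the follower
    best-responds to the observed leader action, and the leader picks the
    action whose best response yields it the highest payoff."""
    best_value, best_pair = -np.inf, None
    for a in range(R_leader.shape[0]):
        b = int(np.argmax(R_follower[a]))      # follower's best response to a
        if R_leader[a, b] > best_value:
            best_value, best_pair = R_leader[a, b], (a, b)
    return best_pair, best_value

# A cooperative game with two Nash equilibria of unequal value:
R = np.array([[4.0, 0.0], [0.0, 2.0]])
pair, value = stackelberg_matrix_game(R, R)    # selects (0, 0), the Pareto-superior point
```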
With the adoption of autonomous vehicles on our roads, we will witness a mixed-autonomy environment where autonomous and human-driven vehicles must learn to co-exist by sharing the same road infrastructure. To attain socially desirable behaviors, autonomous vehicles must be instructed to consider the utility of other vehicles around them in their decision-making process. In particular, we study the maneuver planning problem for autonomous vehicles and investigate how a decentralized reward structure can induce altruism in their behavior and incentivize them to account for the interests of other autonomous and human-driven vehicles. This is a challenging problem due to the ambiguity of a human driver's willingness to cooperate with an autonomous vehicle. Thus, in contrast with existing works that rely on behavior models of human drivers, we take an end-to-end approach and let the autonomous agents implicitly learn the decision-making process of human drivers only from experience. We introduce a multi-agent variant of the synchronous Advantage Actor-Critic (A2C) algorithm and train agents that coordinate with each other and can affect the behavior of human drivers to improve traffic flow and safety.
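A minimal way to express the decentralized, altruism-inducing reward idea is to blend an agent's own reward with the rewards of surrounding vehicles. The weighting scheme below is a hypothetical illustration; the paper's exact reward structure may differ.

```python
def altruistic_reward(ego_reward, others_rewards, altruism=0.3):
    """Hypothetical altruism-weighted reward: blend the ego vehicle's own
    reward with the mean reward of surrounding vehicles. `altruism` in [0, 1]
    controls how socially desirable the learned maneuvers become."""
    social_term = sum(others_rewards) / max(len(others_rewards), 1)
    return (1.0 - altruism) * ego_reward + altruism * social_term

# Example: a yielding maneuver that costs the ego agent a little but
# helps traffic flow for the surrounding vehicles.
print(altruistic_reward(ego_reward=-0.2, others_rewards=[0.5, 0.4, 0.6]))  # ~0.01
```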
Ryan Lowe, Yi Wu, Aviv Tamar (2017)
We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.
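The key architectural idea, a critic that conditions on the observations and actions of all agents while each actor stays decentralized, can be sketched as follows. Layer sizes and naming are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    """Sketch of an MADDPG-style centralized critic: Q conditions on the
    joint observation and the joint action of all agents, which removes the
    non-stationarity each agent would otherwise face. Sizes are illustrative."""

    def __init__(self, joint_obs_dim, joint_act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, joint_obs, joint_actions):
        # joint_obs: (batch, joint_obs_dim); joint_actions: (batch, joint_act_dim)
        return self.net(torch.cat([joint_obs, joint_actions], dim=-1))
```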
Zheqi Zhu, Shuo Wan, Pingyi Fan (2020)
As an emerging technique, mobile edge computing (MEC) introduces a new processing scheme for various distributed communication-computing systems such as the industrial Internet of Things (IoT), vehicular communication, and smart cities. In this work, we mainly focus on the timeliness of MEC systems, where the freshness of data and computation tasks is significant. First, we formulate a class of age-sensitive MEC models and define the average age of information (AoI) minimization problems of interest. Then, a novel policy-based multi-agent deep reinforcement learning (RL) framework, called heterogeneous multi-agent actor-critic (H-MAAC), is proposed as a paradigm for joint collaboration in the investigated MEC systems, where edge devices and the center controller learn interactive strategies through their own observations. To improve system performance, we develop the corresponding online algorithm by introducing an edge federated learning mode into the multi-agent cooperation, whose advantages in learning convergence can be guaranteed theoretically. To the best of our knowledge, this is the first joint MEC collaboration algorithm that combines the edge federated mode with multi-agent actor-critic reinforcement learning. Furthermore, we evaluate the proposed approach and compare it with classical RL-based methods. The proposed framework not only outperforms the baselines on average system age, but also improves the stability of the training process. Besides, the simulation results provide some innovative perspectives for system design under edge federated collaboration.
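The edge-federated cooperation mode mentioned above presumably involves periodically aggregating the edge devices' learned parameters. The snippet below shows a plain federated-averaging step as one plausible reading; whether H-MAAC weights devices uniformly, and exactly which parameters it averages, are assumptions here.

```python
import numpy as np

def federated_average(param_sets, weights=None):
    """Hedged sketch of a federated-averaging step: each entry of
    param_sets maps parameter names to numpy arrays for one edge device;
    the averaged parameters would then be broadcast back to the devices."""
    if weights is None:
        weights = [1.0 / len(param_sets)] * len(param_sets)
    return {name: sum(w * params[name] for w, params in zip(weights, param_sets))
            for name in param_sets[0]}

# Usage with two toy devices:
device_a = {"w": np.ones(3), "b": np.zeros(1)}
device_b = {"w": np.full(3, 3.0), "b": np.ones(1)}
print(federated_average([device_a, device_b]))   # w -> [2. 2. 2.], b -> [0.5]
```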
This paper extends off-policy reinforcement learning to the multi-agent case, in which a set of networked agents, communicating with their neighbors according to a time-varying graph, collaboratively evaluates and improves a target policy while following a distinct behavior policy. To this end, the paper develops a multi-agent version of emphatic temporal difference learning for off-policy policy evaluation and proves convergence under linear function approximation. The paper then leverages this result, in conjunction with a novel multi-agent off-policy policy gradient theorem and recent work in both multi-agent on-policy and single-agent off-policy actor-critic methods, to develop and give convergence guarantees for a new multi-agent off-policy actor-critic algorithm.
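For reference, the single-agent emphatic TD(lambda) update with linear function approximation, which the paper extends to networked agents, looks roughly as follows. Variable names, step size, and the constant interest are our choices, not the paper's notation.

```python
import numpy as np

def emphatic_td_step(theta, e, F, phi, phi_next, reward, rho, rho_prev,
                     alpha=0.05, gamma=0.99, lam=0.0, interest=1.0):
    """One step of single-agent emphatic TD(lambda) with linear features.

    rho      : importance ratio pi(a_t|s_t) / b(a_t|s_t) at the current step.
    rho_prev : the same ratio from the previous step (drives the follow-on trace).
    phi, phi_next : feature vectors of the current and next state.
    """
    F = rho_prev * gamma * F + interest            # follow-on trace
    M = lam * interest + (1.0 - lam) * F           # emphasis
    e = rho * (gamma * lam * e + M * phi)          # emphatic eligibility trace
    delta = reward + gamma * theta @ phi_next - theta @ phi   # TD error
    theta = theta + alpha * delta * e              # weight update
    return theta, e, F
```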