Policy Perturbation via Noisy Advantage Values for Cooperative Multi-agent Actor-Critic methods


الملخص بالإنكليزية

Recent works have applied the Proximal Policy Optimization (PPO) to the multi-agent cooperative tasks, such as Independent PPO (IPPO); and vanilla Multi-agent PPO (MAPPO) which has a centralized value function. However, previous literature shows that MAPPO may not perform as well as Independent PPO (IPPO) and the Fine-tuned QMIX on Starcraft Multi-Agent Challenge (SMAC). MAPPO-Feature-Pruned (MAPPO-FP) improves the performance of MAPPO by the carefully designed agent-specific features, which is is not friendly to algorithmic utility. By contrast, we find that MAPPO faces the problem of textit{The Policies Overfitting in Multi-agent Cooperation(POMAC)}, as they learn policies by the sampled shared advantage values. Then POMAC may lead to updating the multi-agent policies in a suboptimal direction and prevent the agents from exploring better trajectories. In this paper, to solve the POMAC problem, we propose two novel policy perturbation methods, i.e, Noisy-Value MAPPO (NV-MAPPO) and Noisy-Advantage MAPPO (NA-MAPPO), which disturb the advantage values via random Gaussian noise. The experimental results show that our methods outperform the Fine-tuned QMIX, MAPPO-FP, and achieves SOTA on SMAC without agent-specific features. We open-source the code at url{https://github.com/hijkzzz/noisy-mappo}.

تحميل البحث