Numerous deep reinforcement learning agents have been proposed, each with its own strengths and flaws. In this work, we present a Cooperative Heterogeneous Deep Reinforcement Learning (CHDRL) framework that learns a policy by integrating the advantages of heterogeneous agents. Specifically, we propose a cooperative learning framework that classifies heterogeneous agents into two classes: global agents and local agents. Global agents are off-policy agents that can utilize experiences from the other agents. Local agents are either on-policy agents or population-based evolutionary algorithm (EA) agents that can explore the local area effectively. We employ global agents, which are sample-efficient, to guide the learning of local agents so that local agents can benefit from sample-efficient agents while retaining their own advantages, e.g., stability. Global agents also benefit from effective local searches. Experimental studies on a range of continuous control tasks from the MuJoCo benchmark show that CHDRL achieves better performance than state-of-the-art baselines.
In this paper, we study how to learn an appropriate lane-changing strategy for autonomous vehicles using deep reinforcement learning. We show that the reward of the system should consider the overall traffic efficiency instead of the travel efficiency of an individual vehicle. In summary, cooperation, rather than competition, leads to a more harmonious and efficient traffic system.
Mean field control (MFC) is an effective way to mitigate the curse of dimensionality of cooperative multi-agent reinforcement learning (MARL) problems. This work considers a collection of $N_{\mathrm{pop}}$ heterogeneous agents that can be segregated into $K$ classes such that the $k$-th class contains $N_k$ homogeneous agents. We aim to prove approximation guarantees of the MARL problem for this heterogeneous system by its corresponding MFC problem. We consider three scenarios where the reward and transition dynamics of all agents are respectively taken to be functions of $(1)$ joint state and action distributions across all classes, $(2)$ individual distributions of each class, and $(3)$ marginal distributions of the entire population. We show that, in these cases, the $K$-class MARL problem can be approximated by MFC with errors given as $e_1=\mathcal{O}(\frac{\sqrt{|\mathcal{X}||\mathcal{U}|}}{N_{\mathrm{pop}}}\sum_{k}\sqrt{N_k})$, $e_2=\mathcal{O}(\sqrt{|\mathcal{X}||\mathcal{U}|}\sum_{k}\frac{1}{\sqrt{N_k}})$ and $e_3=\mathcal{O}\left(\sqrt{|\mathcal{X}||\mathcal{U}|}\left[\frac{A}{N_{\mathrm{pop}}}\sum_{k\in[K]}\sqrt{N_k}+\frac{B}{\sqrt{N_{\mathrm{pop}}}}\right]\right)$, respectively, where $A, B$ are some constants and $|\mathcal{X}|,|\mathcal{U}|$ are the sizes of state and action spaces of each agent. Finally, we design a Natural Policy Gradient (NPG) based algorithm that, in the three cases stated above, can converge to an optimal MARL policy within $\mathcal{O}(e_j)$ error with a sample complexity of $\mathcal{O}(e_j^{-3})$, $j\in\{1,2,3\}$, respectively.
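To get a feel for how the three error terms scale with the class sizes, the expressions inside the $\mathcal{O}(\cdot)$ can be evaluated numerically. The following is a minimal sketch, not part of the original work: the function names are hypothetical, the hidden constants of the $\mathcal{O}$ notation are dropped, and the constants $A$ and $B$ are set to 1 by default purely for illustration.

```python
import math


def e1_scale(class_sizes, n_states, n_actions):
    """Scaling term of e_1: sqrt(|X||U|) / N_pop * sum_k sqrt(N_k)."""
    n_pop = sum(class_sizes)
    root_sum = sum(math.sqrt(n) for n in class_sizes)
    return math.sqrt(n_states * n_actions) / n_pop * root_sum


def e2_scale(class_sizes, n_states, n_actions):
    """Scaling term of e_2: sqrt(|X||U|) * sum_k 1 / sqrt(N_k)."""
    inv_root_sum = sum(1.0 / math.sqrt(n) for n in class_sizes)
    return math.sqrt(n_states * n_actions) * inv_root_sum


def e3_scale(class_sizes, n_states, n_actions, a=1.0, b=1.0):
    """Scaling term of e_3; a and b stand in for the unspecified constants A, B."""
    n_pop = sum(class_sizes)
    root_sum = sum(math.sqrt(n) for n in class_sizes)
    bracket = a / n_pop * root_sum + b / math.sqrt(n_pop)
    return math.sqrt(n_states * n_actions) * bracket


if __name__ == "__main__":
    sizes = [100, 100]  # K = 2 classes with 100 agents each (illustrative)
    print(e1_scale(sizes, n_states=4, n_actions=1))
    print(e2_scale(sizes, n_states=4, n_actions=1))
    print(e3_scale(sizes, n_states=4, n_actions=1))
```

With a single class ($K=1$), the first two terms coincide, since both reduce to $\sqrt{|\mathcal{X}||\mathcal{U}|}/\sqrt{N_1}$.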
Communication is an important factor that enables agents to work cooperatively in multi-agent reinforcement learning (MARL). Most previous work uses continuous message communication, whose high representational capacity comes at the expense of interpretability. Allowing agents to learn their own discrete message communication protocol, emerging from a variety of domains, can increase interpretability for human designers and other agents. This paper proposes a method to generate discrete messages analogous to human languages and achieves communication through a broadcast-and-listen mechanism based on self-attention. We show that discrete message communication has performance comparable to continuous message communication but with a much smaller vocabulary size. Furthermore, we propose an approach that allows humans to interactively send discrete messages to agents.
We revisit residual algorithms in both model-free and model-based reinforcement learning settings. We propose the bidirectional target network technique to stabilize residual algorithms, yielding a residual version of DDPG that significantly outperforms vanilla DDPG in the DeepMind Control Suite benchmark. Moreover, we find the residual algorithm an effective approach to the distribution mismatch problem in model-based planning. Compared with the existing TD($k$) method, our residual-based method makes weaker assumptions about the model and yields a greater performance boost.
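For readers unfamiliar with residual algorithms, the sketch below contrasts the classical semi-gradient TD(0) update with the classical residual-gradient update for a linear value function. It is an illustration of the general idea only: it does not implement the bidirectional target network technique or the residual DDPG variant, and all function names are hypothetical.

```python
def td_error(w, phi_s, phi_next, reward, gamma):
    """Temporal-difference error for a linear value function V(s) = w . phi(s)."""
    v_s = sum(wi * xi for wi, xi in zip(w, phi_s))
    v_next = sum(wi * xi for wi, xi in zip(w, phi_next))
    return reward + gamma * v_next - v_s


def semi_gradient_step(w, phi_s, phi_next, reward, gamma, lr):
    """Semi-gradient TD(0): the bootstrapped target is treated as a constant."""
    delta = td_error(w, phi_s, phi_next, reward, gamma)
    return [wi + lr * delta * xs for wi, xs in zip(w, phi_s)]


def residual_gradient_step(w, phi_s, phi_next, reward, gamma, lr):
    """Residual gradient: descend the squared TD error through both V(s) and V(s')."""
    delta = td_error(w, phi_s, phi_next, reward, gamma)
    return [
        wi + lr * delta * (xs - gamma * xn)
        for wi, xs, xn in zip(w, phi_s, phi_next)
    ]


if __name__ == "__main__":
    w0 = [0.0, 0.0]
    args = dict(phi_s=[1.0, 0.0], phi_next=[0.0, 1.0], reward=1.0, gamma=0.5, lr=0.1)
    print(semi_gradient_step(w0, **args))      # only the weight for s changes
    print(residual_gradient_step(w0, **args))  # the weight for s' changes too
```

The only difference is the extra `- gamma * xn` term: the residual update also moves the value of the successor state, which is the source of both its convergence properties and its sensitivity.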
Training a multi-agent reinforcement learning (MARL) algorithm is more challenging than training a single-agent reinforcement learning algorithm, because the result of a multi-agent task strongly depends on the complex interactions among agents and their interactions with a stochastic and dynamic environment. We propose an algorithm that boosts MARL training using the biased action information of other agents based on a friend-or-foe concept. In a mixed cooperative-competitive environment, there are generally two groups of agents: cooperative agents and competitive agents. In the proposed algorithm, each agent updates its value function using its own action and the biased action information of other agents in the two groups. The biased joint action of cooperative agents is computed as the sum of their actual joint action and the imaginary cooperative joint action, obtained by assuming that all the cooperative agents jointly maximize the target agent's value function. The biased joint action of competitive agents can be computed similarly. Each agent then updates its own value function using the biased action information, resulting in a biased value function and a corresponding biased policy. Consequently, the biased policy of each agent inevitably recommends actions that cooperate and compete with other agents, thereby introducing more active interactions among agents and enhancing MARL policy learning. We empirically demonstrate that our algorithm outperforms existing algorithms in various mixed cooperative-competitive environments. Furthermore, the introduced biases gradually decrease as training proceeds, and the correction based on the imaginary assumption vanishes.
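The biased joint action described above can be sketched on a toy problem. This is a hedged illustration under several assumptions that are not stated in the abstract: the imaginary joint action is found by a search over a small candidate set, competitive agents are assumed to minimize the target agent's value (following the friend-or-foe idea), and the `weight` parameter is a hypothetical knob (1.0 reproduces the plain sum). All names are illustrative.

```python
def imaginary_joint_action(q_value, own_action, candidates, cooperative=True):
    """Pick the joint action of the other agents that is best (friends)
    or worst (foes, an assumption) for the target agent's value function."""
    pick = max if cooperative else min
    return pick(candidates, key=lambda joint: q_value(own_action, joint))


def biased_joint_action(actual, imaginary, weight=1.0):
    """Sum of the actual joint action and the (optionally scaled) imaginary one."""
    return tuple(a + weight * b for a, b in zip(actual, imaginary))


if __name__ == "__main__":
    # Toy value function: the target agent prefers the others' actions
    # to sum to its own action.
    def q_value(own, joint):
        return -(sum(joint) - own) ** 2

    candidates = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.5)]
    actual = (0.2, -0.1)

    friend = imaginary_joint_action(q_value, 1.0, candidates, cooperative=True)
    foe = imaginary_joint_action(q_value, 1.0, candidates, cooperative=False)
    print(biased_joint_action(actual, friend))
    print(biased_joint_action(actual, foe))
```

Setting `weight` to 0 recovers the actual joint action, which is one simple way to picture a bias that vanishes over training; how the bias actually decays in the proposed algorithm is not specified here.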