Multi-agent reinforcement learning (MARL) has become effective in tackling discrete cooperative game scenarios. However, MARL has yet to penetrate settings beyond those modelled by team and zero-sum games, confining it to a small subset of multi-agent systems. In this paper, we introduce a new generation of MARL learners that can handle nonzero-sum payoff structures and continuous settings. In particular, we study the MARL problem in a class of games known as stochastic potential games (SPGs) with continuous state-action spaces. Unlike cooperative games, in which all agents share a common reward, SPGs are capable of modelling real-world scenarios where agents seek to fulfil their individual goals. We prove theoretically that our learning method, SPot-AC, enables independent agents to learn Nash equilibrium strategies in polynomial time. We demonstrate that our framework tackles previously unsolvable tasks such as Coordination Navigation and large selfish routing games, and that it outperforms state-of-the-art MARL baselines such as MADDPG and COMIX in such scenarios.
Decentralized team problems where players have asymmetric information about the state of the underlying stochastic system have been actively studied, but games between such teams are less understood. We consider a general model of zero-sum stochastic games between two competing teams. This model subsumes many previously considered team and zero-sum game models. For this general model, we provide bounds on the upper (min-max) and lower (max-min) values of the game. Furthermore, if the upper and lower values of the game are identical (i.e., if the game has a value), our bounds coincide with the value of the game. Our bounds are obtained using two dynamic programs based on a sufficient statistic known as the common information belief (CIB). We also identify certain information structures in which only the minimizing team controls the evolution of the CIB. In these cases, we show that one of our CIB-based dynamic programs can be used to find the min-max strategy (in addition to the min-max value). We propose an approximate dynamic programming approach for computing the values (and the strategy when applicable) and illustrate our results with an example.
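To make the distinction between the upper (min-max) and lower (max-min) values concrete, here is a minimal sketch for a one-shot matrix game restricted to pure strategies. It is only an illustration of the two quantities, not the CIB-based dynamic programs of the abstract; the function names and the example matrices are hypothetical, and entries are read as the cost paid by the minimizing (row) player to the maximizing (column) player.

```python
def upper_value(cost):
    """Min-max value: the minimizer commits to a row, then the maximizer picks a column."""
    return min(max(row) for row in cost)


def lower_value(cost):
    """Max-min value: the maximizer commits to a column, then the minimizer picks a row."""
    columns = zip(*cost)  # transpose so that each tuple is one column
    return max(min(col) for col in columns)


# Hypothetical example 1: matching pennies has no pure-strategy value,
# so the two quantities differ.
matching_pennies = [[1, -1],
                    [-1, 1]]

# Hypothetical example 2: a matrix with a saddle point, where they coincide.
saddle = [[3, 1],
          [4, 2]]

for name, game in [("matching pennies", matching_pennies), ("saddle", saddle)]:
    print(name, "lower =", lower_value(game), "upper =", upper_value(game))
```

The lower value never exceeds the upper value; when the two coincide, as in the second matrix, the game has a value in the sense used in the abstract.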
In this paper we deal with the problem of existence of a smooth solution of the Hamilton-Jacobi-Bellman-Isaacs (HJBI for short) system of equations associated with nonzero-sum stochastic differential games. We consider the problem in unbounded domains, for both continuous and discontinuous generators. In each case we show the existence of a smooth solution of the system. As a consequence, we show that the game has smooth Nash payoffs which are given by means of the solution of the HJBI system and the stochastic process which governs the dynamics of the controlled system.
In this paper, we propose Posterior Sampling Reinforcement Learning for Zero-sum Stochastic Games (PSRL-ZSG), the first online learning algorithm that achieves a Bayesian regret bound of $O(HS\sqrt{AT})$ in infinite-horizon zero-sum stochastic games with the average-reward criterion. Here $H$ is an upper bound on the span of the bias function, $S$ is the number of states, $A$ is the number of joint actions and $T$ is the horizon. We consider the online setting where the opponent cannot be controlled and can follow an arbitrary time-adaptive, history-dependent strategy. This improves the best existing regret bound of $O(\sqrt[3]{DS^2AT^2})$ by Wei et al. (2017) under the same assumption and matches the theoretical lower bound in $A$ and $T$.
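The difference between the two rates can be seen with a short arithmetic sketch. It evaluates the expressions inside the two $O(\cdot)$ bounds with constants and any hidden factors dropped; the function names and the numeric values of $H$, $D$, $S$ and $A$ are arbitrary assumptions chosen for illustration (the abstract does not define $D$), so only the scaling in $T$ is meaningful.

```python
import math


def psrl_zsg_rate(H, S, A, T):
    """Expression inside O(H S sqrt(A T)), constants dropped."""
    return H * S * math.sqrt(A * T)


def prior_rate(D, S, A, T):
    """Expression inside O((D S^2 A T^2)^(1/3)), constants dropped."""
    return (D * S**2 * A * T**2) ** (1.0 / 3.0)


# Arbitrary illustrative values; they are not taken from the paper.
H, D, S, A = 5.0, 5.0, 10, 4

for T in (10**3, 10**6, 10**9):
    new_per_step = psrl_zsg_rate(H, S, A, T) / T  # decays like T^(-1/2)
    old_per_step = prior_rate(D, S, A, T) / T     # decays like T^(-1/3)
    print(f"T={T:>10}  sqrt-rate/T={new_per_step:.3e}  cube-root-rate/T={old_per_step:.3e}")
```

Multiplying $T$ by 64 multiplies the first expression by $64^{1/2} = 8$ and the second by $64^{2/3} = 16$, so the gap between the two bounds widens as the horizon grows.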
We study a two-player nonzero-sum stochastic differential game where one player controls the state variable via additive impulses while the other player can stop the game at any time. The main goal of this work is to characterize Nash equilibria through a verification theorem, which identifies a new system of quasi-variational inequalities whose solution gives the equilibrium payoffs with the corresponding strategies. Moreover, we apply the verification theorem to a game with a one-dimensional state variable, evolving as a scaled Brownian motion, and with linear payoff and costs for both players. Two types of Nash equilibrium are fully characterized, i.e. semi-explicit expressions for the equilibrium strategies and associated payoffs are provided. Both equilibria are of threshold type: in one equilibrium the players' interventions are not simultaneous, while in the other the first player induces her competitor to stop the game. Finally, we provide some numerical results describing the qualitative properties of both types of equilibrium.
Measuring and promoting policy diversity is critical for solving games with strong non-transitive dynamics where strategic cycles exist and there is no consistent winner (e.g., Rock-Paper-Scissors). With that in mind, maintaining a pool of diverse policies via open-ended learning is an attractive solution, which can generate auto-curricula to avoid being exploited. However, in conventional open-ended learning algorithms, there are no widely accepted definitions of diversity, making it hard to construct and evaluate diverse policies. In this work, we summarize previous concepts of diversity and work towards offering a unified measure of diversity in multi-agent open-ended learning that includes all elements of Markov games, based on both Behavioral Diversity (BD) and Response Diversity (RD). At the trajectory distribution level, we re-define BD in the state-action space as the discrepancies of occupancy measures. For the reward dynamics, we propose RD to characterize diversity through the responses of policies when encountering different opponents. We also show that many current diversity measures fall into one of the categories of BD or RD but not both. With this unified diversity measure, we design the corresponding diversity-promoting objective and population effectivity when seeking the best responses in open-ended learning. We validate our methods in both relatively simple games, such as a matrix game and a non-transitive mixture model, and the complex \textit{Google Research Football} environment. The population found by our methods achieves the lowest exploitability and highest population effectivity in the matrix game and the non-transitive mixture model, as well as the largest goal difference when interacting with opponents of various levels in \textit{Google Research Football}.
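The non-transitive cycle in Rock-Paper-Scissors and the notion of exploitability can be illustrated with a minimal sketch. It uses the standard definition of exploitability for a symmetric zero-sum matrix game (the amount a best-responding opponent gains against a fixed mixed strategy) and is not the BD or RD measure proposed in the abstract; the payoff matrix layout and the function name are assumptions made for this example.

```python
# Payoff to the row player; rows and columns are ordered Rock, Paper, Scissors.
RPS = [
    [0, -1, 1],   # Rock loses to Paper, beats Scissors
    [1, 0, -1],   # Paper beats Rock, loses to Scissors
    [-1, 1, 0],   # Scissors loses to Rock, beats Paper
]


def exploitability(strategy, payoff=RPS):
    """Gain of a best-responding opponent against a mixed row strategy.

    The game is symmetric and zero-sum with value 0, so the result is
    minus the row player's worst-case expected payoff.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    expected = [
        sum(strategy[i] * payoff[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]
    return -min(expected)


always_rock = [1.0, 0.0, 0.0]
uniform = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]

print("always Rock:", exploitability(always_rock))
print("uniform mix:", exploitability(uniform))
```

Every pure strategy is fully exploitable because some other pure strategy beats it, which is the strategic cycle referred to above, whereas the uniform mixture cannot be exploited at all.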