Fictitious play in zero-sum stochastic games

Published by: Muhammed Omer Sayin

Publication date: 2020

Research field: Informatics Engineering

Language: English

Authors: Muhammed O. Sayin, Francesca Parise, Asuman Ozdaglar

Abstract

We present fictitious play dynamics for stochastic games and analyze their convergence properties in zero-sum stochastic games. In these dynamics, players form beliefs about the opponent's strategy and about their own continuation payoff (Q-function), and play a greedy best response using the estimated continuation payoffs. Players update their beliefs from observations of the opponent's actions. A key property of the learning dynamics is that the beliefs on Q-functions are updated at a slower timescale than the beliefs on strategies. We show that, in both the model-based case and the model-free case (without knowledge of player payoff functions and state transition probabilities), the beliefs on strategies converge to a stationary mixed Nash equilibrium of the zero-sum stochastic game.
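The two-timescale idea in the abstract can be illustrated with a minimal model-based sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the function names, the data layout (`r[s][a][b]` is the stage payoff to player 1, `P[s][a][b]` is a list of next-state probabilities), the step-size schedules, and the choice to refresh the Q-beliefs for every joint action at the current state.

```python
import math
import random


def action_values_p1(Q_s, belief_b):
    """Player 1's expected Q-value of each action a under a belief over b."""
    return [sum(q * p for q, p in zip(row, belief_b)) for row in Q_s]


def action_values_p2(Q_s, belief_a):
    """Player 2's expected Q-value of each action b under a belief over a."""
    return [sum(Q_s[a][b] * belief_a[a] for a in range(len(Q_s)))
            for b in range(len(Q_s[0]))]


def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def fictitious_play(r, P, gamma, steps, seed=0):
    """Hypothetical two-timescale fictitious play for a zero-sum stochastic game."""
    rng = random.Random(seed)
    n_s, n_a, n_b = len(r), len(r[0]), len(r[0][0])
    # Beliefs on the opponent's stationary mixed strategy, one per state.
    pi1 = [[1.0 / n_a] * n_a for _ in range(n_s)]  # held by player 2
    pi2 = [[1.0 / n_b] * n_b for _ in range(n_s)]  # held by player 1
    # Each player's belief on its own Q-function, indexed [s][a][b].
    Q1 = [[[0.0] * n_b for _ in range(n_a)] for _ in range(n_s)]
    Q2 = [[[0.0] * n_b for _ in range(n_a)] for _ in range(n_s)]
    visits = [0] * n_s
    s = 0
    for _ in range(steps):
        visits[s] += 1
        n = visits[s]
        alpha = 1.0 / (n + 1)                     # fast: strategy beliefs
        beta = 1.0 / (1.0 + n * math.log(n + 1))  # slow: beta / alpha -> 0
        # Greedy best responses to the current beliefs.
        a = argmax(action_values_p1(Q1[s], pi2[s]))
        b = argmax(action_values_p2(Q2[s], pi1[s]))
        # Continuation values implied by the current beliefs.
        v1 = [max(action_values_p1(Q1[t], pi2[t])) for t in range(n_s)]
        v2 = [max(action_values_p2(Q2[t], pi1[t])) for t in range(n_s)]
        # Slow update of the Q-beliefs using the known model (r, P).
        for i in range(n_a):
            for j in range(n_b):
                cont1 = sum(p * v for p, v in zip(P[s][i][j], v1))
                cont2 = sum(p * v for p, v in zip(P[s][i][j], v2))
                Q1[s][i][j] += beta * (r[s][i][j] + gamma * cont1 - Q1[s][i][j])
                Q2[s][i][j] += beta * (-r[s][i][j] + gamma * cont2 - Q2[s][i][j])
        # Fast update of the strategy beliefs from the observed actions.
        pi1[s] = [(1 - alpha) * p + alpha * (1.0 if i == a else 0.0)
                  for i, p in enumerate(pi1[s])]
        pi2[s] = [(1 - alpha) * p + alpha * (1.0 if j == b else 0.0)
                  for j, p in enumerate(pi2[s])]
        # Sample the next state from the transition kernel.
        s = rng.choices(range(n_s), weights=P[s][a][b])[0]
    return pi1, pi2, Q1, Q2
```

The sketch only shows the structure of the updates: strategy beliefs move with step `alpha`, Q-beliefs with the smaller step `beta`. It makes no claim to reproduce the paper's exact step sizes or convergence guarantees.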

Muhammed O. Sayin, Kaiqing Zhang, David S. Leslie (2021)

We study multi-agent reinforcement learning (MARL) in infinite-horizon discounted zero-sum Markov games. We focus on the practical but challenging setting of decentralized MARL, where agents make decisions without coordination by a centralized controller, based only on their own payoffs and the local actions they execute. The agents need not observe the opponent's actions or payoffs, may even be oblivious to the presence of the opponent, and need not be aware of the zero-sum structure of the underlying game, a setting referred to as radically uncoupled in the literature on learning in games. In this paper, we develop, for the first time, radically uncoupled Q-learning dynamics that are both rational and convergent: the learning dynamics converge to the best response to the opponent's strategy when the opponent follows an asymptotically stationary strategy, and the value function estimates converge to the payoffs at a Nash equilibrium when both agents adopt the dynamics. The key challenge in this decentralized setting is the non-stationarity of the learning environment from an agent's perspective, since both her own payoffs and the system evolution depend on the actions of other agents, and each agent adapts her policy simultaneously and independently. To address this issue, we develop two-timescale learning dynamics in which each agent updates her local Q-function and value function estimates concurrently, with the latter updated at a slower timescale.

Computer Science and Game Theory, Machine Learning, Multiagent Systems

Towards General Function Approximation in Zero-Sum Markov Games

Baihe Huang, Jason D. Lee, Zhaoran Wang (2021)

This paper considers two-player zero-sum finite-horizon Markov games with simultaneous moves. The study focuses on the challenging settings where the value function or the model is parameterized by general function classes. Provably efficient algorithms are developed for both the decoupled and the coordinated settings. In the decoupled setting, where the agent controls a single player and plays against an arbitrary opponent, we propose a new model-free algorithm. The sample complexity is governed by the Minimax Eluder dimension, a new dimension of the function class in Markov games. As a special case, this method improves on the state-of-the-art algorithm by a $\sqrt{d}$ factor in the regret when the reward function and transition kernel are parameterized with $d$-dimensional linear features. In the coordinated setting, where both players are controlled by the agent, we propose a model-based algorithm and a model-free algorithm. For the model-based algorithm, we prove that the sample complexity can be bounded by a generalization of Witness rank to Markov games. The model-free algorithm enjoys a $\sqrt{K}$-regret upper bound, where $K$ is the number of episodes. Our algorithms are based on new techniques of alternate optimism.

Computer Science and Game Theory, Machine Learning

Convergence of Deep Fictitious Play for Stochastic Differential Games

Jiequn Han, Ruimeng Hu, Jihao Long (2020)

Stochastic differential games have been used extensively to model competition among agents in finance, for instance in P2P lending platforms from the Fintech industry, in the banking system for systemic risk, and in insurance markets. The recently proposed machine learning algorithm, deep fictitious play, provides a novel and efficient tool for finding the Markovian Nash equilibrium of large $N$-player asymmetric stochastic differential games [J. Han and R. Hu, Mathematical and Scientific Machine Learning Conference, pages 221-245, PMLR, 2020]. By incorporating the idea of fictitious play, the algorithm decouples the game into $N$ sub-optimization problems and identifies each player's optimal strategy with the deep backward stochastic differential equation (BSDE) method, in parallel and repeatedly. In this paper, we prove the convergence of deep fictitious play (DFP) to the true Nash equilibrium. We also show that the strategy based on DFP forms an $\epsilon$-Nash equilibrium. We generalize the algorithm by proposing a new approach to decoupling the games, and present numerical results for large population games that show the empirical convergence of the algorithm beyond the technical assumptions in the theorems.

Optimization and Control, Computer Science and Game Theory, Machine Learning

Learning to Resolve Alliance Dilemmas in Many-Player Zero-Sum Games

Edward Hughes, Thomas W. Anthony, Tom Eccles (2020)

Zero-sum games have long guided artificial intelligence research, since they possess both a rich strategy space of best responses and a clear evaluation metric. What's more, competition is a vital mechanism in many real-world multi-agent systems capable of generating intelligent innovations: Darwinian evolution, the market economy and the AlphaZero algorithm, to name a few. In two-player zero-sum games, the challenge is usually viewed as finding Nash equilibrium strategies, safeguarding against exploitation regardless of the opponent. While this captures the intricacies of chess or Go, it avoids the notion of cooperation with co-players, a hallmark of the major transitions leading from unicellular organisms to human civilization. Beyond two players, alliance formation often confers an advantage; however, this requires trust, namely the promise of mutual cooperation in the face of incentives to defect. Successful play therefore requires adaptation to co-players rather than the pursuit of non-exploitability. Here we argue that a systematic study of many-player zero-sum games is a crucial element of artificial intelligence research. Using symmetric zero-sum matrix games, we demonstrate formally that alliance formation may be seen as a social dilemma, and empirically that naive multi-agent reinforcement learning therefore fails to form alliances. We introduce a toy model of economic competition, and show how reinforcement learning may be augmented with a peer-to-peer contract mechanism to discover and enforce alliances. Finally, we generalize our agent model to incorporate temporally extended contracts, presenting opportunities for further work.

Computer Science and Game Theory, Machine Learning, Multiagent Systems

(Local) Non-Asymptotic Analysis of Logistic Fictitious Play for Two-Player Zero-Sum Games and Its Deterministic Variant

Renbo Zhao, Qiuyun Zhu (2021)

We conduct a local non-asymptotic analysis of the logistic fictitious play (LFP) algorithm, and show that, with high probability, this algorithm converges locally at rate $O(1/t)$. To achieve this, we first develop a global non-asymptotic analysis of the deterministic variant of LFP, which we call DLFP, and derive a class of convergence rates based on different step-sizes. We then incorporate a particular form of stochastic noise into the analysis of DLFP, and obtain the local convergence rate of LFP. As a result of independent interest, we extend DLFP to solve a class of strongly convex composite optimization problems. We show that although the resulting algorithm is a simple variant of the generalized Frank-Wolfe method in Nesterov [1, Section 5], somewhat surprisingly, it enjoys a significantly improved convergence rate.

Optimization and Control, Probability
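For readers unfamiliar with logistic fictitious play, mentioned in the last abstract above, the following is a generic sketch of a deterministic smoothed fictitious play iteration on a two-player zero-sum matrix game. It is not the algorithm as specified by Zhao and Zhu: the function names, the temperature parameter `eta`, and the step size `1 / (t + 1)` are assumptions chosen for illustration.

```python
import math


def softmax(values, eta):
    """Logit (softmax) choice probabilities with temperature eta > 0."""
    m = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - m) / eta) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def dlfp(A, x, y, eta, steps):
    """Hypothetical deterministic logistic fictitious play on matrix game A.

    The row player maximizes x^T A y and the column player maximizes
    -x^T A y. Each belief moves toward the logit response to the other.
    """
    n, m = len(A), len(A[0])
    for t in range(1, steps + 1):
        step = 1.0 / (t + 1)  # assumed step-size schedule
        row_payoffs = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
        col_payoffs = [-sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
        bx = softmax(row_payoffs, eta)  # smoothed best response of the row player
        by = softmax(col_payoffs, eta)  # smoothed best response of the column player
        x = [(1 - step) * xi + step * bi for xi, bi in zip(x, bx)]
        y = [(1 - step) * yj + step * bj for yj, bj in zip(y, by)]
    return x, y
```

On matching pennies, `dlfp([[1, -1], [-1, 1]], [0.9, 0.1], [0.2, 0.8], 1.0, 5000)` drives both beliefs toward the uniform strategy, which is the logit equilibrium of that game by symmetry. The stochastic variant would sample an action from the logit response instead of averaging in the full distribution.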