Iterative Empirical Game Solving via Single Policy Best Response

89 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Max Smith

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Max Olan Smith - Thomas Anthony - Michael P. Wellman

أنظمة متعددة العملاء

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Policy-Space Response Oracles (PSRO) is a general algorithmic framework for learning policies in multiagent systems by interleaving empirical game analysis with deep reinforcement learning (Deep RL). At each iteration, Deep RL is invoked to train a best response to a mixture of opponent policies. The repeated application of Deep RL poses an expensive computational burden as we look to apply this algorithm to more complex domains. We introduce two variations of PSRO designed to reduce the amount of simulation required during Deep RL training. Both algorithms modify how PSRO adds new policies to the empirical game, based on learned responses to a single opponent policy. The first, Mixed-Oracles, transfers knowledge from previous iterations of Deep RL, requiring training only against the opponents newest policy. The second, Mixed-Opponents, constructs a pure-strategy opponent by mixing existing strategys action-value estimates, instead of their policies. Learning against a single policy mitigates variance in state outcomes that is induced by an unobserved distribution of opponents. We empirically demonstrate that these algorithms substantially reduce the amount of simulation during training required by PSRO, while producing equivalent or better solutions to the game.

قيم البحث

94 - Yongzhao Wang , Qiurui Ma , Michael P. Wellman 2021

In empirical game-theoretic analysis (EGTA), game models are extended iteratively through a process of generating new strategies based on learning from experience with prior strategies. The strategy exploration problem in EGTA is how to direct this p rocess so to construct effective models with minimal iteration. A variety of approaches have been proposed in the literature, including methods based on classic techniques and novel concepts. Comparing the performance of these alternatives can be surprisingly subtle, depending sensitively on criteria adopted and measures employed. We investigate some of the methodological considerations in evaluating strategy exploration, defining key distinctions and identifying a few general principles based on examples and experimental observations. In particular, we emphasize the fact that empirical games create a space of strategies that should be evaluated as a whole. Based on this fact, we suggest that the minimum regret constrained profile (MRCP) provides a particularly robust basis for evaluating a space of strategies, and propose a local search method for MRCP that outperforms previous approaches. However, the computation of MRCP is not always feasible especially in large games. In this scenario, we highlight consistency considerations for comparing across different approaches. Surprisingly, we find that recent works violate these considerations that are necessary for evaluation, which may result in misleading conclusions on the performance of different approaches. For proper evaluation, we propose a new evaluation scheme and demonstrate that our scheme can reveal the true learning performance of different approaches compared to previous evaluation methods.

أنظمة متعددة العملاء

Iterative Best Response for Multi-Body Asset-Guarding Games

136 - Emmanuel Sin , Murat Arcak , Douglas Philbrick 2020

We present a numerical approach to finding optimal trajectories for players in a multi-body, asset-guarding game with nonlinear dynamics and non-convex constraints. Using the Iterative Best Response (IBR) scheme, we solve for each players optimal str ategy assuming the other players trajectories are known and fixed. Leveraging recent advances in Sequential Convex Programming (SCP), we use SCP as a subroutine within the IBR algorithm to efficiently solve an approximation of each players constrained trajectory optimization problem. We apply the approach to an asset-guarding game example involving multiple pursuers and a single evader (i.e., n-versus-1 engagements). Resulting evader trajectories are tested in simulation to verify successful evasion against pursuers using conventional intercept guidance laws.

أنظمة وتحكم أنظمة متعددة العملاء أنظمة وتحكم

Learning to Play No-Press Diplomacy with Best Response Policy Iteration

226 - Thomas Anthony , Tom Eccles , Andrea Tacchetti 2020

Recent advances in deep reinforcement learning (RL) have led to considerable progress in many 2-player zero-sum games, such as Go, Poker and Starcraft. The purely adversarial nature of such games allows for conceptually simple and principled applicat ion of RL methods. However real-world settings are many-agent, and agent interactions are complex mixtures of common-interest and competitive aspects. We consider Diplomacy, a 7-player board game designed to accentuate dilemmas resulting from many-agent interactions. It also features a large combinatorial action space and simultaneous moves, which are challenging for RL algorithms. We propose a simple yet effective approximate best response operator, designed to handle large combinatorial action spaces and simultaneous moves. We also introduce a family of policy iteration methods that approximate fictitious play. With these methods, we successfully apply RL to Diplomacy: we show that our agents convincingly outperform the previous state-of-the-art, and game theoretic equilibrium analysis shows that the new process yields consistent improvements.

التعلم الآلي الذكاء الاصطناعي علوم الكمبيوتر ونظرية الألعاب

Learning Convex Partitions and Computing Game-theoretic Equilibria from Best Response Queries

125 - Paul W. Goldberg , Francisco J. Marmolejo-Cossio 2018

Suppose that an $m$-simplex is partitioned into $n$ convex regions having disjoint interiors and distinct labels, and we may learn the label of any point by querying it. The learning objective is to know, for any point in the simplex, a label that oc curs within some distance $epsilon$ from that point. We present two algorithms for this task: Constant-Dimension Generalised Binary Search (CD-GBS), which for constant $m$ uses $poly(n, log left( frac{1}{epsilon} right))$ queries, and Constant-Region Generalised Binary Search (CR-GBS), which uses CD-GBS as a subroutine and for constant $n$ uses $poly(m, log left( frac{1}{epsilon} right))$ queries. We show via Kakutanis fixed-point theorem that these algorithms provide bounds on the best-response query complexity of computing approximate well-supported equilibria of bimatrix games in which one of the players has a constant number of pure strategies. We also partially extend our results to games with multiple players, establishing further query complexity bounds for computing approximate well-supported equilibria in this setting.

علوم الكمبيوتر ونظرية الألعاب التعلم الآلي

Policy Perturbation via Noisy Advantage Values for Cooperative Multi-agent Actor-Critic methods

137 - Jian Hu , Siyue Hu , Shih-wei Liao 2021

Recent works have applied the Proximal Policy Optimization (PPO) to the multi-agent cooperative tasks, such as Independent PPO (IPPO); and vanilla Multi-agent PPO (MAPPO) which has a centralized value function. However, previous literature shows that MAPPO may not perform as well as Independent PPO (IPPO) and the Fine-tuned QMIX on Starcraft Multi-Agent Challenge (SMAC). MAPPO-Feature-Pruned (MAPPO-FP) improves the performance of MAPPO by the carefully designed agent-specific features, which is is not friendly to algorithmic utility. By contrast, we find that MAPPO faces the problem of textit{The Policies Overfitting in Multi-agent Cooperation(POMAC)}, as they learn policies by the sampled shared advantage values. Then POMAC may lead to updating the multi-agent policies in a suboptimal direction and prevent the agents from exploring better trajectories. In this paper, to solve the POMAC problem, we propose two novel policy perturbation methods, i.e, Noisy-Value MAPPO (NV-MAPPO) and Noisy-Advantage MAPPO (NA-MAPPO), which disturb the advantage values via random Gaussian noise. The experimental results show that our methods outperform the Fine-tuned QMIX, MAPPO-FP, and achieves SOTA on SMAC without agent-specific features. We open-source the code at url{https://github.com/hijkzzz/noisy-mappo}.

أنظمة متعددة العملاء

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

الجامعة العربية الدولية الخاصة

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Iterative Empirical Game Solving via Single Policy Best Response

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً