
Let's be Honest: An Optimal No-Regret Framework for Zero-Sum Games

Published by Mehmet Fatih Şahin
Publication date: 2018
Research field: Informatics Engineering
Research language: English





We revisit the problem of solving two-player zero-sum games in the decentralized setting. We propose a simple algorithmic framework that simultaneously achieves the best rates for honest regret as well as adversarial regret, and in addition resolves the open problem of removing the logarithmic terms in convergence to the value of the game. We achieve this goal in three steps. First, we provide a novel analysis of optimistic mirror descent (OMD), showing that it can be modified to guarantee fast convergence for both honest regret and the value of the game when the players are playing collaboratively. Second, we propose a new algorithm, dubbed robust optimistic mirror descent (ROMD), which attains optimal adversarial regret without knowing the time horizon beforehand. Finally, we propose a simple signaling scheme, which enables us to bridge OMD and ROMD to achieve the best of both worlds. Numerical examples are presented to support our theoretical claims and show that our non-adaptive ROMD algorithm can be competitive with OMD with adaptive step-size selection.
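To make the OMD ingredient concrete, the following is a minimal sketch of the textbook optimistic multiplicative-weights update (OMD with an entropic regularizer) in self-play on a small matrix game. It is not the modified OMD, ROMD, or signaling scheme proposed in the paper; the payoff matrix, the step size `eta`, the horizon `steps`, and all function names are illustrative assumptions.

```python
import math

def normalize(w):
    """Rescale non-negative weights so they sum to one."""
    total = sum(w)
    return [wi / total for wi in w]

def optimistic_mwu(A, eta=0.1, steps=2000):
    """Self-play for min_x max_y x^T A y; returns time-averaged strategies.

    Assumption: fixed step size and entropic regularizer (textbook variant).
    """
    n, m = len(A), len(A[0])
    x = [1.0 / n] * n           # row player (minimizer)
    y = [1.0 / m] * m           # column player (maximizer)
    g_prev = [0.0] * n          # previous loss vector seen by the row player
    h_prev = [0.0] * m          # previous gain vector seen by the column player
    x_sum, y_sum = [0.0] * n, [0.0] * m
    for _ in range(steps):
        x_sum = [s + xi for s, xi in zip(x_sum, x)]
        y_sum = [s + yj for s, yj in zip(y_sum, y)]
        g = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]  # A y
        h = [sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]  # A^T x
        # Optimistic step: the last gradient doubles as a prediction of the next one.
        x = normalize([xi * math.exp(-eta * (2 * gi - gp))
                       for xi, gi, gp in zip(x, g, g_prev)])
        y = normalize([yj * math.exp(eta * (2 * hj - hp))
                       for yj, hj, hp in zip(y, h, h_prev)])
        g_prev, h_prev = g, h
    return [s / steps for s in x_sum], [s / steps for s in y_sum]

def duality_gap(A, x, y):
    """Best-response gain of the column player minus best-response loss of the row player."""
    n, m = len(A), len(A[0])
    best_col = max(sum(A[i][j] * x[i] for i in range(n)) for j in range(m))
    best_row = min(sum(A[i][j] * y[j] for j in range(m)) for i in range(n))
    return best_col - best_row

if __name__ == "__main__":
    A = [[2.0, -1.0], [-1.0, 1.0]]   # illustrative 2x2 game, value 0.2
    x_bar, y_bar = optimistic_mwu(A)
    print(x_bar, y_bar, duality_gap(A, x_bar, y_bar))
```

The duality gap of the averaged strategies measures how far the pair is from the value of the game; a gap of zero corresponds to a Nash equilibrium.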




Read also

In this paper, we consider a distributed learning problem in a subnetwork zero-sum game, where agents are competing in different subnetworks. These agents are connected through time-varying graphs where each agent has its own cost function and can receive information from its neighbors. We propose a distributed mirror descent algorithm for computing a Nash equilibrium and establish a sublinear regret bound on the sequence of iterates when the graphs are uniformly strongly connected and the cost functions are convex-concave. Moreover, we prove its convergence with suitably selected diminishing stepsizes for a strictly convex-concave cost function. We also consider a constant step-size variant of the algorithm and establish an asymptotic error bound between the cost function values of running average actions and a Nash equilibrium. In addition, we apply the algorithm to compute a mixed-strategy Nash equilibrium in subnetwork zero-sum finite-strategy games, which have merely convex-concave (to be specific, multilinear) cost functions, and obtain a final-iteration convergence result and an ergodic convergence result, respectively, under different assumptions.
Yuanyuan Shi, Baosen Zhang (2019)
This paper examines the convergence of no-regret learning in Cournot games with continuous actions. Cournot games are the essential model for many socio-economic systems, where players compete by strategically setting their output quantity. We assume that players do not have full information of the game and thus cannot pre-compute a Nash equilibrium. Two types of feedback are considered: one is bandit feedback and the other is gradient feedback. To study the convergence of the induced sequence of play, we introduce the notion of convergence in measure, and show that the players' actual sequence of actions converges to the unique Nash equilibrium. In addition, our results naturally extend the no-regret learning algorithms' time-average regret bounds to obtain the final-iteration convergence rates. Together, our work presents significantly sharper convergence results for learning in games without strong assumptions on game property (e.g., monotonicity) and shows how exploiting the game information feedback can influence the convergence rates.
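As a rough illustration of the gradient-feedback setting, the sketch below runs projected online gradient ascent in a Cournot game. It assumes a linear inverse demand p = a - b * (total output) and a constant marginal cost c, which is a much more restrictive model than the abstract covers; the parameter values and the function name are hypothetical.

```python
def cournot_gradient_play(n_players=2, a=10.0, b=1.0, c=1.0, eta=0.1, steps=500):
    """Each firm ascends the gradient of its own profit q_i * (a - b * Q) - c * q_i.

    Assumption: linear demand and identical constant marginal cost.
    """
    q = [1.0] * n_players                      # initial output quantities
    for _ in range(steps):
        total = sum(q)
        # Partial derivative of firm i's profit with respect to its own quantity.
        grads = [a - c - b * total - b * qi for qi in q]
        # Gradient step followed by projection onto non-negative quantities.
        q = [max(0.0, qi + eta * gi) for qi, gi in zip(q, grads)]
    return q

if __name__ == "__main__":
    print(cournot_gradient_play())
```

Under these assumptions the unique Nash equilibrium is q_i = (a - c) / (b * (n + 1)), which equals 3 for the default parameters, so the final iterate can be checked against that closed form.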
We focus on the problem of finding an optimal strategy for a team of two players that faces an opponent in an imperfect-information zero-sum extensive-form game. Team members are not allowed to communicate during play but can coordinate before the game. In that setting, it is known that the best the team can do is sample a profile of potentially randomized strategies (one per player) from a joint (a.k.a. correlated) probability distribution at the beginning of the game. In this paper, we first provide new modeling results about computing such an optimal distribution by drawing a connection to a different literature on extensive-form correlation. Second, we provide an algorithm that computes such an optimal distribution by only using profiles where only one of the team members gets to randomize in each profile. We can also cap the number of such profiles we allow in the solution. This begets an anytime algorithm by increasing the cap. We find that often a handful of well-chosen such profiles suffices to reach optimal utility for the team. This enables team members to reach coordination through a relatively simple and understandable plan. Finally, inspired by this observation and leveraging theoretical concepts that we introduce, we develop an efficient column-generation algorithm for finding an optimal distribution for the team. We evaluate it on a suite of common benchmark games. It is three orders of magnitude faster than the prior state of the art on games that the latter can solve and it can also solve several games that were previously unsolvable.
Counterfactual Regret Minimization (CFR) is an efficient no-regret learning algorithm for decision problems modeled as extensive games. CFR's regret bounds depend on the requirement of perfect recall: players always remember information that was revealed to them and the order in which it was revealed. In games without perfect recall, however, CFR's guarantees do not apply. In this paper, we present the first regret bound for CFR when applied to a general class of games with imperfect recall. In addition, we show that CFR applied to any abstraction belonging to our general class results in a regret bound not just for the abstract game, but for the full game as well. We verify our theory and show how imperfect recall can be used to trade a small increase in regret for a significant reduction in memory in three domains: die-roll poker, phantom tic-tac-toe, and Bluff.
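CFR applies a local regret minimizer, typically regret matching, at every information set. The sketch below shows only that building block, in self-play on a one-shot matrix game with expected (not sampled) utilities; it does not model extensive games, abstractions, or imperfect recall, and the payoff matrix and function names are illustrative assumptions.

```python
def strategy_from_regrets(regrets):
    """Regret matching: play in proportion to positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(regrets)] * len(regrets)   # uniform when no positive regret
    return [p / total for p in positive]

def regret_matching_selfplay(A, steps=20000):
    """Row player minimizes x^T A y, column player maximizes it; returns average strategies."""
    n, m = len(A), len(A[0])
    r_row, r_col = [0.0] * n, [0.0] * m
    x_sum, y_sum = [0.0] * n, [0.0] * m
    for _ in range(steps):
        x = strategy_from_regrets(r_row)
        y = strategy_from_regrets(r_col)
        x_sum = [s + xi for s, xi in zip(x_sum, x)]
        y_sum = [s + yj for s, yj in zip(y_sum, y)]
        # Utility of each pure action against the opponent's current mixed strategy.
        u_row = [-sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
        u_col = [sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
        ev_row = sum(xi * ui for xi, ui in zip(x, u_row))
        ev_col = sum(yj * uj for yj, uj in zip(y, u_col))
        # Accumulate regret for not having played each pure action.
        r_row = [r + u - ev_row for r, u in zip(r_row, u_row)]
        r_col = [r + u - ev_col for r, u in zip(r_col, u_col)]
    return [s / steps for s in x_sum], [s / steps for s in y_sum]

def exploitability(A, x, y):
    """Sum of both players' best-response improvements against (x, y)."""
    n, m = len(A), len(A[0])
    best_col = max(sum(A[i][j] * x[i] for i in range(n)) for j in range(m))
    best_row = min(sum(A[i][j] * y[j] for j in range(m)) for i in range(n))
    return best_col - best_row

if __name__ == "__main__":
    A = [[2.0, -1.0], [-1.0, 1.0]]   # illustrative 2x2 game
    x_bar, y_bar = regret_matching_selfplay(A)
    print(x_bar, y_bar, exploitability(A, x_bar, y_bar))
```

Because each player's average regret shrinks at a rate on the order of 1/sqrt(T), the exploitability of the averaged strategies shrinks at the same rate in this zero-sum setting.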
We present fictitious play dynamics for stochastic games and analyze its convergence properties in zero-sum stochastic games. Our dynamics involve players forming beliefs on opponent strategy and their own continuation payoff (Q-function), and playing a greedy best response using estimated continuation payoffs. Players update their beliefs from observations of opponent actions. A key property of the learning dynamics is that update of the beliefs on Q-functions occurs at a slower timescale than update of the beliefs on strategies. We show that, in both the model-based and model-free cases (without knowledge of player payoff functions and state transition probabilities), the beliefs on strategies converge to a stationary mixed Nash equilibrium of the zero-sum stochastic game.
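For intuition, the sketch below implements classical fictitious play in a single-state zero-sum matrix game: each player best-responds to the opponent's empirical action frequencies. It omits the state transitions, Q-function beliefs, and two-timescale updates described in the abstract; the payoff matrix, horizon, tie-breaking rule (lowest index), and function names are assumptions made for illustration.

```python
def fictitious_play(A, steps=20000):
    """Row player minimizes x^T A y, column player maximizes it.

    Returns the empirical action frequencies of both players.
    """
    n, m = len(A), len(A[0])
    row_counts, col_counts = [0] * n, [0] * m
    for _ in range(steps):
        # Cumulative payoff of each pure action against the opponent's past play.
        row_loss = [sum(A[i][j] * col_counts[j] for j in range(m)) for i in range(n)]
        col_gain = [sum(A[i][j] * row_counts[i] for i in range(n)) for j in range(m)]
        # Greedy best responses; ties go to the lowest index.
        i_star = min(range(n), key=row_loss.__getitem__)
        j_star = max(range(m), key=col_gain.__getitem__)
        row_counts[i_star] += 1
        col_counts[j_star] += 1
    return [c / steps for c in row_counts], [c / steps for c in col_counts]

def exploitability(A, x, y):
    """Sum of both players' best-response improvements against (x, y)."""
    n, m = len(A), len(A[0])
    best_col = max(sum(A[i][j] * x[i] for i in range(n)) for j in range(m))
    best_row = min(sum(A[i][j] * y[j] for j in range(m)) for i in range(n))
    return best_col - best_row

if __name__ == "__main__":
    A = [[2.0, -1.0], [-1.0, 1.0]]   # illustrative 2x2 game
    x_freq, y_freq = fictitious_play(A)
    print(x_freq, y_freq, exploitability(A, x_freq, y_freq))
```

In zero-sum matrix games the empirical frequencies are known to approach the set of Nash equilibria, although convergence can be slow, so the exploitability printed here should be small but not zero.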