ترغب بنشر مسار تعليمي؟ اضغط هنا

Exploration-Exploitation in Multi-Agent Competition: Convergence with Bounded Rationality

172   0   0.0 ( 0 )
 نشر من قبل Stefanos Leonardos Mr.
 تاريخ النشر 2021
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

The interplay between exploration and exploitation in competitive multi-agent learning is still far from being well understood. Motivated by this, we study smooth Q-learning, a prototypical learning model that explicitly captures the balance between game rewards and exploration costs. We show that Q-learning always converges to the unique quantal-response equilibrium (QRE), the standard solution concept for games under bounded rationality, in weighted zero-sum polymatrix games with heterogeneous learning agents using positive exploration rates. Complementing recent results about convergence in weighted potential games, we show that fast convergence of Q-learning in competitive settings is obtained regardless of the number of agents and without any need for parameter fine-tuning. As showcased by our experiments in network zero-sum games, these theoretical results provide the necessary guarantees for an algorithmic approach to the currently open problem of equilibrium selection in competitive multi-agent settings.



قيم البحث

اقرأ أيضاً

Most online platforms strive to learn from interactions with users, and many engage in exploration: making potentially suboptimal choices for the sake of acquiring new information. We study the interplay between exploration and competition: how such platforms balance the exploration for learning and the competition for users. Here users play three distinct roles: they are customers that generate revenue, they are sources of data for learning, and they are self-interested agents which choose among the competing platforms. We consider a stylized duopoly model in which two firms face the same multi-armed bandit problem. Users arrive one by one and choose between the two firms, so that each firm makes progress on its bandit problem only if it is chosen. Through a mix of theoretical results and numerical simulations, we study whether and to what extent competition incentivizes the adoption of better bandit algorithms, and whether it leads to welfare increases for users. We find that stark competition induces firms to commit to a greedy bandit algorithm that leads to low welfare. However, weakening competition by providing firms with some free users incentivizes better exploration strategies and increases welfare. We investigate two channels for weakening the competition: relaxing the rationality of users and giving one firm a first-mover advantage. Our findings are closely related to the competition vs. innovation relationship, and elucidate the first-mover advantage in the digital economy.
Training a multi-agent reinforcement learning (MARL) model is generally difficult because there are numerous combinations of complex interactions among agents that induce certain reward signals. Especially when there is a sparse reward signal, the tr aining becomes more difficult. Previous studies have tried to resolve this issue by employing an intrinsic reward, which is a signal specifically designed for inducing the interactions among agents, to boost the MARL model training. However, this approach requires extensive prior knowledge to design an intrinsic reward. To optimize the training of an MARL model, we propose a learning-based exploration strategy to generate the initial states of a game. The proposed method adopts a variational graph autoencoder to represent a state of a game such that (1) the state can be compactly encoded to the latent representation by considering the relationship among agents, and (2) the latent representation can be used as an effective input to the surrogate model predicting the exploration score. The proposed method determines the latent representations that maximize the surrogate model and decodes these representations to generate the initial states from which the MARL model starts training. Empirically, we demonstrate that the generated states improve the training and performance of MARL more than the existing exploration methods.
High sample complexity remains a barrier to the application of reinforcement learning (RL), particularly in multi-agent systems. A large body of work has demonstrated that exploration mechanisms based on the principle of optimism under uncertainty ca n significantly improve the sample efficiency of RL in single agent tasks. This work seeks to understand the role of optimistic exploration in non-cooperative multi-agent settings. We will show that, in zero-sum games, optimistic exploration can cause the learner to waste time sampling parts of the state space that are irrelevant to strategic play, as they can only be reached through cooperation between both players. To address this issue, we introduce a formal notion of strategically efficient exploration in Markov games, and use this to develop two strategically efficient learning algorithms for finite Markov games. We demonstrate that these methods can be significantly more sample efficient than their optimistic counterparts.
We empirically study the interplay between exploration and competition. Systems that learn from interactions with users often engage in exploration: making potentially suboptimal decisions in order to acquire new information for future decisions. How ever, when multiple systems are competing for the same market of users, exploration may hurt a systems reputation in the near term, with adverse competitive effects. In particular, a system may enter a death spiral, when the short-term reputation cost decreases the number of users for the system to learn from, which degrades its performance relative to competition and further decreases its market share. We ask whether better exploration algorithms are incentivized under competition. We run extensive numerical experiments in a stylized duopoly model in which two firms deploy multi-armed bandit algorithms and compete for myopic users. We find that duopoly and monopoly tend to favor a primitive greedy algorithm that does not explore and leads to low consumer welfare, whereas a temporary monopoly (a duopoly with an early entrant) may incentivize better bandit algorithms and lead to higher consumer welfare. Our findings shed light on the first-mover advantage in the digital economy by exploring the role that data can play as a barrier to entry in online markets.
VDN and QMIX are two popular value-based algorithms for cooperative MARL that learn a centralized action value function as a monotonic mixing of per-agent utilities. While this enables easy decentralization of the learned policy, the restricted joint action value function can prevent them from solving tasks that require significant coordination between agents at a given timestep. We show that this problem can be overcome by improving the joint exploration of all agents during training. Specifically, we propose a novel MARL approach called Universal Value Exploration (UneVEn) that learns a set of related tasks simultaneously with a linear decomposition of universal successor features. With the policies of already solved related tasks, the joint exploration process of all agents can be improved to help them achieve better coordination. Empirical results on a set of exploration games, challenging cooperative predator-prey tasks requiring significant coordination among agents, and StarCraft II micromanagement benchmarks show that UneVEn can solve tasks where other state-of-the-art MARL methods fail.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا