Cooperative and Stochastic Multi-Player Multi-Armed Bandit: Optimal Regret With Neither Communication Nor Collisions

Published by Mark Sellke

Publication date: 2020

Research field: Informatics Engineering

Paper language: English

Authors: Sebastien Bubeck - Thomas Budzinski - Mark Sellke

Abstract

We consider the cooperative multi-player version of the stochastic multi-armed bandit problem. We study the regime where the players cannot communicate but have access to shared randomness. In prior work by the first two authors, a strategy for this regime was constructed for two players and three arms, with regret $\tilde{O}(\sqrt{T})$, and with no collisions at all between the players (with very high probability). In this paper we show that these properties (near-optimal regret and no collisions at all) are achievable for any number of players and arms. At a high level, the previous strategy relied heavily on a $2$-dimensional geometric intuition that was difficult to generalize to higher dimensions, while here we take a more combinatorial route to build the new strategy.
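
To make the setting concrete, here is a minimal sketch of how shared randomness alone can rule out collisions without any communication. It is not the strategy constructed in the paper: each player simply derives the same random permutation of the arms from a shared seed and plays the arm at its own index. The names `choose_arm`, `count_collisions`, and `shared_seed`, and the uniform-permutation rule itself, are assumptions made for illustration, and the sketch assumes there are at least as many arms as players.

```python
import random


def choose_arm(player_id: int, round_index: int, num_arms: int, shared_seed: int) -> int:
    """Arm played by one player, computed only from shared quantities.

    Hypothetical rule for illustration: every player rebuilds the same
    generator from the shared seed and the round index, so all players
    obtain the same permutation without exchanging messages.
    """
    rng = random.Random(shared_seed * 1_000_003 + round_index)
    permutation = list(range(num_arms))
    rng.shuffle(permutation)
    # Distinct player ids index distinct entries of one permutation,
    # so two players can never pick the same arm in the same round.
    return permutation[player_id]


def count_collisions(num_players: int, num_arms: int, horizon: int, shared_seed: int) -> int:
    """Count how many players collided over `horizon` rounds."""
    if num_players > num_arms:
        raise ValueError("this sketch assumes num_players <= num_arms")
    collisions = 0
    for t in range(horizon):
        arms = [choose_arm(p, t, num_arms, shared_seed) for p in range(num_players)]
        collisions += len(arms) - len(set(arms))
    return collisions
```

This toy rule ignores the observed rewards entirely, so its regret grows linearly in $T$. The result stated in the abstract is that collision avoidance can be combined with near-optimal regret.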

William Chang, Mehdi Jafarnia-Jahromi, Rahul Jain (2021)

We introduce a framework for decentralized online learning for multi-armed bandits (MAB) with multiple cooperative players. The reward obtained by the players in each round depends on the actions taken by all the players. It is a team setting, and the objective is common. Information asymmetry is what makes the problem interesting and challenging. We consider three types of information asymmetry: action information asymmetry, when the actions of the players cannot be observed but the rewards received are common; reward information asymmetry, when the actions of the other players are observable but the rewards received are IID from the same distribution; and the case where we have both action and reward information asymmetry. For the first setting, we propose a UCB-inspired algorithm that achieves $O(\log T)$ regret whether the rewards are IID or Markovian. For the second setting, we offer an environment such that the algorithm given for the first setting gives linear regret. For the third setting, we show that a variation of the "explore then commit" algorithm achieves almost log regret.
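
For readers unfamiliar with the UCB family that the abstract refers to, the following is a minimal single-player sketch of the classical UCB1 rule with Bernoulli rewards. It is not the multi-player algorithm proposed in that paper; the function names, the Bernoulli reward model, and the exploration constant 2 are assumptions chosen for illustration.

```python
import math
import random


def ucb1_index(mean_reward: float, pulls: int, t: int) -> float:
    """Classical UCB1 index: empirical mean plus an exploration bonus."""
    return mean_reward + math.sqrt(2.0 * math.log(t) / pulls)


def run_ucb1(means, horizon: int, seed: int = 0):
    """Play UCB1 on Bernoulli arms and return the pull count of each arm."""
    rng = random.Random(seed)
    num_arms = len(means)
    pulls = [0] * num_arms
    totals = [0.0] * num_arms
    for t in range(1, horizon + 1):
        if t <= num_arms:
            # Initialisation: play every arm once so each index is defined.
            arm = t - 1
        else:
            arm = max(
                range(num_arms),
                key=lambda i: ucb1_index(totals[i] / pulls[i], pulls[i], t),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
    return pulls
```

Calling `run_ucb1([0.9, 0.1], 1000)` concentrates most pulls on the first arm, since the exploration bonus of a rarely played arm shrinks only logarithmically in the round index.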

Machine learning

Decentralized Heterogeneous Multi-Player Multi-Armed Bandits with Non-Zero Rewards on Collisions

Akshayaa Magesh, Venugopal V. Veeravalli (2019)

We consider a fully decentralized multi-player stochastic multi-armed bandit setting where the players cannot communicate with each other and can observe only their own actions and rewards. The environment may appear differently to different players, i.e., the reward distributions for a given arm are heterogeneous across players. In the case of a collision (when more than one player plays the same arm), we allow for the colliding players to receive non-zero rewards. The time-horizon $T$ for which the arms are played is not known to the players. Within this setup, where the number of players is allowed to be greater than the number of arms, we present a policy that achieves near order-optimal expected regret of order $O(\log^{1 + \delta} T)$ for some $0 < \delta < 1$ over a time-horizon of duration $T$. This paper is currently under review at IEEE Transactions on Information Theory.

Machine learning, Computer science and game theory

Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays

Junpei Komiyama, Junya Honda, Hiroshi Nakagawa (2015)

We discuss a multiple-play multi-armed bandit (MAB) problem in which several arms are selected at each round. Recently, Thompson sampling (TS), a randomized algorithm with a Bayesian spirit, has attracted much attention for its empirically excellent performance, and it has been shown to have an optimal regret bound in the standard single-play MAB problem. In this paper, we propose the multiple-play Thompson sampling (MP-TS) algorithm, an extension of TS to the multiple-play MAB problem, and discuss its regret analysis. We prove that MP-TS for binary rewards has the optimal regret upper bound that matches the regret lower bound provided by Anantharam et al. (1987). Therefore, MP-TS is the first computationally efficient algorithm with optimal regret. A set of computer simulations was also conducted, which compared MP-TS with state-of-the-art algorithms. We also propose a modification of MP-TS, which is shown to have better empirical performance.
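
The idea of extending Thompson sampling to multiple plays can be sketched as follows for binary rewards: keep a Beta posterior per arm, draw one sample from each posterior, and play the arms with the largest samples. This is a sketch under stated assumptions rather than the paper's exact procedure: the uniform Beta(1, 1) prior, the function names `mp_ts_select` and `run_mp_ts`, and the simulation loop are illustrative choices, and the improved variant mentioned in the abstract is not reproduced.

```python
import random


def mp_ts_select(successes, failures, num_plays: int, rng: random.Random):
    """Draw one posterior sample per arm and return the top `num_plays` arms."""
    # Beta(s + 1, f + 1) is the posterior under an assumed uniform prior.
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    ranked = sorted(range(len(samples)), key=lambda i: samples[i], reverse=True)
    return ranked[:num_plays]


def run_mp_ts(means, num_plays: int, horizon: int, seed: int = 0):
    """Simulate the sketch on Bernoulli arms and return pull counts per arm."""
    rng = random.Random(seed)
    num_arms = len(means)
    successes = [0] * num_arms
    failures = [0] * num_arms
    pulls = [0] * num_arms
    for _ in range(horizon):
        for arm in mp_ts_select(successes, failures, num_plays, rng):
            reward = 1 if rng.random() < means[arm] else 0
            successes[arm] += reward
            failures[arm] += 1 - reward
            pulls[arm] += 1
    return pulls
```

With `means = [0.9, 0.8, 0.2, 0.1]` and `num_plays = 2`, the pull counts should concentrate on the first two arms as the posteriors sharpen.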

Machine learning

Cooperative Stochastic Multi-agent Multi-armed Bandits Robust to Adversarial Corruptions

Junyan Liu, Shuai Li, Dapeng Li (2021)

We study the problem of stochastic bandits with adversarial corruptions in the cooperative multi-agent setting, where $V$ agents interact with a common $K$-armed bandit problem, and each pair of agents can communicate with each other to expedite the learning process. In the problem, the rewards are independently sampled from distributions across all agents and rounds, but they may be corrupted by an adversary. Our goal is to minimize both the overall regret and communication cost across all agents. We first show that an additive term of corruption is unavoidable for any algorithm in this problem. Then, we propose a new algorithm that is agnostic to the level of corruption. Our algorithm not only achieves near-optimal regret in the stochastic setting, but also obtains a regret with an additive term of corruption in the corrupted setting, while maintaining efficient communication. The algorithm is also applicable for the single-agent corruption problem, and achieves a high probability regret that removes the multiplicative dependence of $K$ on corruption level. Our result of the single-agent case resolves an open question from Gupta et al. [2019].

Machine learning

Cooperative Multi-Agent Bandits with Heavy Tails

Abhimanyu Dubey, Alex Pentland (2020)

We study the heavy-tailed stochastic bandit problem in the cooperative multi-agent setting, where a group of agents interact with a common bandit problem, while communicating on a network with delays. Existing algorithms for the stochastic bandit in this setting utilize confidence intervals arising from an averaging-based communication protocol known as running consensus, that does not lend itself to robust estimation for heavy-tailed settings. We propose MP-UCB, a decentralized multi-agent algorithm for the cooperative stochastic bandit that incorporates robust estimation with a message-passing protocol. We prove optimal regret bounds for MP-UCB for several problem settings, and also demonstrate its superiority to existing methods. Furthermore, we establish the first lower bounds for the cooperative bandit problem, in addition to providing efficient algorithms for robust bandit estimation of location.
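
To illustrate why plain averaging is fragile under heavy tails, here is a minimal sketch of one standard robust mean estimator, the median of means. The abstract does not say which robust estimator MP-UCB uses, so this choice, the function name `median_of_means`, and the block-splitting rule are assumptions made purely for illustration.

```python
import statistics


def median_of_means(samples, num_blocks: int) -> float:
    """Split samples into equal blocks, average each block, return the median.

    Leftover samples that do not fill a complete block are ignored.
    """
    if not 1 <= num_blocks <= len(samples):
        raise ValueError("num_blocks must be between 1 and len(samples)")
    block_size = len(samples) // num_blocks
    block_means = [
        statistics.fmean(samples[i * block_size:(i + 1) * block_size])
        for i in range(num_blocks)
    ]
    return statistics.median(block_means)
```

For instance, with nine samples equal to 1.0 and one outlier equal to 1000.0, the plain mean is pulled above 100, whereas the median of means over five blocks stays at 1.0 because the outlier contaminates only one block.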

Machine learning, Multi-agent systems