Decentralized Heterogeneous Multi-Player Multi-Armed Bandits with Non-Zero Rewards on Collisions


Abstract

We consider a fully decentralized multi-player stochastic multi-armed bandit setting where the players cannot communicate with each other and can observe only their own actions and rewards. The environment may appear differently to different players, i.e., the reward distributions for a given arm are heterogeneous across players. In the case of a collision (when more than one player plays the same arm), we allow for the colliding players to receive non-zero rewards. The time-horizon $T$ for which the arms are played is not known to the players. Within this setup, where the number of players is allowed to be greater than the number of arms, we present a policy that achieves near order-optimal expected regret of order $O(\log^{1 + \delta} T)$ for some $0 < \delta < 1$ over a time-horizon of duration $T$. This paper is currently under review at IEEE Transactions on Information Theory.
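To make the setting concrete, the following is a minimal sketch of a single round of play, not the paper's policy or its exact reward model. It assumes Bernoulli rewards and a hypothetical per-player table `collision_means` giving the mean reward a player receives when sharing an arm; the function name `play_round` and both tables are illustrative assumptions that do not come from the paper.

```python
import random


def play_round(means, collision_means, choices, rng):
    """Simulate one round of a heterogeneous multi-player bandit.

    means[p][a]: mean reward of arm a for player p when playing it alone
        (heterogeneous: each player has its own row).
    collision_means[p][a]: ASSUMED mean reward for player p on arm a when
        at least one other player picks the same arm (may be non-zero).
    choices[p]: arm index chosen by player p.
    Returns one Bernoulli reward per player; in the decentralized setting
    each player would observe only its own entry.
    """
    # Count how many players chose each arm to detect collisions.
    counts = {}
    for arm in choices:
        counts[arm] = counts.get(arm, 0) + 1

    rewards = []
    for player, arm in enumerate(choices):
        collided = counts[arm] > 1
        mean = collision_means[player][arm] if collided else means[player][arm]
        rewards.append(1 if rng.random() < mean else 0)
    return rewards


if __name__ == "__main__":
    rng = random.Random(0)
    # Three players, two arms: more players than arms, so collisions are unavoidable.
    means = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.7]]
    collision_means = [[0.4, 0.1], [0.1, 0.3], [0.2, 0.3]]
    print(play_round(means, collision_means, [0, 1, 1], rng))
```

With three players and two arms, at least two players must share an arm in every round, which is why a non-zero collision reward changes which assignments of players to arms are worth learning.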
