ﻻ يوجد ملخص باللغة العربية
Consider a player that in each round $t$ out of $T$ rounds chooses an action and observes the incurred cost after a delay of $d_{t}$ rounds. The cost functions and the delay sequence are chosen by an adversary. We show that even if the players algorithms lose their no regret property due to too large delays, the expected discounted ergodic distribution of play converges to the set of coarse correlated equilibrium (CCE) if the algorithms have no discounted-regret. For a zero-sum game, we show that no discounted-regret is sufficient for the discounted ergodic average of play to converge to the set of Nash equilibria. We prove that the FKM algorithm with $n$ dimensions achieves a regret of $Oleft(nT^{frac{3}{4}}+sqrt{n}T^{frac{1}{3}}D^{frac{1}{3}}right)$ and the EXP3 algorithm with $K$ arms achieves a regret of $Oleft(sqrt{ln Kleft(KT+Dright)}right)$ even when $D=sum_{t=1}^{T}d_{t}$ and $T$ are unknown. These bounds use a novel doubling trick that provably retains the regret bound for when $D$ and $T$ are known. Using these bounds, we show that EXP3 and FKM have no discounted-regret even for $d_{t}=Oleft(tlog tright)$. Therefore, the CCE of a finite or convex unknown game can be approximated even when only delayed bandit feedback is available via simulation.
We study online learning in repeated first-price auctions with censored feedback, where a bidder, only observing the winning bid at the end of each auction, learns to adaptively bid in order to maximize her cumulative payoff. To achieve this goal, th
We study adversarial scaling, a multi-armed bandit model where rewards have a stochastic and an adversarial component. Our model captures display advertising where the click-through-rate can be decomposed to a (fixed across time) arm-quality componen
We propose a new algorithm for adversarial multi-armed bandits with unrestricted delays. The algorithm is based on a novel hybrid regularizer applied in the Follow the Regularized Leader (FTRL) framework. It achieves $mathcal{O}(sqrt{kn}+sqrt{Dlog(k)
This paper investigates the adversarial Bandits with Knapsack (BwK) online learning problem, where a player repeatedly chooses to perform an action, pays the corresponding cost, and receives a reward associated with the action. The player is constrai
We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that delayed Exp3 achieves the $O(sqrt{(KT + D)ln K} )$ regret bound conjectured by Cesa-Bianchi et al. [2019] in the case