We derive improved regret bounds for the Tsallis-INF algorithm of Zimmert and Seldin (2021). We show that in adversarial regimes with a $(\Delta,C,T)$ self-bounding constraint the algorithm achieves an $\mathcal{O}\left(\left(\sum_{i\neq i^*} \frac{1}{\Delta_i}\right)\log_+\left(\frac{(K-1)T}{\left(\sum_{i\neq i^*} \frac{1}{\Delta_i}\right)^2}\right)+\sqrt{C\left(\sum_{i\neq i^*}\frac{1}{\Delta_i}\right)\log_+\left(\frac{(K-1)T}{C\sum_{i\neq i^*}\frac{1}{\Delta_i}}\right)}\right)$ regret bound, where $T$ is the time horizon, $K$ is the number of arms, $\Delta_i$ are the suboptimality gaps, $i^*$ is the best arm, $C$ is the corruption magnitude, and $\log_+(x) = \max\left(1,\log x\right)$. The regime includes stochastic bandits, stochastically constrained adversarial bandits, and stochastic bandits with adversarial corruptions as special cases. Additionally, we provide a general analysis, which makes it possible to achieve the same kind of improvement for generalizations of Tsallis-INF to other settings beyond multi-armed bandits.
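To get a feel for how the two terms of the bound scale, the expression inside the $\mathcal{O}(\cdot)$ can be evaluated numerically. The following is a minimal sketch, not part of the paper: the function names `log_plus` and `bound_expression` are hypothetical, all hidden constants are set to 1 (so the output is not an actual regret guarantee), and the corruption term is taken to be 0 when $C = 0$, which is an assumption based on the limit $C \to 0$.

```python
import math


def log_plus(x):
    """log_+(x) = max(1, log x), as defined in the abstract."""
    return max(1.0, math.log(x))


def bound_expression(gaps, C, T):
    """Evaluate the expression inside the O(.) with all constants set to 1.

    gaps: suboptimality gaps Delta_i of the K-1 suboptimal arms (each > 0).
    C:    corruption magnitude (C >= 0).
    T:    time horizon.
    """
    K = len(gaps) + 1                      # one best arm plus the suboptimal ones
    S = sum(1.0 / d for d in gaps)         # sum over i != i^* of 1 / Delta_i
    gap_term = S * log_plus((K - 1) * T / S**2)
    if C == 0:
        # Assumption: treat the corruption term as its limit for C -> 0.
        corruption_term = 0.0
    else:
        corruption_term = math.sqrt(C * S * log_plus((K - 1) * T / (C * S)))
    return gap_term + corruption_term


if __name__ == "__main__":
    gaps = [0.5, 0.5]                      # hypothetical 3-armed instance
    for C in (0, 100):
        print(C, bound_expression(gaps, C, T=10_000))
```

With $C = 0$ only the gap-dependent logarithmic term remains, which corresponds to the purely stochastic special case; for $C > 0$ the second term grows like the square root of $C$ rather than linearly in it.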
We derive an algorithm that achieves the optimal (within constants) pseudo-regret in both adversarial and stochastic multi-armed bandits without prior knowledge of the regime and time horizon. The algorithm is based on online mirror descent (OMD) wit
We introduce a new model of stochastic bandits with adversarial corruptions which aims to capture settings where most of the input follows a stochastic pattern but some fraction of it can be adversarially changed to trick the algorithm, e.g., click f
We propose an algorithm for stochastic and adversarial multiarmed bandits with switching costs, where the algorithm pays a price $\lambda$ every time it switches the arm being played. Our algorithm is based on adaptation of the Tsallis-INF algorithm o
We study the problem of stochastic bandits with adversarial corruptions in the cooperative multi-agent setting, where $V$ agents interact with a common $K$-armed bandit problem, and each pair of agents can communicate with each other to expedite the
We propose a new algorithm for adversarial multi-armed bandits with unrestricted delays. The algorithm is based on a novel hybrid regularizer applied in the Follow the Regularized Leader (FTRL) framework. It achieves $\mathcal{O}(\sqrt{kn}+\sqrt{D\log(k)