ترغب بنشر مسار تعليمي؟ اضغط هنا

An Algorithm for Stochastic and Adversarial Bandits with Switching Costs

188   0   0.0 ( 0 )
 نشر من قبل Chlo\\'e Rouyer
 تاريخ النشر 2021
والبحث باللغة English




اسأل ChatGPT حول البحث

We propose an algorithm for stochastic and adversarial multiarmed bandits with switching costs, where the algorithm pays a price $lambda$ every time it switches the arm being played. Our algorithm is based on adaptation of the Tsallis-INF algorithm of Zimmert and Seldin (2021) and requires no prior knowledge of the regime or time horizon. In the oblivious adversarial setting it achieves the minimax optimal regret bound of $Obig((lambda K)^{1/3}T^{2/3} + sqrt{KT}big)$, where $T$ is the time horizon and $K$ is the number of arms. In the stochastically constrained adversarial regime, which includes the stochastic regime as a special case, it achieves a regret bound of $Oleft(big((lambda K)^{2/3} T^{1/3} + ln Tbig)sum_{i eq i^*} Delta_i^{-1}right)$, where $Delta_i$ are the suboptimality gaps and $i^*$ is a unique optimal arm. In the special case of $lambda = 0$ (no switching costs), both bounds are minimax optimal within constants. We also explore variants of the problem, where switching cost is allowed to change over time. We provide experimental evaluation showing competitiveness of our algorithm with the relevant baselines in the stochastic, stochastically constrained adversarial, and adversarial regimes with fixed switching cost.

قيم البحث

اقرأ أيضاً

We derive an algorithm that achieves the optimal (within constants) pseudo-regret in both adversarial and stochastic multi-armed bandits without prior knowledge of the regime and time horizon. The algorithm is based on online mirror descent (OMD) wit h Tsallis entropy regularization with power $alpha=1/2$ and reduced-variance loss estimators. More generally, we define an adversarial regime with a self-bounding constraint, which includes stochastic regime, stochastically constrained adversarial regime (Wei and Luo), and stochastic regime with adversarial corruptions (Lykouris et al.) as special cases, and show that the algorithm achieves logarithmic regret guarantee in this regime and all of its special cases simultaneously with the adversarial regret guarantee.} The algorithm also achieves adversarial and stochastic optimality in the utility-based dueling bandit setting. We provide empirical evaluation of the algorithm demonstrating that it significantly outperforms UCB1 and EXP3 in stochastic environments. We also provide examples of adversarial environments, where UCB1 and Thompson Sampling exhibit almost linear regret, whereas our algorithm suffers only logarithmic regret. To the best of our knowledge, this is the first example demonstrating vulnerability of Thompson Sampling in adversarial environments. Last, but not least, we present a general stochastic analysis and a general adversarial analysis of OMD algorithms with Tsallis entropy regularization for $alphain[0,1]$ and explain the reason why $alpha=1/2$ works best.
We study the adversarial multi-armed bandit problem where partial observations are available and where, in addition to the loss incurred for each action, a emph{switching cost} is incurred for shifting to a new action. All previously known results in cur a factor proportional to the independence number of the feedback graph. We give a new algorithm whose regret guarantee depends only on the domination number of the graph. We further supplement that result with a lower bound. Finally, we also give a new algorithm with improved policy regret bounds when partial counterfactual feedback is available.
We propose a new algorithm for adversarial multi-armed bandits with unrestricted delays. The algorithm is based on a novel hybrid regularizer applied in the Follow the Regularized Leader (FTRL) framework. It achieves $mathcal{O}(sqrt{kn}+sqrt{Dlog(k) })$ regret guarantee, where $k$ is the number of arms, $n$ is the number of rounds, and $D$ is the total delay. The result matches the lower bound within constants and requires no prior knowledge of $n$ or $D$. Additionally, we propose a refined tuning of the algorithm, which achieves $mathcal{O}(sqrt{kn}+min_{S}|S|+sqrt{D_{bar S}log(k)})$ regret guarantee, where $S$ is a set of rounds excluded from delay counting, $bar S = [n]setminus S$ are the counted rounds, and $D_{bar S}$ is the total delay in the counted rounds. If the delays are highly unbalanced, the latter regret guarantee can be significantly tighter than the former. The result requires no advance knowledge of the delays and resolves an open problem of Thune et al. (2019). The new FTRL algorithm and its refined tuning are anytime and require no doubling, which resolves another open problem of Thune et al. (2019).
We derive improved regret bounds for the Tsallis-INF algorithm of Zimmert and Seldin (2021). We show that in adversarial regimes with a $(Delta,C,T)$ self-bounding constraint the algorithm achieves $mathcal{O}left(left(sum_{i eq i^*} frac{1}{Delta_i} right)log_+left(frac{(K-1)T}{left(sum_{i eq i^*} frac{1}{Delta_i}right)^2}right)+sqrt{Cleft(sum_{i eq i^*}frac{1}{Delta_i}right)log_+left(frac{(K-1)T}{Csum_{i eq i^*}frac{1}{Delta_i}}right)}right)$ regret bound, where $T$ is the time horizon, $K$ is the number of arms, $Delta_i$ are the suboptimality gaps, $i^*$ is the best arm, $C$ is the corruption magnitude, and $log_+(x) = maxleft(1,log xright)$. The regime includes stochastic bandits, stochastically constrained adversarial bandits, and stochastic bandits with adversarial corruptions as special cases. Additionally, we provide a general analysis, which allows to achieve the same kind of improvement for generalizations of Tsallis-INF to other settings beyond multiarmed bandits.
We develop the first general semi-bandit algorithm that simultaneously achieves $mathcal{O}(log T)$ regret for stochastic environments and $mathcal{O}(sqrt{T})$ regret for adversarial environments without knowledge of the regime or the number of roun ds $T$. The leading problem-dependent constants of our bounds are not only optimal in some worst-case sense studied previously, but also optimal for two concrete instances of semi-bandit problems. Our algorithm and analysis extend the recent work of (Zimmert & Seldin, 2019) for the special case of multi-armed bandit, but importantly requires a novel hybrid regularizer designed specifically for semi-bandit. Experimental results on synthetic data show that our algorithm indeed performs well uniformly over different environments. We finally provide a preliminary extension of our results to the full bandit feedback.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا