Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions

99 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Kei Takemura

تاريخ النشر 2021

مجال البحث الاحصاء الرياضي الهندسة المعلوماتية

والبحث باللغة English

تأليف Kei Takemura - Shinji Ito - Daisuke Hatano

التعلم الالي التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The contextual combinatorial semi-bandit problem with linear payoff functions is a decision-making problem in which a learner chooses a set of arms with the feature vectors in each round under given constraints so as to maximize the sum of rewards of arms. Several existing algorithms have regret bounds that are optimal with respect to the number of rounds $T$. However, there is a gap of $tilde{O}(max(sqrt{d}, sqrt{k}))$ between the current best upper and lower bounds, where $d$ is the dimension of the feature vectors, $k$ is the number of the chosen arms in a round, and $tilde{O}(cdot)$ ignores the logarithmic factors. The dependence of $k$ and $d$ is of practical importance because $k$ may be larger than $T$ in real-world applications such as recommender systems. In this paper, we fill the gap by improving the upper and lower bounds. More precisely, we show that the C${}^2$UCB algorithm proposed by Qin, Chen, and Zhu (2014) has the optimal regret bound $tilde{O}(dsqrt{kT} + dk)$ for the partition matroid constraints. For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C${}^2$UCB algorithm and demonstrate that it enjoys the optimal regret bound for a more general problem that can take into account other objectives simultaneously. We also show that our technique would be applicable to related problems. Numerical experiments support our theoretical results and considerations.

قيم البحث

93 - Xi Chen , Yanjun Han , Yining Wang 2021

In this paper we study the adversarial combinatorial bandit with a known non-linear reward function, extending existing work on adversarial linear combinatorial bandit. {The adversarial combinatorial bandit with general non-linear reward is an import ant open problem in bandit literature, and it is still unclear whether there is a significant gap from the case of linear reward, stochastic bandit, or semi-bandit feedback.} We show that, with $N$ arms and subsets of $K$ arms being chosen at each of $T$ time periods, the minimax optimal regret is $widetildeTheta_{d}(sqrt{N^d T})$ if the reward function is a $d$-degree polynomial with $d< K$, and $Theta_K(sqrt{N^K T})$ if the reward function is not a low-degree polynomial. {Both bounds are significantly different from the bound $O(sqrt{mathrm{poly}(N,K)T})$ for the linear case, which suggests that there is a fundamental gap between the linear and non-linear reward structures.} Our result also finds applications to adversarial assortment optimization problem in online recommendation. We show that in the worst-case of adversarial assortment problem, the optimal algorithm must treat each individual $binom{N}{K}$ assortment as independent.

التعلم الالي التعلم الآلي

Parallelizing Contextual Linear Bandits

175 - Jeffrey Chan , Aldo Pacchiano , Nilesh Tripuraneni 2021

Standard approaches to decision-making under uncertainty focus on sequential exploration of the space of decisions. However, textit{simultaneously} proposing a batch of decisions, which leverages available resources for parallel experimentation, has the potential to rapidly accelerate exploration. We present a family of (parallel) contextual linear bandit algorithms, whose regret is nearly identical to their perfectly sequential counterparts -- given access to the same total number of oracle queries -- up to a lower-order burn-in term that is dependent on the context-set geometry. We provide matching information-theoretic lower bounds on parallel regret performance to establish our algorithms are asymptotically optimal in the time horizon. Finally, we also present an empirical evaluation of these parallel algorithms in several domains, including materials discovery and biological sequence design problems, to demonstrate the utility of parallelized bandits in practical settings.

التعلم الالي التعلم الآلي الجزيئات الحيوية

Regret Bounds for Batched Bandits

123 - Hossein Esfandiari , Amin Karbasi , Abbas Mehrabian 2019

We present simple and efficient algorithms for the batched stochastic multi-armed bandit and batched stochastic linear bandit problems. We prove bounds for their expected regrets that improve over the best-known regret bounds for any number of batche s. In particular, our algorithms in both settings achieve the optimal expected regrets by using only a logarithmic number of batches. We also study the batched adversarial multi-armed bandit problem for the first time and find the optimal regret, up to logarithmic factors, of any algorithm with predetermined batch sizes.

بنى وهياكل البيانات والخوارزميات التعلم الآلي

Robust Stochastic Linear Contextual Bandits Under Adversarial Attacks

119 - Qin Ding , Cho-Jui Hsieh , James Sharpnack 2021

Stochastic linear contextual bandit algorithms have substantial applications in practice, such as recommender systems, online advertising, clinical trials, etc. Recent works show that optimal bandit algorithms are vulnerable to adversarial attacks an d can fail completely in the presence of attacks. Existing robust bandit algorithms only work for the non-contextual setting under the attack of rewards and cannot improve the robustness in the general and popular contextual bandit environment. In addition, none of the existing methods can defend against attacked context. In this work, we provide the first robust bandit algorithm for stochastic linear contextual bandit setting under a fully adaptive and omniscient attack. Our algorithm not only works under the attack of rewards, but also under attacked context. Moreover, it does not need any information about the attack budget or the particular form of the attack. We provide theoretical guarantees for our proposed algorithm and show by extensive experiments that our proposed algorithm significantly improves the robustness against various kinds of popular attacks.

التعلم الالي التعلم الآلي

KL-UCB-switch: optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints

124 - Aurelien Garivier , Hedi Hadiji , Pierre Menard 2018

In the context of K-armed stochastic bandits with distribution only assumed to be supported by [0,1], we introduce the first algorithm, called KL-UCB-switch, that enjoys simultaneously a distribution-free regret bound of optimal order $sqrt{KT}$ and a distribution-dependent regret bound of optimal order as well, that is, matching the $kappaln T$ lower bound by Lai & Robbins (1985) and Burnetas & Katehakis (1996). This self-contained contribution simultaneously presents state-of-the-art techniques for regret minimization in bandit models, and an elementary construction of non-asymptotic confidence bounds based on the empirical likelihood method for bounded distributions.

التعلم الالي التعلم الآلي نظرية الإحصاء