Simple Combinatorial Algorithms for Combinatorial Bandits: Corruptions and Approximations

64 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Haike Xu

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Haike Xu - Jian Li

التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We consider the stochastic combinatorial semi-bandit problem with adversarial corruptions. We provide a simple combinatorial algorithm that can achieve a regret of $tilde{O}left(C+d^2K/Delta_{min}right)$ where $C$ is the total amount of corruptions, $d$ is the maximal number of arms one can play in each round, $K$ is the number of arms. If one selects only one arm in each round, we achieves a regret of $tilde{O}left(C+sum_{Delta_i>0}(1/Delta_i)right)$. Our algorithm is combinatorial and improves on the previous combinatorial algorithm by [Gupta et al., COLT2019] (their bound is $tilde{O}left(KC+sum_{Delta_i>0}(1/Delta_i)right)$), and almost matches the best known bounds obtained by [Zimmert et al., ICML2019] and [Zimmert and Seldin, AISTATS2019] (up to logarithmic factor). Note that the algorithms in [Zimmert et al., ICML2019] and [Zimmert and Seldin, AISTATS2019] require one to solve complex convex programs while our algorithm is combinatorial, very easy to implement, requires weaker assumptions and has very low oracle complexity and running time. We also study the setting where we only get access to an approximation oracle for the stochastic combinatorial semi-bandit problem. Our algorithm achieves an (approximation) regret bound of $tilde{O}left(dsqrt{KT}right)$. Our algorithm is very simple, only worse than the best known regret bound by $sqrt{d}$, and has much lower oracle complexity than previous work.

قيم البحث

173 - Karthik Abinav Sankararaman , Aleksandrs Slivkins 2017

We unify two prominent lines of work on multi-armed bandits: bandits with knapsacks (BwK) and combinatorial semi-bandits. The former concerns limited resources consumed by the algorithm, e.g., limited supply in dynamic pricing. The latter allows a hu ge number of actions but assumes combinatorial structure and additional feedback to make the problem tractable. We define a common generalization, support it with several motivating examples, and design an algorithm for it. Our regret bounds are comparable with those for BwK and combinatorial semi- bandits.

التعلم الآلي

Combinatorial Bandits without Total Order for Arms

74 - Shuo Yang , Tongzheng Ren , Inderjit S. Dhillon 2021

We consider the combinatorial bandits problem, where at each time step, the online learner selects a size-$k$ subset $s$ from the arms set $mathcal{A}$, where $left|mathcal{A}right| = n$, and observes a stochastic reward of each arm in the selected s et $s$. The goal of the online learner is to minimize the regret, induced by not selecting $s^*$ which maximizes the expected total reward. Specifically, we focus on a challenging setting where 1) the reward distribution of an arm depends on the set $s$ it is part of, and crucially 2) there is textit{no total order} for the arms in $mathcal{A}$. In this paper, we formally present a reward model that captures set-dependent reward distribution and assumes no total order for arms. Correspondingly, we propose an Upper Confidence Bound (UCB) algorithm that maintains UCB for each individual arm and selects the arms with top-$k$ UCB. We develop a novel regret analysis and show an $Oleft(frac{k^2 n log T}{epsilon}right)$ gap-dependent regret bound as well as an $Oleft(k^2sqrt{n T log T}right)$ gap-independent regret bound. We also provide a lower bound for the proposed reward model, which shows our proposed algorithm is near-optimal for any constant $k$. Empirical results on various reward models demonstrate the broad applicability of our algorithm.

التعلم الآلي

(Locally) Differentially Private Combinatorial Semi-Bandits

84 - Xiaoyu Chen , Kai Zheng , Zixin Zhou 2020

In this paper, we study Combinatorial Semi-Bandits (CSB) that is an extension of classic Multi-Armed Bandits (MAB) under Differential Privacy (DP) and stronger Local Differential Privacy (LDP) setting. Since the server receives more information from users in CSB, it usually causes additional dependence on the dimension of data, which is a notorious side-effect for privacy preserving learning. However for CSB under two common smoothness assumptions cite{kveton2015tight,chen2016combinatorial}, we show it is possible to remove this side-effect. In detail, for $B_{infty}$-bounded smooth CSB under either $varepsilon$-LDP or $varepsilon$-DP, we prove the optimal regret bound is $Theta(frac{mB^2_{infty}ln T } {Deltaepsilon^2})$ or $tilde{Theta}(frac{mB^2_{infty}ln T} { Deltaepsilon})$ respectively, where $T$ is time period, $Delta$ is the gap of rewards and $m$ is the number of base arms, by proposing novel algorithms and matching lower bounds. For $B_1$-bounded smooth CSB under $varepsilon$-DP, we also prove the optimal regret bound is $tilde{Theta}(frac{mKB^2_1ln T} {Deltaepsilon})$ with both upper bound and lower bound, where $K$ is the maximum number of feedback in each round. All above results nearly match corresponding non-private optimal rates, which imply there is no additional price for (locally) differentially private CSB in above common settings.

التعلم الآلي التعلم الالي

Combinatorial Bandits for Incentivizing Agents with Dynamic Preferences

135 - Tanner Fiez , Shreyas Sekar , Liyuan Zheng 2018

The design of personalized incentives or recommendations to improve user engagement is gaining prominence as digital platform providers continually emerge. We propose a multi-armed bandit framework for matching incentives to users, whose preferences are unknown a priori and evolving dynamically in time, in a resource constrained environment. We design an algorithm that combines ideas from three distinct domains: (i) a greedy matching paradigm, (ii) the upper confidence bound algorithm (UCB) for bandits, and (iii) mixing times from the theory of Markov chains. For this algorithm, we provide theoretical bounds on the regret and demonstrate its performance via both synthetic and realistic (matching supply and demand in a bike-sharing platform) examples.

التعلم الآلي الذكاء الاصطناعي أنظمة وتحكم

Efficient Learning in Large-Scale Combinatorial Semi-Bandits

251 - Zheng Wen , Branislav Kveton , 2014

A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to combinatorial constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we consider efficient learning in large-scale combinatorial semi-bandits with linear generalization, and as a solution, propose two learning algorithms called Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB). Both algorithms are computationally efficient as long as the offline version of the combinatorial problem can be solved efficiently. We establish that CombLinTS and CombLinUCB are also provably statistically efficient under reasonable assumptions, by developing regret bounds that are independent of the problem scale (number of items) and sublinear in time. We also evaluate CombLinTS on a variety of problems with thousands of items. Our experiment results demonstrate that CombLinTS is scalable, robust to the choice of algorithm parameters, and significantly outperforms the best of our baselines.

التعلم الآلي الذكاء الاصطناعي التعلم الالي