No Arabic abstract
We consider stochastic bandit problems with $K$ arms, each associated with a bounded distribution supported on the range $[m,M]$. We do not assume that the range $[m,M]$ is known and show that there is a cost for learning this range. Indeed, a new trade-off between distribution-dependent and distribution-free regret bounds arises, which prevents from simultaneously achieving the typical $ln T$ and smash{$sqrt{T}$} bounds. For instance, a smash{$sqrt{T}$} distribution-free regret bound may only be achieved if the distribution-dependent regret bounds are at least of order smash{$sqrt{T}$}. We exhibit a strategy achieving the rates for regret indicated by the new trade-off.
We consider a stochastic bandit problem with countably many arms that belong to a finite set of types, each characterized by a unique mean reward. In addition, there is a fixed distribution over types which sets the proportion of each type in the population of arms. The decision maker is oblivious to the type of any arm and to the aforementioned distribution over types, but perfectly knows the total number of types occurring in the population of arms. We propose a fully adaptive online learning algorithm that achieves O(log n) distribution-dependent expected cumulative regret after any number of plays n, and show that this order of regret is best possible. The analysis of our algorithm relies on newly discovered concentration and convergence properties of optimism-based policies like UCB in finite-armed bandit problems with zero gap, which may be of independent interest.
We introduce a new class of reinforcement learning methods referred to as {em episodic multi-armed bandits} (eMAB). In eMAB the learner proceeds in {em episodes}, each composed of several {em steps}, in which it chooses an action and observes a feedback signal. Moreover, in each step, it can take a special action, called the $stop$ action, that ends the current episode. After the $stop$ action is taken, the learner collects a terminal reward, and observes the costs and terminal rewards associated with each step of the episode. The goal of the learner is to maximize its cumulative gain (i.e., the terminal reward minus costs) over all episodes by learning to choose the best sequence of actions based on the feedback. First, we define an {em oracle} benchmark, which sequentially selects the actions that maximize the expected immediate gain. Then, we propose our online learning algorithm, named {em FeedBack Adaptive Learning} (FeedBAL), and prove that its regret with respect to the benchmark is bounded with high probability and increases logarithmically in expectation. Moreover, the regret only has polynomial dependence on the number of steps, actions and states. eMAB can be used to model applications that involve humans in the loop, ranging from personalized medical screening to personalized web-based education, where sequences of actions are taken in each episode, and optimal behavior requires adapting the chosen actions based on the feedback.
We consider the bandit-based framework for diversity-preserving recommendations introduced by Celis et al. (2019), who approached it mainly by a reduction to the setting of linear bandits. We design a UCB algorithm using the specific structure of the setting and show that it enjoys a bounded distribution-dependent regret in the natural cases when the optimal mixed actions put some probability mass on all actions (i.e., when diversity is desirable). Simulations illustrate this fact. We also provide regret lower bounds and briefly discuss distribution-free regret bounds.
We study a variant of the classical multi-armed bandit problem (MABP) which we call as Multi-Armed Bandits with dependent arms. More specifically, multiple arms are grouped together to form a cluster, and the reward distributions of arms belonging to the same cluster are known functions of an unknown parameter that is a characteristic of the cluster. Thus, pulling an arm $i$ not only reveals information about its own reward distribution, but also about all those arms that share the same cluster with arm $i$. This correlation amongst the arms complicates the exploration-exploitation trade-off that is encountered in the MABP because the observation dependencies allow us to test simultaneously multiple hypotheses regarding the optimality of an arm. We develop learning algorithms based on the UCB principle which utilize these additional side observations appropriately while performing exploration-exploitation trade-off. We show that the regret of our algorithms grows as $O(Klog T)$, where $K$ is the number of clusters. In contrast, for an algorithm such as the vanilla UCB that is optimal for the classical MABP and does not utilize these dependencies, the regret scales as $O(Mlog T)$ where $M$ is the number of arms.
We investigate whether in a distributed setting, adaptive estimation of a smooth function at the optimal rate is possible under minimal communication. It turns out that the answer depends on the risk considered and on the number of servers over which the procedure is distributed. We show that for the $L_infty$-risk, adaptively obtaining optimal rates under minimal communication is not possible. For the $L_2$-risk, it is possible over a range of regularities that depends on the relation between the number of local servers and the total sample size.