We address the M-best-arm identification problem in multi-armed bandits. A player has a limited budget to explore K arms (M<K), and once pulled, each arm yields a reward drawn independently from a fixed, unknown distribution. The goal is to find the top M arms in the sense of expected reward. We develop an algorithm that proceeds in rounds, deactivating arms iteratively. At each round, the budget is divided by a nonlinear function of the number of remaining arms, and the arms are pulled accordingly. Based on a decision rule, the arm deactivated at each round may be accepted or rejected; the algorithm outputs the accepted arms, which should ideally be the top M arms. We characterize the decay rate of the misidentification probability and establish that the nonlinear budget allocation proves useful across different problem environments (described by the number of competitive arms). We provide comprehensive numerical experiments showing that, with a suitable choice of nonlinearity, our algorithm outperforms the state-of-the-art.
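To make the round structure concrete, here is a minimal Python sketch in the spirit of accept/reject sequential elimination. The schedule exponent p, the gap-based decision rule, and the Bernoulli reward simulation are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def top_m_sequential_elimination(means, M, budget, p=1.5, rng=None):
    """Round-based accept/reject elimination for top-M identification (sketch).

    means  : true Bernoulli means, used here only to simulate pulls
    M      : number of arms to identify
    budget : total number of pulls allowed
    p      : exponent of the (assumed) nonlinear per-round budget split
    """
    rng = rng or np.random.default_rng()
    K = len(means)
    active = list(range(K))
    accepted = []
    pulls = np.zeros(K)
    rewards = np.zeros(K)

    # Hypothetical nonlinear schedule: round r (with K + 1 - r arms still
    # active) gets a budget share proportional to 1 / (K + 1 - r)^p, so later
    # rounds, with fewer surviving arms, receive more pulls per arm.
    weights = np.array([1.0 / (K + 1 - r) ** p for r in range(1, K)])
    round_budgets = np.floor(budget * weights / weights.sum()).astype(int)

    for n_r in round_budgets:                 # K - 1 elimination rounds
        per_arm = max(1, int(n_r) // len(active))
        for a in active:                      # pull every surviving arm
            rewards[a] += rng.binomial(per_arm, means[a])
            pulls[a] += per_arm
        idx = np.array(active)
        mu_hat = rewards[idx] / pulls[idx]
        order = np.argsort(mu_hat)            # positions in `active`, ascending
        m_left = M - len(accepted)
        # Illustrative decision rule: accept the empirical best arm if its gap
        # to the top-M boundary exceeds the gap of the empirical worst arm.
        gap_top = mu_hat[order[-1]] - mu_hat[order[-(m_left + 1)]]
        gap_bot = mu_hat[order[m_left]] - mu_hat[order[0]]
        if gap_top >= gap_bot:
            accepted.append(active[order[-1]])
            active.pop(order[-1])
        else:
            active.pop(order[0])
        if len(accepted) == M:
            break
    return accepted + active[: M - len(accepted)]

# Toy run: three clearly competitive arms among eight.
arms = [0.9, 0.85, 0.8, 0.5, 0.45, 0.4, 0.3, 0.2]
print(sorted(top_m_sequential_elimination(arms, M=3, budget=6000)))
```

In this sketch, p = 0 splits the budget evenly over rounds, while larger p shifts pulls toward the final rounds, where only the most competitive arms are still active.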
We consider the best-arm identification problem in multi-armed bandits, which focuses purely on exploration. A player is given a fixed budget to explore a finite set of arms, and the rewards of each arm are drawn independently from a fixed, unknown distribution. The player aims to identify the arm with the largest expected reward. We propose a general framework to unify sequential elimination algorithms, where the arms are dismissed iteratively until a unique arm is left. Our analysis reveals a novel performance measure expressed in terms of the sampling mechanism and the number of eliminated arms at each round. Based on this result, we develop an algorithm that divides the budget according to a nonlinear function of the number of remaining arms at each round. We provide theoretical guarantees for the algorithm, characterizing the suitable nonlinearity for different problem environments described by the number of competitive arms. Matching the theoretical results, our experiments show that the nonlinear algorithm outperforms the state-of-the-art. We finally study the side-observation model, where pulling an arm reveals the rewards of its related arms, and we establish improved theoretical guarantees in the pure-exploration setting.
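As a small numerical illustration of how the nonlinearity reshapes the per-round sampling, the snippet below assumes that a round's budget share is proportional to 1/(number of remaining arms)^p; the parameterization is an assumption chosen only to make the trade-off visible, not the paper's exact rule.

```python
import numpy as np

def pulls_per_round(K, budget, p):
    """Per-arm pulls in each elimination round when round r's budget share is
    proportional to 1 / (number of remaining arms)^p (an assumed schedule)."""
    remaining = np.arange(K, 1, -1)          # K, K-1, ..., 2 arms still active
    share = 1.0 / remaining ** p
    share /= share.sum()
    return np.floor(budget * share / remaining).astype(int)

# p = 0 spreads the budget roughly evenly over rounds (a successive-rejects
# flavour); larger p concentrates pulls on the final rounds, where only the
# most competitive arms remain.
for p in (0.0, 1.0, 2.0):
    print(p, pulls_per_round(K=8, budget=4000, p=p))
```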
In this paper we introduce the transductive linear bandit problem: given a set of measurement vectors $\mathcal{X}\subset \mathbb{R}^d$, a set of items $\mathcal{Z}\subset \mathbb{R}^d$, a fixed confidence $\delta$, and an unknown vector $\theta^{\ast}\in \mathbb{R}^d$, the goal is to infer $\text{argmax}_{z\in \mathcal{Z}} z^\top\theta^{\ast}$ with probability $1-\delta$ by making as few sequentially chosen noisy measurements of the form $x^\top\theta^{\ast}$ as possible. When $\mathcal{X}=\mathcal{Z}$, this setting generalizes linear bandits, and when $\mathcal{X}$ is the set of standard basis vectors and $\mathcal{Z}\subset \{0,1\}^d$, combinatorial bandits. Such a transductive setting naturally arises when the set of measurement vectors is limited due to factors such as availability or cost. As an example, in drug discovery the compounds and dosages $\mathcal{X}$ a practitioner may be willing to evaluate in the lab in vitro due to cost or safety reasons may differ vastly from those compounds and dosages $\mathcal{Z}$ that can be safely administered to patients in vivo. Alternatively, in recommender systems for books, the set of books $\mathcal{X}$ a user is queried about may be restricted to well-known best-sellers even though the goal might be to recommend more esoteric titles $\mathcal{Z}$. In this paper, we provide instance-dependent lower bounds for the transductive setting, an algorithm that matches these bounds up to logarithmic factors, and an evaluation. In particular, we provide the first non-asymptotic algorithm for linear bandits that nearly achieves the information-theoretic lower bound.
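To ground the setup, here is a toy loop that measures only vectors from $\mathcal{X}$ and recommends an item from $\mathcal{Z}$. It uses uniformly random measurements and ridge regression as stand-ins; the paper's algorithm instead chooses measurements adaptively via an experimental-design criterion to approach the lower bound.

```python
import numpy as np

def transductive_recommend(X, Z, theta_star, n_samples, noise_sd=1.0,
                           reg=1e-3, rng=None):
    """Toy transductive loop: measure only vectors in X, recommend from Z."""
    rng = rng or np.random.default_rng()
    d = X.shape[1]
    A = reg * np.eye(d)                          # regularized design matrix
    b = np.zeros(d)
    for _ in range(n_samples):
        x = X[rng.integers(len(X))]              # a feasible (random) measurement
        y = x @ theta_star + noise_sd * rng.standard_normal()
        A += np.outer(x, x)
        b += y * x
    theta_hat = np.linalg.solve(A, b)            # ridge estimate of theta*
    return Z[np.argmax(Z @ theta_hat)]           # guess argmax_z z^T theta*

# Example: measurements restricted to the standard basis (combinatorial case).
rng = np.random.default_rng(0)
d = 5
X = np.eye(d)                                    # what we are allowed to measure
Z = rng.integers(0, 2, size=(20, d)).astype(float)   # items we want to rank
theta_star = rng.standard_normal(d)
print(transductive_recommend(X, Z, theta_star, n_samples=500, rng=rng))
```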
We study reward maximisation in a wide class of structured stochastic multi-armed bandit problems, where the mean rewards of the arms satisfy given structural constraints, e.g. linear, unimodal, or sparse. Our aim is to develop methods that are flexible (in that they easily adapt to different structures), powerful (in that they perform well empirically and/or provably match instance-dependent lower bounds), and efficient (in that the per-round computational burden is small). We develop asymptotically optimal algorithms from instance-dependent lower bounds using iterative saddle-point solvers. Our approach generalises recent iterative methods for pure exploration to reward maximisation, where a major challenge arises from estimating the sub-optimality gaps and their reciprocals. Nevertheless, we achieve all of the above desiderata. Notably, our technique avoids the computational cost of the full-blown saddle-point oracle employed by previous work, while at the same time enabling finite-time regret bounds. Our experiments reveal that our method successfully leverages the structural assumptions, while its regret is at worst comparable to that of vanilla UCB.
In this work, we study sequential choice bandits with feedback. We propose bandit algorithms for a platform that personalizes each user's experience to maximize its rewards. For each action directed to a given user, the platform receives a positive reward, which is a non-decreasing function of the action, provided the action is below the user's threshold. Users are equipped with a patience budget, and actions above the threshold decrease the user's patience. When all patience is lost, the user abandons the platform. The platform attempts to learn the users' thresholds in order to maximize its rewards, under two different feedback models describing the information available to the platform after each action. We define a notion of regret by determining the best action to take when the platform knows only that the user's threshold lies in a given interval. We then propose bandit algorithms for the two feedback models and show that the upper and lower bounds on the regret are of order $\tilde{O}(N^{2/3})$ and $\tilde{\Omega}(N^{2/3})$, respectively, where $N$ is the total number of users. Finally, we show that the waiting time of any user before receiving a personalized experience is uniform in $N$.
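To make the interaction protocol concrete, the sketch below simulates a single user together with a deliberately simple probing policy. The identity reward function, the ladder of action levels, and the policy itself are illustrative assumptions rather than the algorithms analysed in the paper.

```python
class SequentialChoiceUser:
    """Toy user model from the abstract: an action at or below the hidden
    threshold pays a non-decreasing reward (here, the action itself); an
    action above it pays nothing and costs one unit of patience, and the
    user abandons the platform once patience runs out."""

    def __init__(self, threshold, patience):
        self.threshold = threshold
        self.patience = patience
        self.active = True

    def respond(self, action):
        if not self.active:
            raise RuntimeError("user has already abandoned the platform")
        if action <= self.threshold:
            return action
        self.patience -= 1
        if self.patience == 0:
            self.active = False
        return 0.0


def cautious_ladder(user, levels, horizon):
    """Illustrative policy (not the paper's): raise the action level until the
    first over-threshold signal, then repeat the last known-safe action."""
    total, safe, exploring = 0.0, None, True
    for t in range(horizon):
        if not user.active:
            break
        a = levels[min(t, len(levels) - 1)] if exploring else safe
        r = user.respond(a)
        total += r
        if r > 0:
            safe = a                  # below the threshold: remember it
        else:
            exploring = False         # above the threshold: stop probing
            if safe is None:
                break                 # even the smallest action was too high
    return total


user = SequentialChoiceUser(threshold=0.6, patience=3)
levels = [round(0.1 * i, 1) for i in range(1, 11)]
print(cautious_ladder(user, levels, horizon=20))
```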
Classic contextual bandit algorithms for linear models, such as LinUCB, assume that the reward distribution of an arm is modeled by a stationary linear regression. When the linear regression model is non-stationary over time, the regret of LinUCB can scale linearly with time. In this paper, we propose a novel multiscale changepoint detection method for non-stationary linear bandit problems, called Multiscale-LinUCB, which actively adapts to the changing environment. We also provide a theoretical regret bound for the Multiscale-LinUCB algorithm. Experimental results show that the proposed Multiscale-LinUCB algorithm outperforms other state-of-the-art algorithms in non-stationary contextual environments.
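To illustrate the restart mechanism that such methods rely on, here is a simplified single-window sketch: LinUCB whose regression statistics are reset when recent rewards drift away from the model's predictions. The residual test, its window, and its threshold are assumptions for illustration; the paper's detector operates at multiple scales.

```python
import numpy as np
from collections import deque

class ChangepointLinUCB:
    """LinUCB with a (simplified) residual-based change detector: if recent
    rewards deviate too much from the current model's predictions, the
    regression statistics are reset and learning restarts."""

    def __init__(self, d, alpha=1.0, reg=1.0, window=50, drift_threshold=0.5):
        self.d, self.alpha, self.reg = d, alpha, reg
        self.window, self.drift_threshold = window, drift_threshold
        self._reset()

    def _reset(self):
        self.A = self.reg * np.eye(self.d)   # ridge design matrix
        self.b = np.zeros(self.d)
        self.residuals = deque(maxlen=self.window)

    def select(self, contexts):
        """contexts: (n_arms, d) feature matrix for the current round."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", contexts, A_inv, contexts))
        return int(np.argmax(contexts @ theta + self.alpha * bonus))

    def update(self, x, reward):
        theta = np.linalg.solve(self.A, self.b)
        self.residuals.append(abs(reward - x @ theta))
        self.A += np.outer(x, x)
        self.b += reward * x
        # Restart if the model has clearly stopped tracking the rewards.
        if (len(self.residuals) == self.window
                and np.mean(self.residuals) > self.drift_threshold):
            self._reset()
```

A single fixed window trades detection delay against false alarms through one threshold; operating at multiple scales, as the abstract describes, is what lets the detector adapt to changes of different magnitudes.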