No Arabic abstract
We consider a stochastic bandit problem with a possibly infinite number of arms. We write $p^*$ for the proportion of optimal arms and $Delta$ for the minimal mean-gap between optimal and sub-optimal arms. We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting in terms of the problem parameters $T$ (the budget), $p^*$ and $Delta$. For the objective of minimizing the cumulative regret, we provide a lower bound of order $Omega(log(T)/(p^*Delta))$ and a UCB-style algorithm with matching upper bound up to a factor of $log(1/Delta)$. Our algorithm needs $p^*$ to calibrate its parameters, and we prove that this knowledge is necessary, since adapting to $p^*$ in this setting is impossible. For best-arm identification we also provide a lower bound of order $Omega(exp(-cTDelta^2p^*))$ on the probability of outputting a sub-optimal arm where $c>0$ is an absolute constant. We also provide an elimination algorithm with an upper bound matching the lower bound up to a factor of order $log(1/Delta)$ in the exponential, and that does not need $p^*$ or $Delta$ as parameter.
We study a variant of the classical multi-armed bandit problem (MABP) which we call as Multi-Armed Bandits with dependent arms. More specifically, multiple arms are grouped together to form a cluster, and the reward distributions of arms belonging to the same cluster are known functions of an unknown parameter that is a characteristic of the cluster. Thus, pulling an arm $i$ not only reveals information about its own reward distribution, but also about all those arms that share the same cluster with arm $i$. This correlation amongst the arms complicates the exploration-exploitation trade-off that is encountered in the MABP because the observation dependencies allow us to test simultaneously multiple hypotheses regarding the optimality of an arm. We develop learning algorithms based on the UCB principle which utilize these additional side observations appropriately while performing exploration-exploitation trade-off. We show that the regret of our algorithms grows as $O(Klog T)$, where $K$ is the number of clusters. In contrast, for an algorithm such as the vanilla UCB that is optimal for the classical MABP and does not utilize these dependencies, the regret scales as $O(Mlog T)$ where $M$ is the number of arms.
We propose a new algorithm for adversarial multi-armed bandits with unrestricted delays. The algorithm is based on a novel hybrid regularizer applied in the Follow the Regularized Leader (FTRL) framework. It achieves $mathcal{O}(sqrt{kn}+sqrt{Dlog(k)})$ regret guarantee, where $k$ is the number of arms, $n$ is the number of rounds, and $D$ is the total delay. The result matches the lower bound within constants and requires no prior knowledge of $n$ or $D$. Additionally, we propose a refined tuning of the algorithm, which achieves $mathcal{O}(sqrt{kn}+min_{S}|S|+sqrt{D_{bar S}log(k)})$ regret guarantee, where $S$ is a set of rounds excluded from delay counting, $bar S = [n]setminus S$ are the counted rounds, and $D_{bar S}$ is the total delay in the counted rounds. If the delays are highly unbalanced, the latter regret guarantee can be significantly tighter than the former. The result requires no advance knowledge of the delays and resolves an open problem of Thune et al. (2019). The new FTRL algorithm and its refined tuning are anytime and require no doubling, which resolves another open problem of Thune et al. (2019).
We study the structure of regret-minimizing policies in the many-armed Bayesian multi-armed bandit problem: in particular, with k the number of arms and T the time horizon, we consider the case where k > sqrt{T}. We first show that subsampling is a critical step for designing optimal policies. In particular, the standard UCB algorithm leads to sub-optimal regret bounds in this regime. However, a subsampled UCB (SS-UCB), which samples sqrt{T} arms and executes UCB only on that subset, is rate-optimal. Despite theoretically optimal regret, even SS-UCB performs poorly due to excessive exploration of suboptimal arms. In fact, in numerical experiments SS-UCB performs worse than a simple greedy algorithm (and its subsampled version) that pulls the current empirical best arm at every time period. We show that these insights hold even in a contextual setting, using real-world data. These empirical results suggest a novel form of free exploration in the many-armed regime that benefits greedy algorithms. We theoretically study this new source of free exploration and find that it is deeply connected to the distribution of a certain tail event for the prior distribution of arm rewards. This is a fundamentally distinct phenomenon from free exploration as discussed in the recent literature on contextual bandits, where free exploration arises due to variation in contexts. We prove that the subsampled greedy algorithm is rate-optimal for Bernoulli bandits when k > sqrt{T}, and achieves sublinear regret with more general distributions. This is a case where theoretical rate optimality does not tell the whole story: when complemented by the empirical observations of our paper, the power of greedy algorithms becomes quite evident. Taken together, from a practical standpoint, our results suggest that in applications it may be preferable to use a variant of the greedy algorithm in the many-armed regime.
Cascading bandit (CB) is a popular model for web search and online advertising, where an agent aims to learn the $K$ most attractive items out of a ground set of size $L$ during the interaction with a user. However, the stationary CB model may be too simple to apply to real-world problems, where user preferences may change over time. Considering piecewise-stationary environments, two efficient algorithms, texttt{GLRT-CascadeUCB} and texttt{GLRT-CascadeKL-UCB}, are developed and shown to ensure regret upper bounds on the order of $mathcal{O}(sqrt{NLTlog{T}})$, where $N$ is the number of piecewise-stationary segments, and $T$ is the number of time slots. At the crux of the proposed algorithms is an almost parameter-free change-point detector, the generalized likelihood ratio test (GLRT). Comparing with existing works, the GLRT-based algorithms: i) are free of change-point-dependent information for choosing parameters; ii) have fewer tuning parameters; iii) improve at least the $L$ dependence in regret upper bounds. In addition, we show that the proposed algorithms are optimal (up to a logarithm factor) in terms of regret by deriving a minimax lower bound on the order of $Omega(sqrt{NLT})$ for piecewise-stationary CB. The efficiency of the proposed algorithms relative to state-of-the-art approaches is validated through numerical experiments on both synthetic and real-world datasets.
We study the problem of estimating the expected reward of the optimal policy in the stochastic disjoint linear bandit setting. We prove that for certain settings it is possible to obtain an accurate estimate of the optimal policy value even with a number of samples that is sublinear in the number that would be required to emph{find} a policy that realizes a value close to this optima. We establish nearly matching information theoretic lower bounds, showing that our algorithm achieves near optimal estimation error. Finally, we demonstrate the effectiveness of our algorithm on joke recommendation and cancer inhibition dosage selection problems using real datasets.