
The Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms

Published by Khashayar Khosravi
Publication date: 2020
Paper language: English





We study the structure of regret-minimizing policies in the many-armed Bayesian multi-armed bandit problem: in particular, with $k$ the number of arms and $T$ the time horizon, we consider the case where $k > \sqrt{T}$. We first show that subsampling is a critical step for designing optimal policies. In particular, the standard UCB algorithm leads to sub-optimal regret bounds in this regime, whereas a subsampled UCB (SS-UCB), which samples $\sqrt{T}$ arms and executes UCB only on that subset, is rate-optimal. Despite its theoretically optimal regret, however, even SS-UCB performs poorly in practice due to excessive exploration of suboptimal arms. In fact, in numerical experiments SS-UCB performs worse than a simple greedy algorithm (and its subsampled version) that pulls the current empirical best arm at every time period. Using real-world data, we show that these insights hold even in a contextual setting. These empirical results suggest a novel form of free exploration in the many-armed regime that benefits greedy algorithms. We theoretically study this new source of free exploration and find that it is deeply connected to the probability of a certain tail event under the prior distribution of arm rewards. This is a fundamentally distinct phenomenon from the free exploration discussed in the recent literature on contextual bandits, where free exploration arises from variation in contexts. We prove that the subsampled greedy algorithm is rate-optimal for Bernoulli bandits when $k > \sqrt{T}$, and achieves sublinear regret for more general reward distributions. This is a case where theoretical rate optimality does not tell the whole story: when complemented by the empirical observations of our paper, the power of greedy algorithms becomes quite evident. Taken together, from a practical standpoint, our results suggest that in applications it may be preferable to use a variant of the greedy algorithm in the many-armed regime.
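To make the subsampled greedy (SS-Greedy) idea concrete, here is a minimal NumPy sketch for Bernoulli arms. The subsample size $\lceil\sqrt{T}\rceil$, the uniform prior on means, and the one-pull-per-arm initialization are illustrative assumptions on our part, not the paper's exact specification.

```python
import numpy as np

def ss_greedy(means, T, rng):
    """Subsampled Greedy sketch: keep ~sqrt(T) of the k arms, pull each once
    to initialize, then always pull the current empirical best arm.
    Details (subsample size, initialization) are illustrative choices."""
    k = len(means)
    m = min(k, int(np.ceil(np.sqrt(T))))            # subsample size ~ sqrt(T)
    arms = rng.choice(k, size=m, replace=False)     # random subsample of arms
    pulls, sums, regret = np.zeros(m), np.zeros(m), 0.0
    best = means.max()
    for t in range(T):
        i = t if t < m else int(np.argmax(sums / pulls))  # greedy after init
        r = rng.binomial(1, means[arms[i]])         # Bernoulli reward
        pulls[i] += 1
        sums[i] += r
        regret += best - means[arms[i]]
    return regret

rng = np.random.default_rng(0)
T, k = 10_000, 500                                  # many-armed regime: k > sqrt(T)
arm_means = rng.uniform(0, 1, size=k)               # means drawn from a uniform prior
print("SS-Greedy cumulative regret:", ss_greedy(arm_means, T, rng))
```

Swapping the greedy selection for a UCB index on the same subsample yields SS-UCB, which makes the exploration overhead described above directly comparable in simulation.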

Read also

One of the key drivers of complexity in the classical (stochastic) multi-armed bandit (MAB) problem is the difference between mean rewards in the top two arms, also known as the instance gap. The celebrated Upper Confidence Bound (UCB) policy is among the simplest optimism-based MAB algorithms that naturally adapts to this gap: for a horizon of play $n$, it achieves optimal $O(\log n)$ regret in instances with large gaps, and a near-optimal $O(\sqrt{n \log n})$ minimax regret when the gap can be arbitrarily small. This paper provides new results on the arm-sampling behavior of UCB, leading to several important insights. Among these, it is shown that arm-sampling rates under UCB are asymptotically deterministic, regardless of the problem complexity. This discovery facilitates new sharp asymptotics and a novel alternative proof for the $O(\sqrt{n \log n})$ minimax regret of UCB. Furthermore, the paper also provides the first complete process-level characterization of the MAB problem under UCB in the conventional diffusion scaling. Among other things, the small-gap worst-case lens adopted in this paper also reveals profound distinctions between the behavior of UCB and Thompson Sampling, such as an incomplete learning phenomenon characteristic of the latter.
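The sketch below is an illustrative way (not the paper's analysis) to observe this stabilization empirically: it runs the standard UCB1 index on a two-armed Bernoulli instance with a small gap and reports each arm's share of pulls, which can be watched concentrating as the horizon $n$ grows.

```python
import numpy as np

def ucb1_pull_fractions(means, n, rng):
    """Run the standard UCB1 index on Bernoulli arms and return each arm's
    share of the n pulls; rerunning with growing n illustrates (but does not
    prove) that these shares stabilize, i.e. become asymptotically deterministic."""
    k = len(means)
    pulls, sums = np.zeros(k), np.zeros(k)
    for t in range(n):
        if t < k:
            i = t                                   # initialize: pull each arm once
        else:
            ucb = sums / pulls + np.sqrt(2 * np.log(t) / pulls)
            i = int(np.argmax(ucb))                 # optimism: largest index wins
        sums[i] += rng.binomial(1, means[i])
        pulls[i] += 1
    return pulls / n

rng = np.random.default_rng(1)
for n in (10_000, 100_000):                         # small instance gap of 0.02
    print(n, ucb1_pull_fractions(np.array([0.50, 0.48]), n, rng))
```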
Rahul Singh, Fang Liu, Yin Sun (2020)
We study a variant of the classical multi-armed bandit problem (MABP) which we call Multi-Armed Bandits with dependent arms. More specifically, multiple arms are grouped together to form a cluster, and the reward distributions of arms belonging to the same cluster are known functions of an unknown parameter that is a characteristic of the cluster. Thus, pulling an arm $i$ not only reveals information about its own reward distribution, but also about all those arms that share the same cluster with arm $i$. This correlation amongst the arms complicates the exploration-exploitation trade-off encountered in the MABP, because the observation dependencies allow us to simultaneously test multiple hypotheses regarding the optimality of an arm. We develop learning algorithms based on the UCB principle which appropriately utilize these additional side observations while performing the exploration-exploitation trade-off. We show that the regret of our algorithms grows as $O(K \log T)$, where $K$ is the number of clusters. In contrast, for an algorithm such as vanilla UCB, which is optimal for the classical MABP and does not utilize these dependencies, the regret scales as $O(M \log T)$, where $M$ is the number of arms.
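To illustrate why regret can scale with the number of clusters $K$ rather than the number of arms $M$, here is a simplified sketch in which, as an assumption of ours (stronger than the paper's known-function model), all arms in a cluster share one Bernoulli mean, so any pull updates a shared cluster-level estimate.

```python
import numpy as np

def cluster_ucb(arm_cluster, cluster_means, T, rng):
    """UCB with cluster-level sharing: a pull of any arm updates the estimate
    for its entire cluster, so exploration cost scales with K clusters rather
    than M arms. Simplifying assumption (ours, stronger than the paper's
    known-function model): all arms in a cluster share one Bernoulli mean."""
    K = len(cluster_means)
    pulls, sums = np.zeros(K), np.zeros(K)          # cluster-level statistics
    total = 0.0
    for t in range(T):
        if t < K:
            c = t                                   # visit each cluster once
        else:
            ucb = sums / pulls + np.sqrt(2 * np.log(t) / pulls)
            c = int(np.argmax(ucb))
        arm = rng.choice(np.flatnonzero(arm_cluster == c))  # any arm in cluster c
        r = rng.binomial(1, cluster_means[c])       # reward informs the shared parameter
        sums[c] += r
        pulls[c] += 1
        total += r
    return total

rng = np.random.default_rng(2)
M, K = 100, 5                                       # many arms, few clusters
arm_cluster = np.repeat(np.arange(K), M // K)       # arm -> cluster map
print(cluster_ucb(arm_cluster, rng.uniform(0, 1, K), T=50_000, rng=rng))
```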
In this paper we propose a multi-armed bandit inspired, pool-based active learning algorithm for the problem of binary classification. By carefully constructing an analogy between active learning and multi-armed bandits, we utilize ideas such as lower confidence bounds and self-concordant regularization from the multi-armed bandit literature to design our proposed algorithm. Our algorithm is a sequential algorithm, which in each round assigns a sampling distribution on the pool, samples one point from this distribution, and queries the oracle for the label of this sampled point. The design of this sampling distribution is also inspired by the analogy between active learning and multi-armed bandits. We show how to derive the lower confidence bounds required by our algorithm. Experimental comparisons to previously proposed active learning algorithms show superior performance on some standard UCI datasets.
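As a rough, non-authoritative sketch of the round structure just described (assign a sampling distribution on the pool, sample one point, query the oracle), the snippet below substitutes a simple uncertainty-weighted distribution for the paper's lower-confidence-bound and self-concordant-regularization construction; all names and the heuristic weighting are our own illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def query_index(model, X_pool, labeled, rng):
    """One round of a pool-based active learner in the spirit described above:
    place a sampling distribution on the pool, sample one point from it, and
    return its index for the oracle to label. The weighting below (mass near
    the decision boundary) is a simple uncertainty heuristic standing in for
    the paper's confidence-bound construction."""
    p = model.predict_proba(X_pool)[:, 1]
    w = 1.0 - np.abs(2.0 * p - 1.0)                 # high weight near p = 1/2
    w[list(labeled)] = 0.0                          # never re-query labeled points
    return int(rng.choice(len(X_pool), p=w / w.sum()))

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))                       # synthetic unlabeled pool
y = (X[:, 0] + X[:, 1] > 0).astype(int)             # oracle's hidden labels
labeled = {int(np.flatnonzero(y == 0)[0]), int(np.flatnonzero(y == 1)[0])}
for _ in range(40):                                 # sequential querying rounds
    idx = sorted(labeled)
    model = LogisticRegression().fit(X[idx], y[idx])
    labeled.add(query_index(model, X, labeled, rng))
print("pool accuracy after 40 queries:", model.score(X, y))
```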
We consider a stochastic bandit problem with a possibly infinite number of arms. We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms. We characterize the optimal learning rates both in the cumulative regret setting and in the best-arm identification setting, in terms of the problem parameters $T$ (the budget), $p^*$, and $\Delta$. For the objective of minimizing the cumulative regret, we provide a lower bound of order $\Omega(\log(T)/(p^*\Delta))$ and a UCB-style algorithm with a matching upper bound up to a factor of $\log(1/\Delta)$. Our algorithm needs $p^*$ to calibrate its parameters, and we prove that this knowledge is necessary, since adapting to $p^*$ in this setting is impossible. For best-arm identification we also provide a lower bound of order $\Omega(\exp(-cT\Delta^2 p^*))$ on the probability of outputting a sub-optimal arm, where $c>0$ is an absolute constant. We also provide an elimination algorithm with an upper bound matching the lower bound up to a factor of order $\log(1/\Delta)$ in the exponential, and that does not need $p^*$ or $\Delta$ as a parameter.
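To convey why knowledge of $p^*$ matters for calibration, here is an illustrative sketch (not the paper's algorithm): drawing on the order of $\log(T)/p^*$ arms ensures the subsample contains an optimal arm with probability about $1 - 1/T$, after which a standard UCB1 routine runs on the subsample.

```python
import numpy as np

def subsample_then_ucb(draw_mean, p_star, T, rng):
    """Sketch: when a fraction p* of arms is optimal, drawing about
    log(T)/p* arms includes an optimal arm with probability ~ 1 - 1/T;
    UCB1 then runs on the subsample. The constant and the choice of UCB1
    are illustrative, not the paper's calibrated algorithm."""
    m = int(np.ceil(np.log(T) / p_star))            # calibration uses p*
    means = np.array([draw_mean(rng) for _ in range(m)])
    pulls, sums, reward = np.zeros(m), np.zeros(m), 0.0
    for t in range(T):
        if t < m:
            i = t                                   # pull each sampled arm once
        else:
            i = int(np.argmax(sums / pulls + np.sqrt(2 * np.log(t) / pulls)))
        r = rng.binomial(1, means[i])
        pulls[i] += 1
        sums[i] += r
        reward += r
    return reward

# Optimal arms (mean 0.9) occur with probability p* = 0.05; the rest have
# mean 0.6, so the gap is Delta = 0.3.
p_star, rng = 0.05, np.random.default_rng(4)
draw = lambda g: 0.9 if g.random() < p_star else 0.6
print(subsample_then_ucb(draw, p_star, T=20_000, rng=rng))
```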
Restless Multi-Armed Bandits (RMABs) have been popularly used to model limited resource allocation problems. Recently, they have been employed for health monitoring and intervention planning problems. However, existing approaches fail to account for the arrival of new patients and the departure of enrolled patients from a treatment program. To address this challenge, we formulate a streaming bandit (S-RMAB) framework, a generalization of RMABs where heterogeneous arms arrive and leave under possibly random streams. We propose a new and scalable approach to computing index-based solutions. We start by proving that index values decrease for short residual lifetimes, a phenomenon that we call index decay. We then provide algorithms designed to capture index decay without having to solve the costly finite-horizon problem, thereby lowering the computational complexity compared to existing methods. We evaluate our approach via simulations run on real-world data obtained from a tuberculosis intervention planning task, as well as multiple other synthetic domains. Our algorithms achieve an over 150x speed-up over existing methods in these tasks without loss in performance. These findings are robust across multiple domains.
