ترغب بنشر مسار تعليمي؟ اضغط هنا

Multi-Armed Bandits with Censored Consumption of Resources

67   0   0.0 ( 0 )
 نشر من قبل Viktor Bengs
 تاريخ النشر 2020
والبحث باللغة English




اسأل ChatGPT حول البحث

We consider a resource-aware variant of the classical multi-armed bandit problem: In each round, the learner selects an arm and determines a resource limit. It then observes a corresponding (random) reward, provided the (random) amount of consumed resources remains below the limit. Otherwise, the observation is censored, i.e., no reward is obtained. For this problem setting, we introduce a measure of regret, which incorporates the actual amount of allocated resources of each learning round as well as the optimality of realizable rewards. Thus, to minimize regret, the learner needs to set a resource limit and choose an arm in such a way that the chance to realize a high reward within the predefined resource limit is high, while the resource limit itself should be kept as low as possible. We derive the theoretical lower bound on the cumulative regret and propose a learning algorithm having a regret upper bound that matches the lower bound. In a simulation study, we show that our learning algorithm outperforms straightforward extensions of standard multi-armed bandit algorithms.



قيم البحث

اقرأ أيضاً

We introduce a new class of reinforcement learning methods referred to as {em episodic multi-armed bandits} (eMAB). In eMAB the learner proceeds in {em episodes}, each composed of several {em steps}, in which it chooses an action and observes a feedb ack signal. Moreover, in each step, it can take a special action, called the $stop$ action, that ends the current episode. After the $stop$ action is taken, the learner collects a terminal reward, and observes the costs and terminal rewards associated with each step of the episode. The goal of the learner is to maximize its cumulative gain (i.e., the terminal reward minus costs) over all episodes by learning to choose the best sequence of actions based on the feedback. First, we define an {em oracle} benchmark, which sequentially selects the actions that maximize the expected immediate gain. Then, we propose our online learning algorithm, named {em FeedBack Adaptive Learning} (FeedBAL), and prove that its regret with respect to the benchmark is bounded with high probability and increases logarithmically in expectation. Moreover, the regret only has polynomial dependence on the number of steps, actions and states. eMAB can be used to model applications that involve humans in the loop, ranging from personalized medical screening to personalized web-based education, where sequences of actions are taken in each episode, and optimal behavior requires adapting the chosen actions based on the feedback.
We propose an online algorithm for cumulative regret minimization in a stochastic multi-armed bandit. The algorithm adds $O(t)$ i.i.d. pseudo-rewards to its history in round $t$ and then pulls the arm with the highest average reward in its perturbed history. Therefore, we call it perturbed-history exploration (PHE). The pseudo-rewards are carefully designed to offset potentially underestimated mean rewards of arms with a high probability. We derive near-optimal gap-dependent and gap-free bounds on the $n$-round regret of PHE. The key step in our analysis is a novel argument that shows that randomized Bernoulli rewards lead to optimism. Finally, we empirically evaluate PHE and show that it is competitive with state-of-the-art baselines.
370 - Rahul Singh , Fang Liu , Yin Sun 2020
We study a variant of the classical multi-armed bandit problem (MABP) which we call as Multi-Armed Bandits with dependent arms. More specifically, multiple arms are grouped together to form a cluster, and the reward distributions of arms belonging to the same cluster are known functions of an unknown parameter that is a characteristic of the cluster. Thus, pulling an arm $i$ not only reveals information about its own reward distribution, but also about all those arms that share the same cluster with arm $i$. This correlation amongst the arms complicates the exploration-exploitation trade-off that is encountered in the MABP because the observation dependencies allow us to test simultaneously multiple hypotheses regarding the optimality of an arm. We develop learning algorithms based on the UCB principle which utilize these additional side observations appropriately while performing exploration-exploitation trade-off. We show that the regret of our algorithms grows as $O(Klog T)$, where $K$ is the number of clusters. In contrast, for an algorithm such as the vanilla UCB that is optimal for the classical MABP and does not utilize these dependencies, the regret scales as $O(Mlog T)$ where $M$ is the number of arms.
We study incentivized exploration for the multi-armed bandit (MAB) problem where the players receive compensation for exploring arms other than the greedy choice and may provide biased feedback on reward. We seek to understand the impact of this drif ted reward feedback by analyzing the performance of three instantiations of the incentivized MAB algorithm: UCB, $varepsilon$-Greedy, and Thompson Sampling. Our results show that they all achieve $mathcal{O}(log T)$ regret and compensation under the drifted reward, and are therefore effective in incentivizing exploration. Numerical examples are provided to complement the theoretical analysis.
We consider the problem where $N$ agents collaboratively interact with an instance of a stochastic $K$ arm bandit problem for $K gg N$. The agents aim to simultaneously minimize the cumulative regret over all the agents for a total of $T$ time steps, the number of communication rounds, and the number of bits in each communication round. We present Limited Communication Collaboration - Upper Confidence Bound (LCC-UCB), a doubling-epoch based algorithm where each agent communicates only after the end of the epoch and shares the index of the best arm it knows. With our algorithm, LCC-UCB, each agent enjoys a regret of $tilde{O}left(sqrt{({K/N}+ N)T}right)$, communicates for $O(log T)$ steps and broadcasts $O(log K)$ bits in each communication step. We extend the work to sparse graphs with maximum degree $K_G$, and diameter $D$ and propose LCC-UCB-GRAPH which enjoys a regret bound of $tilde{O}left(Dsqrt{(K/N+ K_G)DT}right)$. Finally, we empirically show that the LCC-UCB and the LCC-UCB-GRAPH algorithm perform well and outperform strategies that communicate through a central node

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا