Continuous Time Bandits With Sampling Costs

Published by Manjesh Kumar Hanawal

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Authors: Rahul Vaze - Manjesh K. Hanawal

Abstract

We consider a continuous-time multi-arm bandit problem (CTMAB), where the learner can sample arms any number of times in a given interval and obtains a random reward from each sample; however, increasing the sampling frequency incurs an additive penalty/cost. Thus, there is a tradeoff between obtaining a large reward and incurring sampling cost as a function of the sampling frequency. The goal is to design a learning algorithm that minimizes regret, defined as the difference between the payoff of the oracle policy and that of the learning algorithm. CTMAB is fundamentally different from the usual multi-arm bandit problem (MAB); e.g., even the single-arm case is non-trivial in CTMAB, since the optimal sampling frequency depends on the mean of the arm, which needs to be estimated. We first establish lower bounds on the regret achievable with any algorithm and then propose algorithms that achieve the lower bound up to logarithmic factors. For the single-arm case, we show that the lower bound on the regret is $\Omega((\log T)^2/\mu)$, where $\mu$ is the mean of the arm and $T$ is the time horizon. For the multiple-arm case, we show that the lower bound on the regret is $\Omega((\log T)^2 \mu/\Delta^2)$, where $\mu$ now represents the mean of the best arm and $\Delta$ is the difference between the means of the best and the second-best arm. We then propose an algorithm that achieves the bound up to constant terms.
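The remark that the optimal sampling frequency depends on the arm's mean can be illustrated with a toy single-arm model. The abstract does not state the form of the sampling cost, so the sketch below assumes a quadratic cost: taking N samples over a horizon T costs `lam * N**2 / T`. Under that assumption the expected payoff is `mu * N - lam * N**2 / T`, which is maximized at `N = mu * T / (2 * lam)`. The function names and the parameter `lam` are hypothetical and chosen only for illustration.

```python
def expected_payoff(n_samples, mu, horizon, lam=1.0):
    """Expected payoff of taking n_samples from an arm with mean mu.

    ASSUMPTION: the sampling cost is quadratic, lam * n_samples**2 / horizon.
    The abstract does not specify the cost function.
    """
    return mu * n_samples - lam * n_samples ** 2 / horizon


def oracle_num_samples(mu, horizon, lam=1.0):
    """Maximizer of expected_payoff in n_samples (first-order condition)."""
    return mu * horizon / (2.0 * lam)


def plug_in_regret(mu, mu_hat, horizon, lam=1.0):
    """Payoff lost by sampling as if the mean were mu_hat instead of mu."""
    n_star = oracle_num_samples(mu, horizon, lam)
    n_hat = oracle_num_samples(mu_hat, horizon, lam)
    return (expected_payoff(n_star, mu, horizon, lam)
            - expected_payoff(n_hat, mu, horizon, lam))


if __name__ == "__main__":
    # An estimate that is off by 0.1 costs payoff even with a single arm.
    print(oracle_num_samples(0.5, 100))
    print(plug_in_regret(0.5, 0.4, 100))
```

In this toy model the loss from a misestimated mean grows with the squared estimation error, which gives some intuition for why even the single-arm case requires learning.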

Read also

Thompson Sampling for Unimodal Bandits

120 - Long Yang, Zhao Li, Zehong Hu 2021

In this paper, we propose a Thompson Sampling algorithm for \emph{unimodal} bandits, where the expected reward is unimodal over the partially ordered arms. To better exploit the unimodal structure, at each step, instead of exploring the entire decision space, our algorithm makes its decision according to the posterior distribution only in the neighborhood of the arm that has the highest empirical mean estimate. We theoretically prove that, for Bernoulli rewards, the regret of our algorithm reaches the lower bound of unimodal bandits; thus it is asymptotically optimal. For Gaussian rewards, the regret of our algorithm is $\mathcal{O}(\log T)$, which is far better than standard Thompson Sampling algorithms. Extensive experiments demonstrate the effectiveness of the proposed algorithm on both synthetic data sets and real-world applications.
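The idea of restricting posterior sampling to the neighborhood of the empirically best arm can be sketched as follows. This is a simplified illustration, not the authors' exact algorithm: it assumes Bernoulli rewards, arms arranged on a line so that each arm's neighbors are the adjacent indices, a uniform Beta(1, 1) prior, and one initial pull per arm. The function name and parameters are hypothetical.

```python
import random


def neighborhood_thompson(means, horizon, seed=0):
    """Thompson Sampling restricted to the neighbors of the empirical leader.

    ASSUMPTIONS: Bernoulli arms on a line graph, Beta(1, 1) priors,
    one initial pull per arm. Returns the number of pulls of each arm.
    """
    rng = random.Random(seed)
    k = len(means)
    successes = [0] * k
    pulls = [0] * k

    def pull(arm):
        reward = 1 if rng.random() < means[arm] else 0
        successes[arm] += reward
        pulls[arm] += 1

    # Pull every arm once so that empirical means are defined.
    for arm in range(k):
        pull(arm)

    for _ in range(horizon - k):
        # Leader: arm with the highest empirical mean.
        leader = max(range(k), key=lambda a: successes[a] / pulls[a])
        # Only the leader and its adjacent arms are candidates.
        candidates = [a for a in (leader - 1, leader, leader + 1) if 0 <= a < k]
        # Draw one posterior sample per candidate and play the largest.
        draws = {
            a: rng.betavariate(successes[a] + 1, pulls[a] - successes[a] + 1)
            for a in candidates
        }
        pull(max(draws, key=draws.get))

    return pulls


if __name__ == "__main__":
    # Unimodal means with the peak at index 2.
    print(neighborhood_thompson([0.2, 0.4, 0.8, 0.5, 0.3], horizon=2000))
```

Because only up to three arms are sampled per step, arms far from the current leader are never explored directly; the unimodal structure is what lets the leader move step by step toward the peak.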

Machine learning, Artificial intelligence

Bandits with Feedback Graphs and Switching Costs

112 - Raman Arora, Teodor V. Marinov, Mehryar Mohri 2019

We study the adversarial multi-armed bandit problem where partial observations are available and where, in addition to the loss incurred for each action, a \emph{switching cost} is incurred for shifting to a new action. All previously known results incur a factor proportional to the independence number of the feedback graph. We give a new algorithm whose regret guarantee depends only on the domination number of the graph. We further supplement that result with a lower bound. Finally, we also give a new algorithm with improved policy regret bounds when partial counterfactual feedback is available.

Machine learning

Thompson Sampling for Contextual Bandits with Linear Payoffs

233 - Shipra Agrawal, Navin Goyal 2012

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated that it has better empirical performance than the state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of the Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studi

Machine learning, Data structures and algorithms

Sampling with Costs

265 - Joseph D. Skufca, Daniel ben-Avraham 2015

We consider the problem of choosing the best of $n$ samples, out of a large random pool, when the sampling of each member is associated with a certain cost. The quality (worth) of the best sample clearly increases with $n$, but so do the sampling costs, and one important question is how many to sample for optimal gain (worth minus costs). If, in addition, the assessment of worth for each sample is associated with some measurement error, the perceived best out of $n$ might not be the actual best, complicating the issue. Situations like this are typical in mate selection, job hiring, and food foraging, to name just a few. We tackle the problem by standard order statistics, yielding suggestions for optimal strategies, as well as some unexpected insights.
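The tradeoff between the worth of the best sample and the total sampling cost can be made concrete in a toy case. The sketch below assumes, purely for illustration, that worth is uniform on [0, 1], that each sample costs a fixed amount, and that there is no measurement error. For uniform worth the expected maximum of $n$ samples is $n/(n+1)$, so the expected gain is $n/(n+1)$ minus the cost per sample times $n$. The function names are hypothetical.

```python
def expected_gain(n, cost_per_sample):
    """Expected worth of the best of n samples minus total sampling cost.

    ASSUMPTIONS: worth is uniform on [0, 1] (so E[max of n] = n / (n + 1)),
    the cost per sample is constant, and worth is measured without error.
    """
    return n / (n + 1) - cost_per_sample * n


def best_sample_count(cost_per_sample, n_max=10_000):
    """Integer n in [0, n_max] that maximizes the expected gain."""
    return max(range(n_max + 1), key=lambda n: expected_gain(n, cost_per_sample))


if __name__ == "__main__":
    for cost in (0.1, 0.01, 0.001):
        n = best_sample_count(cost)
        print(cost, n, expected_gain(n, cost))
```

In this toy case the continuous optimum satisfies $(n+1)^2 = 1/c$ for a per-sample cost $c$, so the best sample count grows only like $1/\sqrt{c}$ as sampling gets cheaper.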

Statistics theory

Bandits with Partially Observable Confounded Data

64 - Guy Tennenholtz, Uri Shalit, Shie Mannor 2020

We study linear contextual bandits with access to a large, confounded, offline dataset that was sampled from some fixed policy. We show that this problem is closely related to a variant of the bandit problem with side information. We construct a linear bandit algorithm that takes advantage of the projected information, and prove regret bounds. Our results demonstrate the ability to take advantage of confounded offline data. In particular, we prove regret bounds that improve current bounds by a factor related to the visible dimensionality of the contexts in the data. Our results indicate that confounded offline data can significantly improve online learning algorithms. Finally, we demonstrate various characteristics of our approach through synthetic simulations.

Machine learning, Artificial intelligence