Rate-adaptive model selection over a collection of black-box contextual bandit algorithms

Published by Aurélien Bibaut
Publication date: 2020
Language: English





We consider the model selection task in the stochastic contextual bandit setting. Suppose we are given a collection of base contextual bandit algorithms. We provide a master algorithm that combines them and achieves the same performance, up to constants, as the best base algorithm would, if it had been run on its own. Our approach only requires that each algorithm satisfy a high probability regret bound. Our procedure is very simple and essentially does the following: for a well chosen sequence of probabilities $(p_{t})_{t \geq 1}$, at each round $t$, it either chooses at random which candidate to follow (with probability $p_{t}$) or compares, at the same internal sample size for each candidate, the cumulative reward of each, and selects the one that wins the comparison (with probability $1-p_{t}$). To the best of our knowledge, our proposal is the first one to be rate-adaptive for a collection of general black-box contextual bandit algorithms: it achieves the same regret rate as the best candidate. We demonstrate the effectiveness of our method with simulation studies.
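
The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `select_candidate`, the representation of each candidate's history as a list of rewards, and the uniform random choice during exploration are assumptions made for illustration, and the abstract does not specify the sequence $(p_{t})$, so `p_t` is left as a parameter.

```python
import random


def select_candidate(reward_histories, p_t, rng):
    """Pick which base algorithm to follow in the current round.

    reward_histories[i] is the list of rewards candidate i has collected in
    the rounds where it was followed (a hypothetical representation).
    p_t is the exploration probability for this round; its schedule is not
    given in the abstract, so the caller must supply it.
    """
    k = len(reward_histories)
    # Common internal sample size: the number of rounds that every
    # candidate has completed so far.
    n = min(len(history) for history in reward_histories)
    # Explore: follow a candidate chosen at random (uniformly here, which is
    # an assumption). Also explore if some candidate has no data yet.
    if n == 0 or rng.random() < p_t:
        return rng.randrange(k)
    # Exploit: compare cumulative rewards over the first n internal rounds
    # of each candidate and follow the winner.
    totals = [sum(history[:n]) for history in reward_histories]
    return max(range(k), key=lambda i: totals[i])
```

In a full loop, the chosen candidate would then pick an action for the current context, observe the reward, and append it to its own history; only that candidate's internal sample size grows in the round.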


Read also

Kun Wang, Canzhe Zhao, Shuai Li (2021)
A conservative mechanism is a desirable property in decision-making problems that balance the tradeoff between exploration and exploitation. We propose the novel conservative contextual combinatorial cascading bandit ($C^4$-bandit), a cascading online learning game which incorporates the conservative mechanism. At each time step, the learning agent is given some contexts, has to recommend a list of items that is no worse than the base strategy, and then observes the reward according to some stopping rules. We design the $C^4$-UCB algorithm to solve the problem and prove its $n$-step upper regret bound for two situations: known baseline reward and unknown baseline reward. The regret in both situations can be decomposed into two terms: (a) the upper bound for the general contextual combinatorial cascading bandit; and (b) a constant term for the regret from the conservative mechanism. We also improve the bound of the conservative contextual combinatorial bandit as a by-product. Experiments on synthetic data demonstrate its advantages and validate our theoretical analysis.
Precision oncology, the genetic sequencing of tumors to identify druggable targets, has emerged as the standard of care in the treatment of many cancers. Nonetheless, due to the pace of therapy development and variability in patient information, designing effective protocols for individual treatment assignment in a sample-efficient way remains a major challenge. One promising approach to this problem is to frame precision oncology treatment as a contextual bandit problem and to apply sequential decision-making algorithms designed to minimize regret in this setting. However, a clear prerequisite for considering this methodology in high-stakes clinical decisions is careful benchmarking to understand realistic costs and benefits. Here, we propose a benchmark dataset to evaluate contextual bandit algorithms based on real in vitro drug response of approximately 900 cancer cell lines. Specifically, we curated a dataset of complete treatment responses for a subset of 7 treatments from prior in vitro studies. This allows us to compute the regret of proposed decision policies using biologically plausible counterfactuals. We ran a suite of Bayesian bandit algorithms on our benchmark, and found that the methods accumulate less regret over a sequence of treatment assignment tasks than a rule-based baseline derived from current clinical practice. This effect was more pronounced when genomic information was included as context. We expect this work to be a starting point for evaluation of both the unique structural requirements and ethical implications for real-world testing of bandit-based clinical decision support.
We study the problem of corralling stochastic bandit algorithms, that is, combining multiple bandit algorithms designed for a stochastic environment, with the goal of devising a corralling algorithm that performs almost as well as the best base algorithm. We give two general algorithms for this setting, which we show benefit from favorable regret guarantees. We show that the regret of the corralling algorithms is no worse than that of the best algorithm containing the arm with the highest reward, and depends on the gap between the highest reward and other rewards.
In this work, we describe practical lessons we have learned from successfully using contextual bandits (CBs) to improve key business metrics of the Microsoft Virtual Agent for customer support. While our current use cases focus on single-step reinforcement learning (RL), mostly in the domain of natural language processing and information retrieval, we believe many of our findings are generally applicable. Through this article, we highlight certain issues that RL practitioners may encounter in similar types of applications and offer practical solutions to these challenges.
In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic set optimization problem, where in every round a decision maker offers a subset (assortment) of products to a consumer, and observes their response. Consumers purchase products so as to maximize their utility. We assume that the products are described by a set of attributes and the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior by means of the widely used Multinomial Logit (MNL) model, and consider the decision maker's problem of dynamically learning the model parameters, while optimizing cumulative revenue over the selling horizon $T$. Though this problem has attracted considerable attention in recent times, many existing methods often involve solving an intractable non-convex optimization problem and their theoretical performance guarantees depend on a problem dependent parameter which could be prohibitively large. In particular, existing algorithms for this problem have regret bounded by $O(\sqrt{\kappa d T})$, where $\kappa$ is a problem dependent constant that can have exponential dependency on the number of attributes. In this paper, we propose an optimistic algorithm and show that the regret is bounded by $O(\sqrt{dT} + \kappa)$, significantly improving the performance over existing methods. Further, we propose a convex relaxation of the optimization step which allows for tractable decision-making while retaining the favourable regret guarantee.
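
As a rough illustration of the choice model mentioned in the last abstract, the following Python sketch computes purchase probabilities under a Multinomial Logit model with utilities linear in product attributes. The function name `mnl_choice_probabilities`, the parameter vector `theta`, and the no-purchase option with utility fixed at zero are assumptions for illustration; the abstract itself does not state these details.

```python
import math


def mnl_choice_probabilities(features, theta):
    """Return (purchase probabilities, no-purchase probability).

    features[i] is the attribute vector of the i-th offered product and
    theta is the (hypothetical) parameter vector of the linear utility.
    A no-purchase option with utility 0 is assumed.
    """
    # Mean utility of each offered product: inner product of attributes
    # and parameters.
    utilities = [sum(x * w for x, w in zip(row, theta)) for row in features]
    weights = [math.exp(u) for u in utilities]
    # The constant 1.0 is exp(0), the weight of the assumed no-purchase option.
    denom = 1.0 + sum(weights)
    return [w / denom for w in weights], 1.0 / denom
```

The returned probabilities, together with the no-purchase probability, sum to one for any offered assortment.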
