ترغب بنشر مسار تعليمي؟ اضغط هنا

KL-UCB-switch: optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints

125   0   0.0 ( 0 )
 نشر من قبل Gilles Stoltz
 تاريخ النشر 2018
والبحث باللغة English




اسأل ChatGPT حول البحث

In the context of K-armed stochastic bandits with distribution only assumed to be supported by [0,1], we introduce the first algorithm, called KL-UCB-switch, that enjoys simultaneously a distribution-free regret bound of optimal order $sqrt{KT}$ and a distribution-dependent regret bound of optimal order as well, that is, matching the $kappaln T$ lower bound by Lai & Robbins (1985) and Burnetas & Katehakis (1996). This self-contained contribution simultaneously presents state-of-the-art techniques for regret minimization in bandit models, and an elementary construction of non-asymptotic confidence bounds based on the empirical likelihood method for bounded distributions.



قيم البحث

اقرأ أيضاً

The contextual combinatorial semi-bandit problem with linear payoff functions is a decision-making problem in which a learner chooses a set of arms with the feature vectors in each round under given constraints so as to maximize the sum of rewards of arms. Several existing algorithms have regret bounds that are optimal with respect to the number of rounds $T$. However, there is a gap of $tilde{O}(max(sqrt{d}, sqrt{k}))$ between the current best upper and lower bounds, where $d$ is the dimension of the feature vectors, $k$ is the number of the chosen arms in a round, and $tilde{O}(cdot)$ ignores the logarithmic factors. The dependence of $k$ and $d$ is of practical importance because $k$ may be larger than $T$ in real-world applications such as recommender systems. In this paper, we fill the gap by improving the upper and lower bounds. More precisely, we show that the C${}^2$UCB algorithm proposed by Qin, Chen, and Zhu (2014) has the optimal regret bound $tilde{O}(dsqrt{kT} + dk)$ for the partition matroid constraints. For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C${}^2$UCB algorithm and demonstrate that it enjoys the optimal regret bound for a more general problem that can take into account other objectives simultaneously. We also show that our technique would be applicable to related problems. Numerical experiments support our theoretical results and considerations.
The paper proposes a novel upper confidence bound (UCB) procedure for identifying the arm with the largest mean in a multi-armed bandit game in the fixed confidence setting using a small number of total samples. The procedure cannot be improved in th e sense that the number of samples required to identify the best arm is within a constant factor of a lower bound based on the law of the iterated logarithm (LIL). Inspired by the LIL, we construct our confidence bounds to explicitly account for the infinite time horizon of the algorithm. In addition, by using a novel stopping time for the algorithm we avoid a union bound over the arms that has been observed in other UCB-type algorithms. We prove that the algorithm is optimal up to constants and also show through simulations that it provides superior performance with respect to the state-of-the-art.
Contextual dynamic pricing aims to set personalized prices based on sequential interactions with customers. At each time period, a customer who is interested in purchasing a product comes to the platform. The customers valuation for the product is a linear function of contexts, including product and customer features, plus some random market noise. The seller does not observe the customers true valuation, but instead needs to learn the valuation by leveraging contextual information and historical binary purchase feedbacks. Existing models typically assume full or partial knowledge of the random noise distribution. In this paper, we consider contextual dynamic pricing with unknown random noise in the valuation model. Our distribution-free pricing policy learns both the contextual function and the market noise simultaneously. A key ingredient of our method is a novel perturbed linear bandit framework, where a modified linear upper confidence bound algorithm is proposed to balance the exploration of market noise and the exploitation of the current knowledge for better pricing. We establish the regret upper bound and a matching lower bound of our policy in the perturbed linear bandit framework and prove a sub-linear regret bound in the considered pricing problem. Finally, we demonstrate the superior performance of our policy on simulations and a real-life auto-loan dataset.
We present simple and efficient algorithms for the batched stochastic multi-armed bandit and batched stochastic linear bandit problems. We prove bounds for their expected regrets that improve over the best-known regret bounds for any number of batche s. In particular, our algorithms in both settings achieve the optimal expected regrets by using only a logarithmic number of batches. We also study the batched adversarial multi-armed bandit problem for the first time and find the optimal regret, up to logarithmic factors, of any algorithm with predetermined batch sizes.
104 - Junya Honda 2019
A classic setting of the stochastic K-armed bandit problem is considered in this note. In this problem it has been known that KL-UCB policy achieves the asymptotically optimal regret bound and KL-UCB+ policy empirically performs better than the KL-UC B policy although the regret bound for the original form of the KL-UCB+ policy has been unknown. This note demonstrates that a simple proof of the asymptotic optimality of the KL-UCB+ policy can be given by the same technique as those used for analyses of other known policies.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا