Solving Bernoulli Rank-One Bandits with Unimodal Thompson Sampling

80 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Emilie Kaufmann

تاريخ النشر 2019

مجال البحث الاحصاء الرياضي الهندسة المعلوماتية

والبحث باللغة English

تأليف Cindy Trinh - Emilie Kaufmann (CNRS - CRIStAL

التعلم الالي التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Stochastic Rank-One Bandits (Katarya et al, (2017a,b)) are a simple framework for regret minimization problems over rank-one matrices of arms. The initially proposed algorithms are proved to have logarithmic regret, but do not match the existing lower bound for this problem. We close this gap by first proving that rank-one bandits are a particular instance of unimodal bandits, and then providing a new analysis of Unimodal Thompson Sampling (UTS), initially proposed by Paladino et al (2017). We prove an asymptotically optimal regret bound on the frequentist regret of UTS and we support our claims with simulations showing the significant improvement of our method compared to the state-of-the-art.

قيم البحث

اقرأ أيضاً

Thompson Sampling for Unimodal Bandits

120 - Long Yang , Zhao Li , Zehong Hu 2021

In this paper, we propose a Thompson Sampling algorithm for emph{unimodal} bandits, where the expected reward is unimodal over the partially ordered arms. To exploit the unimodal structure better, at each step, instead of exploration from the entire decision space, our algorithm makes decision according to posterior distribution only in the neighborhood of the arm that has the highest empirical mean estimate. We theoretically prove that, for Bernoulli rewards, the regret of our algorithm reaches the lower bound of unimodal bandits, thus it is asymptotically optimal. For Gaussian rewards, the regret of our algorithm is $mathcal{O}(log T)$, which is far better than standard Thompson Sampling algorithms. Extensive experiments demonstrate the effectiveness of the proposed algorithm on both synthetic data sets and the real-world applications.

التعلم الآلي الذكاء الاصطناعي

On the Performance of Thompson Sampling on Logistic Bandits

72 - Shi Dong , Tengyu Ma , Benjamin Van Roy 2019

We study the logistic bandit, in which rewards are binary with success probability $exp(beta a^top theta) / (1 + exp(beta a^top theta))$ and actions $a$ and coefficients $theta$ are within the $d$-dimensional unit ball. While prior regret bounds for algorithms that address the logistic bandit exhibit exponential dependence on the slope parameter $beta$, we establish a regret bound for Thompson sampling that is independent of $beta$. Specifically, we establish that, when the set of feasible actions is identical to the set of possible coefficient vectors, the Bayesian regret of Thompson sampling is $tilde{O}(dsqrt{T})$. We also establish a $tilde{O}(sqrt{deta T}/lambda)$ bound that applies more broadly, where $lambda$ is the worst-case optimal log-odds and $eta$ is the fragility dimension, a new statistic we define to capture the degree to which an optimal action for one model fails to satisfice for others. We demonstrate that the fragility dimension plays an essential role by showing that, for any $epsilon > 0$, no algorithm can achieve $mathrm{poly}(d, 1/lambda)cdot T^{1-epsilon}$ regret.

التعلم الالي التعلم الآلي

Thompson Sampling for Dynamic Pricing

83 - Ravi Ganti , Matyas Sustik , Quoc Tran 2018

In this paper we apply active learning algorithms for dynamic pricing in a prominent e-commerce website. Dynamic pricing involves changing the price of items on a regular basis, and uses the feedback from the pricing decisions to update prices of the items. Most popular approaches to dynamic pricing use a passive learning approach, where the algorithm uses historical data to learn various parameters of the pricing problem, and uses the updated parameters to generate a new set of prices. We show that one can use active learning algorithms such as Thompson sampling to more efficiently learn the underlying parameters in a pricing problem. We apply our algorithms to a real e-commerce system and show that the algorithms indeed improve revenue compared to pricing algorithms that use passive learning.

التعلم الالي التعلم الآلي

Thompson Sampling for Contextual Bandits with Linear Payoffs

565 - Shipra Agrawal , Navin Goyal 2012

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical pe rformance compared to the state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studi

التعلم الآلي بنى وهياكل البيانات والخوارزميات التعلم الالي

Thompson Sampling for Linearly Constrained Bandits

103 - Vidit Saxena , Joseph E. Gonzalez , 2020

We address multi-armed bandits (MAB) where the objective is to maximize the cumulative reward under a probabilistic linear constraint. For a few real-world instances of this problem, constrained extensions of the well-known Thompson Sampling (TS) heu ristic have recently been proposed. However, finite-time analysis of constrained TS is challenging; as a result, only O(sqrt{T}) bounds on the cumulative reward loss (i.e., the regret) are available. In this paper, we describe LinConTS, a TS-based algorithm for bandits that place a linear constraint on the probability of earning a reward in every round. We show that for LinConTS, the regret as well as the cumulative constraint violations are upper bounded by O(log T) for the suboptimal arms. We develop a proof technique that relies on careful analysis of the dual problem and combine it with recent theoretical work on unconstrained TS. Through numerical experiments on two real-world datasets, we demonstrate that LinConTS outperforms an asymptotically optimal upper confidence bound (UCB) scheme in terms of simultaneously minimizing the regret and the violation.

التعلم الآلي التعلم الالي