ﻻ يوجد ملخص باللغة العربية
In this paper, we consider the worst-case regret of Linear Thompson Sampling (LinTS) for the linear bandit problem. citet{russo2014learning} show that the Bayesian regret of LinTS is bounded above by $widetilde{mathcal{O}}(dsqrt{T})$ where $T$ is the time horizon and $d$ is the number of parameters. While this bound matches the minimax lower-bounds for this problem up to logarithmic factors, the existence of a similar worst-case regret bound is still unknown. The only known worst-case regret bound for LinTS, due to cite{agrawal2013thompson,abeille2017linear}, is $widetilde{mathcal{O}}(dsqrt{dT})$ which requires the posterior variance to be inflated by a factor of $widetilde{mathcal{O}}(sqrt{d})$. While this bound is far from the minimax optimal rate by a factor of $sqrt{d}$, in this paper we show that it is the best possible one can get, settling an open problem stated in cite{russo2018tutorial}. Specifically, we construct examples to show that, without the inflation, LinTS can incur linear regret up to time $exp(Omega(d))$. We then demonstrate that, under mild conditions, a slightly modified version of LinTS requires only an $widetilde{mathcal{O}}(1)$ inflation where the constant depends on the diversity of the optimal arm.
Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical pe
Efficient exploration in bandits is a fundamental online learning problem. We propose a variant of Thompson sampling that learns to explore better as it interacts with bandit instances drawn from an unknown prior. The algorithm meta-learns the prior
Thompson Sampling provides an efficient technique to introduce prior knowledge in the multi-armed bandit problem, along with providing remarkable empirical performance. In this paper, we revisit the Thompson Sampling algorithm under rewards drawn fro
We consider a finite-horizon multi-armed bandit (MAB) problem in a Bayesian setting, for which we propose an information relaxation sampling framework. With this framework, we define an intuitive family of control policies that include Thompson sampl
We address multi-armed bandits (MAB) where the objective is to maximize the cumulative reward under a probabilistic linear constraint. For a few real-world instances of this problem, constrained extensions of the well-known Thompson Sampling (TS) heu