
Quantum exploration algorithms for multi-armed bandits

Posted by Tongyang Li
Publication date: 2020
Research language: English





Identifying the best arm of a multi-armed bandit is a central problem in bandit optimization. We study a quantum computational version of this problem with coherent oracle access to states encoding the reward probabilities of each arm as quantum amplitudes. Specifically, we show that we can find the best arm with fixed confidence using $\tilde{O}\bigl(\sqrt{\sum_{i=2}^{n}\Delta_i^{-2}}\bigr)$ quantum queries, where $\Delta_i$ represents the difference between the mean reward of the best arm and the $i^{\text{th}}$-best arm. This algorithm, based on variable-time amplitude amplification and estimation, gives a quadratic speedup compared to the best possible classical result. We also prove a matching quantum lower bound (up to poly-logarithmic factors).
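
As a rough illustration of the claimed quadratic speedup, the Python sketch below (with made-up arm means) evaluates the gap-dependent hardness $\sum_{i\ge 2}\Delta_i^{-2}$ on a toy instance and compares the classical and quantum query counts implied by the abstract, ignoring poly-logarithmic factors and constants.

```python
# Toy comparison of the gap-dependent query complexities quoted in the abstract.
# Classical fixed-confidence best-arm identification needs roughly sum_i Delta_i^{-2}
# samples, while the quantum algorithm needs roughly sqrt(sum_i Delta_i^{-2}) queries
# (both up to poly-logarithmic factors). The arm means below are illustrative only.
import math

means = [0.9, 0.8, 0.7, 0.5, 0.5]           # mean rewards; arm 0 is the best
gaps = [means[0] - m for m in means[1:]]     # Delta_i for i = 2..n

hardness = sum(1.0 / d**2 for d in gaps)     # sum_{i>=2} Delta_i^{-2}
print(f"classical ~ {hardness:.1f} samples")
print(f"quantum   ~ {math.sqrt(hardness):.1f} queries")
```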


Read also

In the Best-$K$ identification problem (Best-$K$-Arm), we are given $N$ stochastic bandit arms with unknown reward distributions. Our goal is to identify the $K$ arms with the largest means with high confidence, by drawing samples from the arms adaptively. This problem is motivated by various practical applications and has attracted considerable attention in the past decade. In this paper, we propose new practical algorithms for the Best-$K$-Arm problem, which have nearly optimal sample complexity bounds (matching the lower bound up to logarithmic factors) and outperform the state-of-the-art algorithms for the Best-$K$-Arm problem (even for $K=1$) in practice.
We initiate the study of tradeoffs between exploration and exploitation in online learning of properties of quantum states. Given sequential oracle access to an unknown quantum state, in each round we are tasked with choosing an observable from a set of actions, aiming to maximize its expectation value on the state (the reward). Information gained about the unknown state from previous rounds can be used to gradually improve the choice of action, thus reducing the gap between the reward and the maximal reward attainable with the given action set (the regret). We provide various information-theoretic lower bounds on the cumulative regret that an optimal learner must incur, and show that it scales at least as the square root of the number of rounds played. We also investigate the dependence of the cumulative regret on the number of available actions and the dimension of the underlying space. Moreover, we exhibit strategies that are optimal for bandits with a finite number of arms and general mixed states. If we have a promise that the state is pure and the action set consists of all rank-1 projectors, the regret can be interpreted as the infidelity with the target state, and we provide some support for the conjecture that measurement strategies with lower regret exist for this case.
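
In the notation suggested by this abstract (with $\rho$ the unknown state, $\mathcal{A}$ the action set, and $O_t$ the observable chosen in round $t$), the cumulative regret after $T$ rounds can be written as the display below; the lower bound described above says this quantity grows at least on the order of $\sqrt{T}$ for any learner.

$$\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T}\Bigl(\max_{O\in\mathcal{A}}\operatorname{Tr}(O\rho) \;-\; \operatorname{Tr}(O_t\,\rho)\Bigr)$$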
We study incentivized exploration for the multi-armed bandit (MAB) problem, where the players receive compensation for exploring arms other than the greedy choice and may provide biased feedback on reward. We seek to understand the impact of this drifted reward feedback by analyzing the performance of three instantiations of the incentivized MAB algorithm: UCB, $\varepsilon$-Greedy, and Thompson Sampling. Our results show that they all achieve $\mathcal{O}(\log T)$ regret and compensation under the drifted reward, and are therefore effective in incentivizing exploration. Numerical examples are provided to complement the theoretical analysis.
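
To make the incentivized setting concrete, here is a minimal $\varepsilon$-Greedy sketch that also tracks compensation: whenever the learner explores a non-greedy arm, the principal pays the player the empirical advantage of the greedy arm. That payment rule is one simple convention chosen for illustration; the paper's exact compensation scheme and its drifted-reward model are not reproduced here.

```python
# Minimal epsilon-greedy bandit loop that also tracks compensation paid for
# exploration rounds. The compensation rule (empirical advantage of the greedy
# arm) and the Bernoulli reward model are illustrative assumptions, not the
# paper's specification.
import random

def eps_greedy_with_compensation(true_means, horizon=10_000, eps=0.05, seed=0):
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n
    sums = [0.0] * n
    compensation = 0.0

    for _ in range(horizon):
        empirical = [sums[i] / counts[i] if counts[i] else float("inf")
                     for i in range(n)]
        greedy = max(range(n), key=lambda i: empirical[i])
        if rng.random() < eps:
            arm = rng.randrange(n)                        # exploration round
            if arm != greedy and counts[arm] and counts[greedy]:
                compensation += empirical[greedy] - empirical[arm]
        else:
            arm = greedy                                  # exploitation round
        reward = 1.0 if rng.random() < true_means[arm] else 0.0   # Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward

    return counts, compensation

counts, comp = eps_greedy_with_compensation([0.7, 0.5, 0.4])
print(counts, round(comp, 2))
```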
We propose an online algorithm for cumulative regret minimization in a stochastic multi-armed bandit. The algorithm adds $O(t)$ i.i.d. pseudo-rewards to its history in round $t$ and then pulls the arm with the highest average reward in its perturbed history. Therefore, we call it perturbed-history exploration (PHE). The pseudo-rewards are carefully designed to offset potentially underestimated mean rewards of arms with a high probability. We derive near-optimal gap-dependent and gap-free bounds on the $n$-round regret of PHE. The key step in our analysis is a novel argument that shows that randomized Bernoulli rewards lead to optimism. Finally, we empirically evaluate PHE and show that it is competitive with state-of-the-art baselines.
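
The description of PHE above translates almost directly into a short simulation: in each round, augment every arm's history with a constant multiple of its pull count in i.i.d. Bernoulli pseudo-rewards (taken here to be fair coin flips, an assumption) and pull the arm with the highest perturbed average. The sketch below is a hedged paraphrase of that idea for Bernoulli bandits; the constant `a`, the initialization, and the exact pseudo-reward design are illustrative choices rather than the paper's specification.

```python
# Rough paraphrase of perturbed-history exploration (PHE) for Bernoulli bandits:
# each round, every arm's history is augmented with a * (its pull count)
# i.i.d. Bernoulli(1/2) pseudo-rewards, and the arm with the highest perturbed
# average is pulled.
import random

def phe(true_means, horizon=2_000, a=2, seed=0):
    rng = random.Random(seed)
    n = len(true_means)
    pulls = [0] * n
    reward_sums = [0.0] * n

    for t in range(horizon):
        if t < n:
            arm = t                                       # pull each arm once to start
        else:
            perturbed = []
            for i in range(n):
                # add a * pulls[i] i.i.d. Bernoulli(1/2) pseudo-rewards to arm i's history
                pseudo = sum(rng.random() < 0.5 for _ in range(a * pulls[i]))
                perturbed.append((reward_sums[i] + pseudo) / ((1 + a) * pulls[i]))
            arm = max(range(n), key=lambda i: perturbed[i])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0   # Bernoulli reward
        pulls[arm] += 1
        reward_sums[arm] += reward

    return pulls

print(phe([0.7, 0.5, 0.4]))      # most pulls should go to the 0.7 arm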
Sepehr Assadi, Chen Wang (2020)
Consider the following abstract coin tossing problem: Given a set of $n$ coins with unknown biases, find the most biased coin using a minimal number of coin tosses. This is a common abstraction of various exploration problems in theoretical computer science and machine learning and has been studied extensively over the years. In particular, algorithms with optimal sample complexity (number of coin tosses) have been known for this problem for quite some time. Motivated by applications to processing massive datasets, we study the space complexity of solving this problem with an optimal number of coin tosses in the streaming model. In this model, the coins arrive one by one and the algorithm is only allowed to store a limited number of coins at any point -- any coin not present in memory is lost and can no longer be tossed or compared to arriving coins. Prior algorithms for the coin tossing problem with optimal sample complexity are based on iterative elimination of coins, which inherently requires storing all the coins, leading to memory-inefficient streaming algorithms. We remedy this state of affairs by presenting a series of improved streaming algorithms for this problem: we start with a simple algorithm which requires storing only $O(\log n)$ coins, and then iteratively refine it further and further, leading to algorithms with $O(\log\log n)$ memory, $O(\log^* n)$ memory, and finally one that stores only a single extra coin in memory -- exactly the space needed just to store the best coin throughout the stream. Furthermore, we extend our algorithms to the problem of finding the $k$ most biased coins as well as other exploration problems such as finding the top-$k$ elements using noisy comparisons or finding an $\epsilon$-best arm in stochastic multi-armed bandits, and obtain efficient streaming algorithms for these problems.
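
For intuition only, here is a minimal sketch of the memory pattern behind the final result above: a streaming pass that keeps a single champion coin and duels it against each arriving coin with a fixed toss budget. This is not the paper's algorithm -- its toss budget and comparison rule are calibrated far more carefully to keep the overall sample complexity optimal with high probability -- but it shows what "storing only one extra coin" looks like in code.

```python
# Illustrative champion-style streaming pass over coins with unknown biases:
# keep one candidate coin in memory and, for each arriving coin, toss both a
# fixed number of times and keep the empirical winner. This shows the memory
# pattern only; the toss budget here is a naive placeholder.
import random

def stream_champion(biases, tosses_per_duel=200, seed=0):
    rng = random.Random(seed)

    def empirical(bias, k):
        # simulate k tosses of a coin with the given (unknown-to-the-algorithm) bias
        return sum(rng.random() < bias for _ in range(k)) / k

    champion = None
    for coin in biases:                      # coins arrive one by one
        if champion is None:
            champion = coin
            continue
        if empirical(coin, tosses_per_duel) > empirical(champion, tosses_per_duel):
            champion = coin                  # arriving coin displaces the champion
    return champion

print(stream_champion([0.4, 0.55, 0.6, 0.45, 0.7, 0.5]))
```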

