We consider the stochastic bandit problem with a continuous set of arms, with the expected reward function over the arms assumed to be fixed but unknown. We provide two new Gaussian process-based algorithms for continuous bandit optimization: Improved GP-UCB (IGP-UCB) and GP-Thompson Sampling (GP-TS), and derive corresponding regret bounds. Specifically, the bounds hold when the expected reward function belongs to the reproducing kernel Hilbert space (RKHS) that naturally corresponds to a Gaussian process kernel used as input by the algorithms. Along the way, we derive a new self-normalized concentration inequality for vector-valued martingales of arbitrary, possibly infinite, dimension. Finally, experimental evaluation and comparisons to existing algorithms on synthetic and real-world environments are carried out, highlighting the favorable gains of the proposed strategies in many cases.
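To make the GP-UCB-style selection rule these algorithms build on concrete, the following is a minimal sketch of a generic GP-UCB loop on a toy one-dimensional problem. The RBF kernel, the noise level, and the confidence multiplier beta_t are illustrative placeholders, not the tighter RKHS-norm-dependent confidence width derived for IGP-UCB in the paper, and the discretized arm grid stands in for the continuous arm set.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=0.2):
    """Squared-exponential kernel k(x, y) = exp(-||x - y||^2 / (2 l^2))."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_obs, y_obs, X_query, noise=0.1, lengthscale=0.2):
    """GP posterior mean and standard deviation at the query points."""
    K = rbf_kernel(X_obs, X_obs, lengthscale) + noise**2 * np.eye(len(X_obs))
    k_star = rbf_kernel(X_obs, X_query, lengthscale)
    K_inv = np.linalg.inv(K)
    mean = k_star.T @ K_inv @ y_obs
    var = 1.0 - np.einsum('ij,ji->i', k_star.T @ K_inv, k_star)
    return mean, np.sqrt(np.clip(var, 1e-12, None))

# Toy continuous-armed bandit: arms are points in [0, 1], reward is f(x) + noise.
rng = np.random.default_rng(0)
f = lambda x: np.sin(6 * x) * x            # unknown reward function (simulation only)
arms = np.linspace(0, 1, 200)[:, None]     # fine discretization of the arm set

X_obs, y_obs = np.empty((0, 1)), np.empty(0)
for t in range(1, 51):
    if len(X_obs) == 0:
        x_t = arms[rng.integers(len(arms))]            # first pull: arbitrary arm
    else:
        mean, std = gp_posterior(X_obs, y_obs, arms)
        beta_t = 2.0 * np.log(len(arms) * t**2)        # assumed confidence schedule,
        ucb = mean + np.sqrt(beta_t) * std             # not the paper's exact beta_t
        x_t = arms[np.argmax(ucb)]                     # pull the arm with highest UCB
    r_t = f(x_t[0]) + 0.1 * rng.standard_normal()      # observe noisy reward
    X_obs = np.vstack([X_obs, x_t[None, :]])
    y_obs = np.append(y_obs, r_t)

print("best arm found:", X_obs[np.argmax(y_obs)])
```

GP-TS follows the same posterior-update loop but replaces the UCB step with an arm chosen to maximize a sample drawn from the GP posterior.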
We introduce a new class of reinforcement learning methods referred to as {\em episodic multi-armed bandits} (eMAB). In eMAB the learner proceeds in {\em episodes}, each composed of several {\em steps}, in which it chooses an action and observes a feedb
We study a variant of the classical multi-armed bandit problem (MABP), which we call Multi-Armed Bandits with dependent arms. More specifically, multiple arms are grouped together to form a cluster, and the reward distributions of arms belonging to
We consider the problem where $N$ agents collaboratively interact with an instance of a stochastic $K$-armed bandit problem for $K \gg N$. The agents aim to simultaneously minimize the cumulative regret over all the agents for a total of $T$ time steps,
This paper studies a new variant of the stochastic multi-armed bandits problem, where the learner has access to auxiliary information about the arms. The auxiliary information is correlated with the arm rewards, which we treat as control variates. In
During online decision making in Multi-Armed Bandits (MAB), one needs to conduct inference at each step on the true mean reward of each arm, based on the data collected so far. However, since the arms are adaptively selected, thereby yielding non-iid data