ﻻ يوجد ملخص باللغة العربية
In this paper, we first study the problem of combinatorial pure exploration with full-bandit feedback (CPE-BL), where a learner is given a combinatorial action space $mathcal{X} subseteq {0,1}^d$, and in each round the learner pulls an action $x in mathcal{X}$ and receives a random reward with expectation $x^{top} theta$, with $theta in mathbb{R}^d$ a latent and unknown environment vector. The objective is to identify the optimal action with the highest expected reward, using as few samples as possible. For CPE-BL, we design the first {em polynomial-time adaptive} algorithm, whose sample complexity matches the lower bound (within a logarithmic factor) for a family of instances and has a light dependence of $Delta_{min}$ (the smallest gap between the optimal action and sub-optimal actions). Furthermore, we propose a novel generalization of CPE-BL with flexible feedback structures, called combinatorial pure exploration with partial linear feedback (CPE-PL), which encompasses several families of sub-problems including full-bandit feedback, semi-bandit feedback, partial feedback and nonlinear reward functions. In CPE-PL, each pull of action $x$ reports a random feedback vector with expectation of $M_{x} theta $, where $M_x in mathbb{R}^{m_x times d}$ is a transformation matrix for $x$, and gains a random (possibly nonlinear) reward related to $x$. For CPE-PL, we develop the first {em polynomial-time} algorithm, which simultaneously addresses limited feedback, general reward function and combinatorial action space, and provide its sample complexity analysis. Our empirical evaluation demonstrates that our algorithms run orders of magnitude faster than the existing ones, and our CPE-BL algorithm is robust across different $Delta_{min}$ settings while our CPE-PL algorithm is the only one returning correct answers for nonlinear reward functions.
Combinatorial optimization is one of the fundamental research fields that has been extensively studied in theoretical computer science and operations research. When developing an algorithm for combinatorial optimization, it is commonly assumed that p
Exploration policies in Bayesian bandits maximize the average reward over problem instances drawn from some distribution $mathcal{P}$. In this work, we learn such policies for an unknown distribution $mathcal{P}$ using samples from $mathcal{P}$. Our
Conservative mechanism is a desirable property in decision-making problems which balance the tradeoff between the exploration and exploitation. We propose the novel emph{conservative contextual combinatorial cascading bandit ($C^4$-bandit)}, a cascad
We formulate a new problem at the intersectionof semi-supervised learning and contextual bandits,motivated by several applications including clini-cal trials and ad recommendations. We demonstratehow Graph Convolutional Network (GCN), a semi-supervis
We study the combinatorial pure exploration problem Best-Set in stochastic multi-armed bandits. In a Best-Set instance, we are given $n$ arms with unknown reward distributions, as well as a family $mathcal{F}$ of feasible subsets over the arms. Our g