This paper considers stochastic linear bandits with general nonlinear constraints. The objective is to maximize the expected cumulative reward over horizon $T$ subject to a set of constraints in each round $\tau \leq T$. We propose a pessimistic-optimistic algorithm for this problem, which is efficient in two aspects. First, the algorithm yields $\tilde{\mathcal{O}}\left(\left(\frac{K^{0.75}}{\delta}+d\right)\sqrt{\tau}\right)$ (pseudo) regret in round $\tau \leq T$, where $K$ is the number of constraints, $d$ is the dimension of the reward feature space, and $\delta$ is a Slater's constant; and zero constraint violation in any round $\tau > \tau'$, where $\tau'$ is independent of horizon $T$. Second, the algorithm is computationally efficient. Our algorithm is based on the primal-dual approach in optimization and includes two components. The primal component is similar to unconstrained stochastic linear bandits (our algorithm uses the linear upper confidence bound algorithm, LinUCB). The computational complexity of the dual component depends on the number of constraints, but is independent of the sizes of the contextual space, the action space, and the feature space. Thus, the overall computational complexity of our algorithm is similar to that of LinUCB for unconstrained stochastic linear bandits.
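To make the primal-dual structure concrete, below is a minimal sketch of a single round, assuming a finite set of candidate actions and already-available per-constraint cost estimates; the function name, the `cost_estimates` input, the step size `eta`, and the `thresholds` vector are illustrative assumptions rather than the paper's exact construction. The primal step is a standard LinUCB index, and the dual step is a projected gradient update whose cost scales only with the number of constraints $K$.

```python
import numpy as np

def primal_dual_round(actions, theta_hat, V_inv, duals, cost_estimates,
                      alpha=1.0, eta=0.1, thresholds=None):
    """Illustrative sketch of one primal-dual LinUCB-style round (not the paper's exact algorithm).

    actions        : (n, d) feature vectors, one row per candidate action
    theta_hat      : (d,) ridge-regression estimate of the reward parameter
    V_inv          : (d, d) inverse of the regularized design matrix
    duals          : (K,) nonnegative dual variables, one per constraint
    cost_estimates : (n, K) estimated per-action cost for each constraint
    thresholds     : (K,) constraint budgets (defaults to zeros)
    """
    if thresholds is None:
        thresholds = np.zeros(duals.shape[0])

    # Primal step: optimistic (UCB) reward index, as in unconstrained LinUCB.
    widths = np.sqrt(np.einsum('nd,dk,nk->n', actions, V_inv, actions))
    ucb_reward = actions @ theta_hat + alpha * widths

    # Pessimistic penalty: weight the estimated constraint costs by the duals.
    penalty = cost_estimates @ duals

    # Play the action with the best penalized optimistic index.
    a = int(np.argmax(ucb_reward - penalty))

    # Dual step: projected gradient ascent on the constraint violations;
    # its cost depends on K only, not on the action or feature space sizes.
    duals = np.maximum(0.0, duals + eta * (cost_estimates[a] - thresholds))
    return a, duals
```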
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies whose expected cumulative reward over the course of $T$ rounds is maximum, and each has an expected cost below a certain threshold.
This paper considers constrained online dispatching with unknown arrival, reward, and constraint distributions. We propose a novel online dispatching algorithm, named POND, standing for Pessimistic-Optimistic oNline Dispatching, which achieves $O(\sqrt{T})$ regret.
Bandit algorithms have various applications in safety-critical systems, where it is important to respect the system constraints that rely on the bandit's unknown parameters at every round. In this paper, we formulate a linear stochastic multi-armed bandit problem with safety constraints.
We propose an algorithm for stochastic and adversarial multi-armed bandits with switching costs, where the algorithm pays a price $\lambda$ every time it switches the arm being played. Our algorithm is based on an adaptation of the Tsallis-INF algorithm of Zimmert and Seldin.
We derive an algorithm that achieves the optimal (within constants) pseudo-regret in both adversarial and stochastic multi-armed bandits without prior knowledge of the regime and time horizon. The algorithm is based on online mirror descent (OMD) with Tsallis entropy regularization.
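As a rough illustration of the OMD step, the $1/2$-Tsallis regularizer yields a sampling distribution of the form $p_i \propto \left(\eta(\hat{L}_i - x)\right)^{-2}$, where $\hat{L}_i$ are cumulative loss estimates and $x$ is a normalizing scalar. The sketch below solves for $x$ with Newton's method; the function name, the initialization, and the fixed iteration count are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def tsallis_inf_weights(cum_losses, eta, iters=50):
    """Sketch: sampling distribution from OMD with the 1/2-Tsallis entropy regularizer.

    Solves  p_i = 4 / (eta * (L_i - x))**2  with x chosen so that sum(p) = 1.
    cum_losses : (n,) cumulative (importance-weighted) loss estimates
    eta        : learning rate, e.g. decreasing like 1/sqrt(t)
    """
    # The normalizer x must lie below min(L_i); this starting point keeps
    # Newton's method on the correct side of the root (illustrative choice).
    x = np.min(cum_losses) - 2.0 / eta
    for _ in range(iters):
        w = 4.0 / (eta * (cum_losses - x)) ** 2
        f = np.sum(w) - 1.0                                   # enforce sum(p) = 1
        grad = np.sum(8.0 / (eta ** 2 * (cum_losses - x) ** 3))
        x = x - f / grad                                      # Newton update
    w = 4.0 / (eta * (cum_losses - x)) ** 2
    return w / np.sum(w)                                      # guard against residual error
```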