We consider the problem of controlling a known linear dynamical system under stochastic noise, adversarially chosen costs, and bandit feedback. Unlike the full feedback setting where the entire cost function is revealed after each decision, here only the cost incurred by the learner is observed. We present a new and efficient algorithm that, for strongly convex and smooth costs, obtains regret that grows with the square root of the time horizon $T$. We also give extensions of this result to general convex, possibly non-smooth costs, and to non-stochastic system noise. A key component of our algorithm is a new technique for addressing bandit optimization of loss functions with memory.
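For concreteness, the setting described above can be written out as a short sketch. The symbols $A$, $B$, $w_t$, $c_t$, and the comparator class $\Pi$ are notation assumed here for illustration; the abstract names none of them and does not specify the comparator class or any logarithmic factors in the bound.

```latex
% Assumed notation: x_t = state, u_t = control, w_t = stochastic noise,
% (A, B) = the known system matrices, c_t = adversarially chosen cost.
\begin{align*}
  x_{t+1} &= A x_t + B u_t + w_t
    && \text{(known linear dynamics)} \\
  \text{feedback at round } t &:\ \text{the scalar } c_t(x_t, u_t)
    && \text{(bandit: the function } c_t \text{ itself is never revealed)} \\
  \mathrm{Regret}_T &= \sum_{t=1}^{T} c_t(x_t, u_t)
    \;-\; \min_{\pi \in \Pi} \sum_{t=1}^{T} c_t\bigl(x_t^{\pi}, u_t^{\pi}\bigr)
    && \text{(}\Pi\text{: assumed comparator class)}
\end{align*}
```

Under this notation, the main claim reads: for strongly convex and smooth $c_t$, $\mathrm{Regret}_T$ grows with $\sqrt{T}$. Because $x_t$ depends on past controls through the dynamics, each observed cost is effectively a function of several past decisions, which is why bandit optimization of losses with memory arises.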
We consider the online multiclass linear classification under the bandit feedback setting. Beygelzimer, Pál, Szörényi, Thiruvenkatachari, Wei, and Zhang [ICML19] considered two notions of linear separability, weak and strong linear separability
In this paper, we first study the problem of combinatorial pure exploration with full-bandit feedback (CPE-BL), where a learner is given a combinatorial action space $\mathcal{X} \subseteq \{0,1\}^d$, and in each round the learner pulls an action $x in m
We study the problem of controlling linear time-invariant systems with known noisy dynamics and adversarially chosen quadratic losses. We present the first efficient online learning algorithms in this setting that guarantee $O(\sqrt{T})$ regret under
We investigate the sparse linear contextual bandit problem where the parameter $\theta$ is sparse. To relieve the sampling inefficiency, we utilize the perturbed adversary where the context is generated adversarially but with small random non-adaptive
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward. Although many algorithms have been proposed for contextual bandit, most of them rely on finding the m