
Bandit Linear Control

Published by Asaf Cassel
Publication date: 2020
Language: English
Author: Asaf Cassel





We consider the problem of controlling a known linear dynamical system under stochastic noise, adversarially chosen costs, and bandit feedback. Unlike the full feedback setting where the entire cost function is revealed after each decision, here only the cost incurred by the learner is observed. We present a new and efficient algorithm that, for strongly convex and smooth costs, obtains regret that grows with the square root of the time horizon $T$. We also give extensions of this result to general convex, possibly non-smooth costs, and to non-stochastic system noise. A key component of our algorithm is a new technique for addressing bandit optimization of loss functions with memory.
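The abstract does not spell out the bandit optimization technique, so here is a minimal sketch of the classical one-point gradient estimator (Flaxman et al., 2005) that underlies bandit convex optimization of this kind. The function and parameter names (`cost_oracle`, `delta`, `eta`) are illustrative assumptions; the paper's actual algorithm additionally handles loss functions with memory, which this sketch does not.

```python
import numpy as np

def one_point_bandit_step(theta, cost_oracle, delta=0.1, eta=0.01):
    """One round of one-point bandit gradient descent (sketch).

    theta: current decision point (d-dimensional)
    cost_oracle: returns only the scalar cost of the played point
    delta: exploration radius; eta: step size (both illustrative)
    """
    d = theta.shape[0]
    u = np.random.randn(d)
    u /= np.linalg.norm(u)                   # uniform direction on the unit sphere
    cost = cost_oracle(theta + delta * u)    # bandit feedback: a single scalar
    grad_est = (d / delta) * cost * u        # unbiased gradient estimate of the
                                             # delta-smoothed cost
    return theta - eta * grad_est
```

The key point is that a single scalar observation per round suffices to build an unbiased gradient estimate of a smoothed surrogate of the cost, which is what makes the bandit feedback model workable.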




Read also

We consider online multiclass linear classification under the bandit feedback setting. Beygelzimer, Pál, Szörényi, Thiruvenkatachari, Wei, and Zhang [ICML'19] considered two notions of linear separability, weak and strong linear separability. When examples are strongly linearly separable with margin $\gamma$, they presented an algorithm based on Multiclass Perceptron with mistake bound $O(K/\gamma^2)$, where $K$ is the number of classes. They employed a rational kernel to deal with examples under the weakly linearly separable condition, and obtained a mistake bound of $\min\big(K\cdot 2^{\tilde{O}(K\log^2(1/\gamma))},\, K\cdot 2^{\tilde{O}(\sqrt{1/\gamma}\log K)}\big)$. In this paper, we refine the notion of weak linear separability to support the notion of class grouping, called the group weak linear separability condition. This situation may arise from the fact that class structures contain inherent grouping. We show that under this condition, we can also use the rational kernel and obtain a mistake bound of $K\cdot 2^{\tilde{O}(\sqrt{1/\gamma}\log L)}$, where $L\leq K$ is the number of groups.
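For orientation, here is a minimal Banditron-style sketch (Kakade et al., 2008) of Perceptron-based multiclass learning under bandit feedback. It is not the refined algorithm of this paper (the rational-kernel and group-separability machinery is omitted), and the exploration rate `gamma_explore` is an illustrative parameter.

```python
import numpy as np

def banditron_round(W, x, true_label, gamma_explore=0.1):
    """One round of Banditron-style bandit multiclass classification.

    W: K x d weight matrix; x: d-dimensional example.
    true_label is used only to generate the 0/1 bandit feedback.
    """
    K = W.shape[0]
    y_hat = int(np.argmax(W @ x))             # greedy prediction
    probs = np.full(K, gamma_explore / K)     # explore uniformly...
    probs[y_hat] += 1.0 - gamma_explore       # ...otherwise exploit
    y_tilde = np.random.choice(K, p=probs)    # label actually played
    correct = (y_tilde == true_label)         # bandit feedback: one bit
    U = np.zeros_like(W)
    if correct:
        U[y_tilde] += x / probs[y_tilde]      # importance-weighted update
    U[y_hat] -= x                             # Perceptron-style correction
    return W + U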
Yihan Du, Yuko Kuroki, Wei Chen (2020)
In this paper, we first study the problem of combinatorial pure exploration with full-bandit feedback (CPE-BL), where a learner is given a combinatorial action space $\mathcal{X} \subseteq \{0,1\}^d$, and in each round the learner pulls an action $x \in \mathcal{X}$ and receives a random reward with expectation $x^{\top}\theta$, with $\theta \in \mathbb{R}^d$ a latent and unknown environment vector. The objective is to identify the optimal action with the highest expected reward, using as few samples as possible. For CPE-BL, we design the first polynomial-time adaptive algorithm, whose sample complexity matches the lower bound (within a logarithmic factor) for a family of instances and has a light dependence on $\Delta_{\min}$ (the smallest gap between the optimal action and sub-optimal actions). Furthermore, we propose a novel generalization of CPE-BL with flexible feedback structures, called combinatorial pure exploration with partial linear feedback (CPE-PL), which encompasses several families of sub-problems including full-bandit feedback, semi-bandit feedback, partial feedback, and nonlinear reward functions. In CPE-PL, each pull of action $x$ reports a random feedback vector with expectation $M_x \theta$, where $M_x \in \mathbb{R}^{m_x \times d}$ is a transformation matrix for $x$, and gains a random (possibly nonlinear) reward related to $x$. For CPE-PL, we develop the first polynomial-time algorithm, which simultaneously addresses limited feedback, a general reward function, and a combinatorial action space, and we provide its sample complexity analysis. Our empirical evaluation demonstrates that our algorithms run orders of magnitude faster than existing ones; our CPE-BL algorithm is robust across different $\Delta_{\min}$ settings, while our CPE-PL algorithm is the only one returning correct answers for nonlinear reward functions.
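To make the CPE-BL feedback model concrete, here is a naive non-adaptive baseline (not the paper's adaptive algorithm): sample actions round-robin, estimate $\theta$ by least squares, and return the empirically best action. The `pull` oracle and sampling budget are illustrative assumptions.

```python
import numpy as np

def naive_cpe_bl(actions, pull, budget):
    """Naive baseline for CPE-BL: uniform sampling + least squares.

    actions: list of binary vectors x in {0,1}^d
    pull(x): returns a noisy reward with mean x @ theta (full-bandit feedback)
    """
    X, r = [], []
    for t in range(budget):
        x = actions[t % len(actions)]        # round-robin, non-adaptive sampling
        X.append(x)
        r.append(pull(x))
    theta_hat, *_ = np.linalg.lstsq(np.array(X), np.array(r), rcond=None)
    scores = [x @ theta_hat for x in actions]
    return actions[int(np.argmax(scores))]   # estimated optimal action
```

The paper's contribution is precisely to replace this uniform sampling with an adaptive strategy whose sample complexity nearly matches the lower bound.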
We study the problem of controlling linear time-invariant systems with known noisy dynamics and adversarially chosen quadratic losses. We present the first efficient online learning algorithms in this setting that guarantee $O(\sqrt{T})$ regret under mild assumptions, where $T$ is the time horizon. Our algorithms rely on a novel SDP relaxation for the steady-state distribution of the system. Crucially, and in contrast to previously proposed relaxations, the feasible solutions of our SDP all correspond to strongly stable policies that mix exponentially fast to a steady state.
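The abstract does not write out the SDP; the following cvxpy sketch shows the general shape of a steady-state-covariance SDP for linear-quadratic control in the spirit of this line of work. The matrix names and constraint set are assumptions for illustration, not this paper's exact relaxation (which additionally enforces strong stability).

```python
import cvxpy as cp
import numpy as np

def steady_state_sdp(A, B, Q, R, W):
    """Sketch of a steady-state covariance SDP for linear control.

    Decision variable: joint state/control covariance Sigma (PSD).
    The linear constraint forces Sigma's state block to be the
    stationary covariance of x' = A x + B u + w, with Cov(w) = W.
    """
    n, m = B.shape
    Sigma = cp.Variable((n + m, n + m), PSD=True)
    AB = np.hstack([A, B])
    cost = cp.trace(Q @ Sigma[:n, :n]) + cp.trace(R @ Sigma[n:, n:])
    constraints = [Sigma[:n, :n] == AB @ Sigma @ AB.T + W]
    cp.Problem(cp.Minimize(cost), constraints).solve()
    # Recover a linear policy u = K x from the optimal covariance.
    K = Sigma.value[n:, :n] @ np.linalg.inv(Sigma.value[:n, :n])
    return K
```

The objective is the expected steady-state quadratic cost, and the policy is read off from the cross-covariance between control and state blocks.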
We investigate the sparse linear contextual bandit problem, where the parameter $\theta$ is sparse. To relieve the sampling inefficiency, we utilize the perturbed adversary, where the context is generated adversarially but with small random non-adaptive perturbations. We prove that the simple online Lasso supports sparse linear contextual bandits with regret bound $\mathcal{O}(\sqrt{kT\log d})$ even when $d \gg T$, where $k$ and $d$ are the effective and ambient dimensions, respectively. Compared to the recent work of Sivakumar et al. (2020), our analysis does not rely on precondition processing, adaptive perturbation (adaptive perturbation violates the i.i.d. perturbation setting), or truncation on the error set. Moreover, the special structures in our results explicitly characterize how the perturbation affects the exploration length, guiding the design of the perturbation together with the fundamental performance limits of the perturbation method. Numerical experiments are provided to complement the theoretical analysis.
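As a minimal sketch of the kind of online Lasso update the abstract refers to: a gradient step on the squared error of the played context, followed by soft-thresholding to promote sparsity. The arm-selection rule and the perturbation analysis of the paper are omitted, and the parameters `eta` and `lam` are illustrative.

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of the L1 norm (coordinatewise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def online_lasso_step(theta, x, reward, eta=0.05, lam=0.01):
    """One online Lasso update on the observed (context, reward) pair."""
    grad = (x @ theta - reward) * x              # gradient of squared error
    return soft_threshold(theta - eta * grad, eta * lam)
```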
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward. Although many algorithms have been proposed for the contextual bandit, most of them rely on finding the maximum likelihood estimator at each iteration, which requires $O(t)$ time at the $t$-th iteration and is memory-inefficient. A natural way to resolve this problem is to apply online stochastic gradient descent (SGD), so that the per-step time and memory complexity can be reduced to constant with respect to $t$, but a contextual bandit policy based on online SGD updates that balances exploration and exploitation has remained elusive. In this work, we show that online SGD can be applied to the generalized linear bandit problem. The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information and uses Thompson Sampling for exploration, achieves $\tilde{O}(\sqrt{T})$ regret with total time complexity that scales linearly in $T$ and $d$, where $T$ is the total number of rounds and $d$ is the number of features. Experimental results show that SGD-TS consistently outperforms existing algorithms on both synthetic and real datasets.
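A minimal sketch of one round in the SGD-plus-Thompson-Sampling style described above, specialized to linear rewards for brevity; the paper's SGD-TS targets generalized linear rewards and uses a more careful schedule, so `eta` and `v` and the `pull` oracle are illustrative assumptions.

```python
import numpy as np

def sgd_ts_round(theta, contexts, pull, rng, eta=0.05, v=0.1):
    """One round: Thompson Sampling to explore, one SGD step to exploit.

    contexts: array of shape (K, d), one feature vector per arm.
    pull(arm): returns the observed reward of the chosen arm.
    """
    d = theta.shape[0]
    theta_tilde = theta + v * rng.standard_normal(d)   # Thompson sample
    arm = int(np.argmax(contexts @ theta_tilde))       # act greedily on the sample
    reward = pull(arm)                                 # bandit feedback
    x = contexts[arm]
    grad = (x @ theta - reward) * x                    # single-step SGD update
    return theta - eta * grad, arm
```

Because each round touches only the chosen context, both the per-step time and the memory stay constant in $t$, which is the efficiency gain the abstract emphasizes.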

