Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation

286 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Quanquan Gu

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Jiafan He - Dongruo Zhou - Quanquan Gu

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We study reinforcement learning (RL) with linear function approximation. Existing algorithms for this problem only have high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees, which cannot guarantee the convergence to the optimal policy. In this paper, in order to overcome the limitation of existing algorithms, we propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability. The uniform-PAC guarantee is the strongest possible guarantee for reinforcement learning in the literature, which can directly imply both PAC and high probability regret bounds, making our algorithm superior to all existing algorithms with linear function approximation. At the core of our algorithm is a novel minimax value function estimator and a multi-level partition scheme to select the training samples from historical observations. Both of these techniques are new and of independent interest.

قيم البحث

93 - Jiafan He , Dongruo Zhou , Quanquan Gu 2020

Reinforcement learning (RL) with linear function approximation has received increasing attention recently. However, existing work has focused on obtaining $sqrt{T}$-type regret bound, where $T$ is the number of interactions with the MDP. In this pape r, we show that logarithmic regret is attainable under two recently proposed linear MDP assumptions provided that there exists a positive sub-optimality gap for the optimal action-value function. More specifically, under the linear MDP assumption (Jin et al. 2019), the LSVI-UCB algorithm can achieve $tilde{O}(d^{3}H^5/text{gap}_{text{min}}cdot log(T))$ regret; and under the linear mixture MDP assumption (Ayoub et al. 2020), the UCRL-VTR algorithm can achieve $tilde{O}(d^{2}H^5/text{gap}_{text{min}}cdot log^3(T))$ regret, where $d$ is the dimension of feature mapping, $H$ is the length of episode, $text{gap}_{text{min}}$ is the minimal sub-optimality gap, and $tilde O$ hides all logarithmic terms except $log(T)$. To the best of our knowledge, these are the first logarithmic regret bounds for RL with linear function approximation. We also establish gap-dependent lower bounds for the two linear MDP models.

التعلم الآلي التحسين والتحكم التعلم الالي

Provably Efficient Reinforcement Learning with Linear Function Approximation Under Adaptivity Constraints

97 - Tianhao Wang , Dongruo Zhou , Quanquan Gu 2021

We study reinforcement learning (RL) with linear function approximation under the adaptivity constraint. We consider two popular limited adaptivity models: batch learning model and rare policy switch model, and propose two efficient online RL algorit hms for linear Markov decision processes. In specific, for the batch learning model, our proposed LSVI-UCB-Batch algorithm achieves an $tilde O(sqrt{d^3H^3T} + dHT/B)$ regret, where $d$ is the dimension of the feature mapping, $H$ is the episode length, $T$ is the number of interactions and $B$ is the number of batches. Our result suggests that it suffices to use only $sqrt{T/dH}$ batches to obtain $tilde O(sqrt{d^3H^3T})$ regret. For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $tilde O(sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret, which implies that $dHlog T$ policy switches suffice to obtain the $tilde O(sqrt{d^3H^3T})$ regret. Our algorithms achieve the same regret as the LSVI-UCB algorithm (Jin et al., 2019), yet with a substantially smaller amount of adaptivity.

التعلم الآلي التحسين والتحكم التعلم الالي

Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation

114 - Jiafan He , Dongruo Zhou , Quanquan Gu 2021

We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback, where the unknown transition probability function is a linear function of a given feature mapping. We pro pose an optimistic policy optimization algorithm with Bernstein bonus and show that it can achieve $tilde{O}(dHsqrt{T})$ regret, where $H$ is the length of the episode, $T$ is the number of interaction with the MDP and $d$ is the dimension of the feature mapping. Furthermore, we also prove a matching lower bound of $tilde{Omega}(dHsqrt{T})$ up to logarithmic factors. To the best of our knowledge, this is the first computationally efficient, nearly minimax optimal algorithm for adversarial Markov decision processes with linear function approximation.

التعلم الآلي التحسين والتحكم التعلم الالي

Distributional reinforcement learning with linear function approximation

277 - Marc G. Bellemare , Nicolas Le Roux , Pablo Samuel Castro 2019

Despite many algorithmic advances, our theoretical understanding of practical distributional reinforcement learning methods remains limited. One exception is Rowland et al. (2018)s analysis of the C51 algorithm in terms of the Cramer distance, but th eir results only apply to the tabular setting and ignore C51s use of a softmax to produce normalized distributions. In this paper we adapt the Cramer distance to deal with arbitrary vectors. From it we derive a new distributional algorithm which is fully Cramer-based and can be combined to linear function approximation, with formal guarantees in the context of policy evaluation. In allowing the models prediction to be any real vector, we lose the probabilistic interpretation behind the method, but otherwise maintain the appealing properties of distributional approaches. To the best of our knowledge, ours is the first proof of convergence of a distributional algorithm combined with function approximation. Perhaps surprisingly, our results provide evidence that Cramer-based distributional methods may perform worse than directly approximating the value function.

التعلم الآلي التعلم الالي

Safe Reinforcement Learning with Linear Function Approximation

117 - Sanae Amani , Christos Thrampoulidis , Lin F. Yang 2021

Safety in reinforcement learning has become increasingly important in recent years. Yet, existing solutions either fail to strictly avoid choosing unsafe actions, which may lead to catastrophic results in safety-critical systems, or fail to provide r egret guarantees for settings where safety constraints need to be learned. In this paper, we address both problems by first modeling safety as an unknown linear cost function of states and actions, which must always fall below a certain threshold. We then present algorithms, termed SLUCB-QVI and RSLUCB-QVI, for episodic Markov decision processes (MDPs) with linear function approximation. We show that SLUCB-QVI and RSLUCB-QVI, while with emph{no safety violation}, achieve a $tilde{mathcal{O}}left(kappasqrt{d^3H^3T}right)$ regret, nearly matching that of state-of-the-art unsafe algorithms, where $H$ is the duration of each episode, $d$ is the dimension of the feature mapping, $kappa$ is a constant characterizing the safety constraints, and $T$ is the total number of action plays. We further present numerical simulations that corroborate our theoretical findings.

التعلم الآلي التعلم الالي