Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation

286 0 0.0 ( 0 )

Download Cite

Added by Quanquan Gu

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Jiafan He - Dongruo Zhou - Quanquan Gu

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We study reinforcement learning (RL) with linear function approximation. Existing algorithms for this problem only have high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees, which cannot guarantee the convergence to the optimal policy. In this paper, in order to overcome the limitation of existing algorithms, we propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability. The uniform-PAC guarantee is the strongest possible guarantee for reinforcement learning in the literature, which can directly imply both PAC and high probability regret bounds, making our algorithm superior to all existing algorithms with linear function approximation. At the core of our algorithm is a novel minimax value function estimator and a multi-level partition scheme to select the training samples from historical observations. Both of these techniques are new and of independent interest.

rate research

Logarithmic Regret for Reinforcement Learning with Linear Function Approximation

93 - Jiafan He , Dongruo Zhou , Quanquan Gu 2020

Reinforcement learning (RL) with linear function approximation has received increasing attention recently. However, existing work has focused on obtaining $sqrt{T}$-type regret bound, where $T$ is the number of interactions with the MDP. In this paper, we show that logarithmic regret is attainable under two recently proposed linear MDP assumptions provided that there exists a positive sub-optimality gap for the optimal action-value function. More specifically, under the linear MDP assumption (Jin et al. 2019), the LSVI-UCB algorithm can achieve $tilde{O}(d^{3}H^5/text{gap}_{text{min}}cdot log(T))$ regret; and under the linear mixture MDP assumption (Ayoub et al. 2020), the UCRL-VTR algorithm can achieve $tilde{O}(d^{2}H^5/text{gap}_{text{min}}cdot log^3(T))$ regret, where $d$ is the dimension of feature mapping, $H$ is the length of episode, $text{gap}_{text{min}}$ is the minimal sub-optimality gap, and $tilde O$ hides all logarithmic terms except $log(T)$. To the best of our knowledge, these are the first logarithmic regret bounds for RL with linear function approximation. We also establish gap-dependent lower bounds for the two linear MDP models.

Machine Learning Optimization and Control Machine Learning

Provably Efficient Reinforcement Learning with Linear Function Approximation Under Adaptivity Constraints

97 - Tianhao Wang , Dongruo Zhou , Quanquan Gu 2021

We study reinforcement learning (RL) with linear function approximation under the adaptivity constraint. We consider two popular limited adaptivity models: batch learning model and rare policy switch model, and propose two efficient online RL algorithms for linear Markov decision processes. In specific, for the batch learning model, our proposed LSVI-UCB-Batch algorithm achieves an $tilde O(sqrt{d^3H^3T} + dHT/B)$ regret, where $d$ is the dimension of the feature mapping, $H$ is the episode length, $T$ is the number of interactions and $B$ is the number of batches. Our result suggests that it suffices to use only $sqrt{T/dH}$ batches to obtain $tilde O(sqrt{d^3H^3T})$ regret. For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $tilde O(sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret, which implies that $dHlog T$ policy switches suffice to obtain the $tilde O(sqrt{d^3H^3T})$ regret. Our algorithms achieve the same regret as the LSVI-UCB algorithm (Jin et al., 2019), yet with a substantially smaller amount of adaptivity.

Machine Learning Optimization and Control Machine Learning

Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation

114 - Jiafan He , Dongruo Zhou , Quanquan Gu 2021

We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback, where the unknown transition probability function is a linear function of a given feature mapping. We propose an optimistic policy optimization algorithm with Bernstein bonus and show that it can achieve $tilde{O}(dHsqrt{T})$ regret, where $H$ is the length of the episode, $T$ is the number of interaction with the MDP and $d$ is the dimension of the feature mapping. Furthermore, we also prove a matching lower bound of $tilde{Omega}(dHsqrt{T})$ up to logarithmic factors. To the best of our knowledge, this is the first computationally efficient, nearly minimax optimal algorithm for adversarial Markov decision processes with linear function approximation.

Machine Learning Optimization and Control Machine Learning

Distributional reinforcement learning with linear function approximation

277 - Marc G. Bellemare , Nicolas Le Roux , Pablo Samuel Castro 2019

Despite many algorithmic advances, our theoretical understanding of practical distributional reinforcement learning methods remains limited. One exception is Rowland et al. (2018)s analysis of the C51 algorithm in terms of the Cramer distance, but their results only apply to the tabular setting and ignore C51s use of a softmax to produce normalized distributions. In this paper we adapt the Cramer distance to deal with arbitrary vectors. From it we derive a new distributional algorithm which is fully Cramer-based and can be combined to linear function approximation, with formal guarantees in the context of policy evaluation. In allowing the models prediction to be any real vector, we lose the probabilistic interpretation behind the method, but otherwise maintain the appealing properties of distributional approaches. To the best of our knowledge, ours is the first proof of convergence of a distributional algorithm combined with function approximation. Perhaps surprisingly, our results provide evidence that Cramer-based distributional methods may perform worse than directly approximating the value function.

Machine Learning Machine Learning

Safe Reinforcement Learning with Linear Function Approximation

117 - Sanae Amani , Christos Thrampoulidis , Lin F. Yang 2021

Safety in reinforcement learning has become increasingly important in recent years. Yet, existing solutions either fail to strictly avoid choosing unsafe actions, which may lead to catastrophic results in safety-critical systems, or fail to provide regret guarantees for settings where safety constraints need to be learned. In this paper, we address both problems by first modeling safety as an unknown linear cost function of states and actions, which must always fall below a certain threshold. We then present algorithms, termed SLUCB-QVI and RSLUCB-QVI, for episodic Markov decision processes (MDPs) with linear function approximation. We show that SLUCB-QVI and RSLUCB-QVI, while with emph{no safety violation}, achieve a $tilde{mathcal{O}}left(kappasqrt{d^3H^3T}right)$ regret, nearly matching that of state-of-the-art unsafe algorithms, where $H$ is the duration of each episode, $d$ is the dimension of the feature mapping, $kappa$ is a constant characterizing the safety constraints, and $T$ is the total number of action plays. We further present numerical simulations that corroborate our theoretical findings.

Machine Learning Machine Learning

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions