Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

Posted by: Quanquan Gu
Publication date: 2021
Research field: Informatics Engineering
Paper language: English





We study reinforcement learning in an infinite-horizon average-reward setting with linear function approximation, where the transition probability function of the underlying Markov Decision Process (MDP) admits a linear form over a feature mapping of the current state, action, and next state. We propose a new algorithm UCRL2-VTR, which can be seen as an extension of the UCRL2 algorithm with linear function approximation. We show that UCRL2-VTR with Bernstein-type bonus can achieve a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping, $T$ is the horizon, and $\sqrt{D}$ is the diameter of the MDP. We also prove a matching lower bound $\tilde{\Omega}(d\sqrt{DT})$, which suggests that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors. To the best of our knowledge, our algorithm is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting.
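As background for the abstract above: in a linear mixture MDP, the transition kernel is assumed to satisfy $\mathbb{P}(s'\mid s,a)=\langle\phi(s,a,s'),\theta^*\rangle$ for a known feature mapping $\phi$ and an unknown parameter $\theta^*\in\mathbb{R}^d$. The sketch below is a minimal illustration, not the paper's implementation, of the value-targeted regression step with an elliptical-confidence exploration bonus that UCRL2-VTR-style algorithms build on; the class name `ValueTargetedRegression` and the parameters `lam` and `beta` are our own, and the constant `beta` merely stands in for the data-dependent Bernstein-type confidence radius derived in the paper.

```python
import numpy as np

# Illustrative sketch only: value-targeted ridge regression with an
# elliptical-potential bonus for a linear mixture MDP, where
# P(s' | s, a) = <phi(s, a, s'), theta*>. Not the paper's implementation.

class ValueTargetedRegression:
    def __init__(self, d: int, lam: float = 1.0, beta: float = 1.0):
        self.beta = beta                # confidence radius (constant stand-in here)
        self.Sigma = lam * np.eye(d)    # regularized Gram matrix
        self.b = np.zeros(d)            # accumulated feature-weighted targets

    def update(self, x: np.ndarray, y: float) -> None:
        """x: integrated feature sum_{s'} phi(s, a, s') * V(s') at the visited (s, a);
        y: realized target V(s_next)."""
        self.Sigma += np.outer(x, x)
        self.b += y * x

    def optimistic_estimate(self, x: np.ndarray) -> float:
        """Return <x, theta_hat> plus the bonus beta * ||x||_{Sigma^{-1}}."""
        theta_hat = np.linalg.solve(self.Sigma, self.b)
        bonus = self.beta * np.sqrt(x @ np.linalg.solve(self.Sigma, x))
        return float(x @ theta_hat + bonus)


# Toy usage with synthetic 4-dimensional features.
rng = np.random.default_rng(0)
vtr = ValueTargetedRegression(d=4)
for _ in range(200):
    x = rng.normal(size=4)
    vtr.update(x, y=float(x @ np.ones(4)) + rng.normal(scale=0.1))
print(vtr.optimistic_estimate(rng.normal(size=4)))
```

In the paper the confidence radius is data-dependent (a Bernstein-type quantity built from variance estimates) rather than a constant, which is what the phrase "Bernstein-type bonus" in the abstract refers to; the sketch only shows the shape of the regression update and the optimism computation.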




Read also

We study reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback, where the unknown transition probability function is a linear function of a given feature mapping. We propose an optimistic policy optimization algorithm with Bernstein bonus and show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode, $T$ is the number of interactions with the MDP and $d$ is the dimension of the feature mapping. Furthermore, we also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors. To the best of our knowledge, this is the first computationally efficient, nearly minimax optimal algorithm for adversarial Markov decision processes with linear function approximation.
We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation. Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal $\widetilde{O}(\sqrt{T})$ regret and another computationally efficient variant with $\widetilde{O}(T^{3/4})$ regret, where $T$ is the number of interactions. Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret under a different set of assumptions, improving the best existing result by Hao et al. (2020) with $\widetilde{O}(T^{2/3})$ regret. Moreover, we draw a connection between this algorithm and the Natural Policy Gradient algorithm proposed by Kakade (2002), and show that our analysis improves the sample complexity bound recently given by Agarwal et al. (2020).
Recently, model-free reinforcement learning has attracted research attention due to its simplicity, memory and computation efficiency, and the flexibility to combine with function approximation. In this paper, we propose Exploration Enhanced Q-learning (EE-QL), a model-free algorithm for infinite-horizon average-reward Markov Decision Processes (MDPs) that achieves a regret bound of $O(\sqrt{T})$ for the general class of weakly communicating MDPs, where $T$ is the number of interactions. EE-QL assumes that an online concentrating approximation of the optimal average reward is available. This is the first model-free learning algorithm that achieves $O(\sqrt{T})$ regret without the ergodic assumption, and matches the lower bound in terms of $T$ except for logarithmic factors. Experiments show that the proposed algorithm performs as well as the best known model-based algorithms.
We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$\gamma$, which is based on the \emph{optimism in the face of uncertainty} principle and the Bernstein-type bonus. We show that UCBVI-$\gamma$ achieves an $\tilde{O}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is nearly minimax optimal for discounted MDPs.
Reinforcement learning (RL) with linear function approximation has received increasing attention recently. However, existing work has focused on obtaining $\sqrt{T}$-type regret bounds, where $T$ is the number of interactions with the MDP. In this paper, we show that logarithmic regret is attainable under two recently proposed linear MDP assumptions, provided that there exists a positive sub-optimality gap for the optimal action-value function. More specifically, under the linear MDP assumption (Jin et al. 2019), the LSVI-UCB algorithm can achieve $\tilde{O}(d^{3}H^5/\text{gap}_{\text{min}}\cdot \log(T))$ regret; and under the linear mixture MDP assumption (Ayoub et al. 2020), the UCRL-VTR algorithm can achieve $\tilde{O}(d^{2}H^5/\text{gap}_{\text{min}}\cdot \log^3(T))$ regret, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, $\text{gap}_{\text{min}}$ is the minimal sub-optimality gap, and $\tilde{O}$ hides all logarithmic terms except $\log(T)$. To the best of our knowledge, these are the first logarithmic regret bounds for RL with linear function approximation. We also establish gap-dependent lower bounds for the two linear MDP models.
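For reference, the minimal sub-optimality gap $\text{gap}_{\text{min}}$ appearing in the last abstract is conventionally defined as below; this is the standard episodic definition rather than a quote from that paper.

```latex
% Standard definition (illustrative): per-step sub-optimality gap and its minimum,
% taken over state-action pairs whose gap is strictly positive.
\mathrm{gap}_h(s,a) = V_h^*(s) - Q_h^*(s,a),
\qquad
\mathrm{gap}_{\min} = \min_{h,s,a}\left\{ \mathrm{gap}_h(s,a) : \mathrm{gap}_h(s,a) > 0 \right\}.
```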
