
A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal Regret

Posted by Mehdi Jafarnia-Jahromi
Publication date: 2020
Research language: English





Recently, model-free reinforcement learning has attracted research attention due to its simplicity, memory and computation efficiency, and the flexibility to combine with function approximation. In this paper, we propose Exploration Enhanced Q-learning (EE-QL), a model-free algorithm for infinite-horizon average-reward Markov Decision Processes (MDPs) that achieves a regret bound of $O(\sqrt{T})$ for the general class of weakly communicating MDPs, where $T$ is the number of interactions. EE-QL assumes that an online concentrating approximation of the optimal average reward is available. This is the first model-free learning algorithm that achieves $O(\sqrt{T})$ regret without the ergodic assumption, and it matches the lower bound in terms of $T$ up to logarithmic factors. Experiments show that the proposed algorithm performs as well as the best known model-based algorithms.
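
Here the regret after $T$ steps is the standard notion $\mathrm{Reg}_T = T\rho^* - \sum_{t=1}^{T} r_t$, where $\rho^*$ is the optimal average reward. As a rough illustration of the kind of update such model-free average-reward methods build on (a generic relative Q-learning sketch, not EE-QL itself: the step sizes, exploration mechanism, and construction of the concentrating estimate `rho_hat` are specified in the paper):

```python
import numpy as np

def relative_q_step(Q, s, a, r, s_next, rho_hat, alpha):
    """One relative Q-learning update for an average-reward MDP.

    Q       : (num_states, num_actions) array of Q-value estimates
    rho_hat : current estimate of the optimal average reward
              (EE-QL assumes such a concentrating estimate is available)
    alpha   : step size in (0, 1]
    """
    # Subtracting the average-reward estimate plays the role that
    # discounting plays in the discounted setting: Q then tracks the
    # relative (bias) values rather than growing linearly in time.
    td_target = r - rho_hat + np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```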


Read also

We study reinforcement learning in an infinite-horizon average-reward setting with linear function approximation, where the transition probability function of the underlying Markov Decision Process (MDP) admits a linear form over a feature mapping of the current state, action, and next state. We propose a new algorithm, UCRL2-VTR, which can be seen as an extension of the UCRL2 algorithm with linear function approximation. We show that UCRL2-VTR with a Bernstein-type bonus can achieve a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping, $T$ is the horizon, and $\sqrt{D}$ is the diameter of the MDP. We also prove a matching lower bound $\tilde{\Omega}(d\sqrt{DT})$, which suggests that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors. To the best of our knowledge, our algorithm is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting.
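
For context, the "VTR" in UCRL2-VTR refers to value-targeted regression: the unknown parameter of the linear transition model is fit by ridge regression against realized value targets. A minimal sketch of that regression step, with illustrative names (the feature construction, confidence sets, and Bernstein-type bonus are the paper's and are not reproduced here):

```python
import numpy as np

def vtr_ridge_step(Lambda, b, psi, v_next):
    """One step of ridge regression on value targets.

    psi    : d-dim feature of (s_t, a_t), integrated against the current
             value estimate: psi = sum_{s'} phi(s_t, a_t, s') * V(s')
    v_next : realized value target V(s_{t+1})
    Lambda : running Gram matrix, initialized to lambda_reg * np.eye(d)
    b      : running moment vector; the estimate is theta = Lambda^{-1} b
    """
    Lambda = Lambda + np.outer(psi, psi)    # accumulate feature covariance
    b = b + psi * v_next                    # accumulate value targets
    theta_hat = np.linalg.solve(Lambda, b)  # regularized least squares
    return Lambda, b, theta_hat
```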
We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation. Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal $\widetilde{O}(\sqrt{T})$ regret and another computationally efficient variant with $\widetilde{O}(T^{3/4})$ regret, where $T$ is the number of interactions. Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret under a different set of assumptions, improving the best existing result by Hao et al. (2020) with $\widetilde{O}(T^{2/3})$ regret. Moreover, we draw a connection between this algorithm and the Natural Policy Gradient algorithm proposed by Kakade (2002), and show that our analysis improves the sample complexity bound recently given by Agarwal et al. (2020).
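
The Natural Policy Gradient update of Kakade (2002) referenced above preconditions the vanilla policy gradient with the inverse Fisher information matrix of the policy:

```latex
\theta_{k+1} = \theta_k + \eta\, F(\theta_k)^{-1} \nabla_\theta J(\theta_k),
\qquad
F(\theta) = \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\,
            \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \right],
```

where $J(\theta)$ is the expected return of the parameterized policy $\pi_\theta$.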
Model-free reinforcement learning is known to be memory and computation efficient and more amenable to large-scale problems. In this paper, two model-free algorithms are introduced for learning infinite-horizon average-reward Markov Decision Processes (MDPs). The first algorithm reduces the problem to the discounted-reward version and achieves $\mathcal{O}(T^{2/3})$ regret after $T$ steps, under the minimal assumption of weakly communicating MDPs. To our knowledge, this is the first model-free algorithm for general MDPs in this setting. The second algorithm makes use of recent advances in adaptive algorithms for adversarial multi-armed bandits and improves the regret to $\mathcal{O}(\sqrt{T})$, albeit with a stronger ergodic assumption. This result significantly improves over the $\mathcal{O}(T^{3/4})$ regret achieved by the only existing model-free algorithm, by Abbasi-Yadkori et al. (2019a), for ergodic MDPs in the infinite-horizon average-reward setting.
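
The reduction behind the first algorithm rests on the classical connection between the discounted and average-reward criteria: as the discount $\gamma$ approaches one, the optimal discounted value admits the Laurent-type expansion

```latex
V_\gamma^*(s) \;=\; \frac{\rho^*}{1-\gamma} \;+\; h^*(s) \;+\; o(1)
\qquad \text{as } \gamma \to 1^-,
```

where $\rho^*$ is the optimal average reward (constant for weakly communicating MDPs) and $h^*$ is the bias function. Running a discounted method with effective horizon $1/(1-\gamma)$ tuned to $T$ thus trades approximation error against estimation error; the specific tuning is the paper's and is not reproduced here.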
We study reinforcement learning for finite-horizon episodic Markov decision processes with adversarial rewards and full-information feedback, where the unknown transition probability function is a linear function of a given feature mapping. We propose an optimistic policy optimization algorithm with a Bernstein bonus and show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode, $T$ is the number of interactions with the MDP, and $d$ is the dimension of the feature mapping. Furthermore, we also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors. To the best of our knowledge, this is the first computationally efficient, nearly minimax optimal algorithm for adversarial Markov decision processes with linear function approximation.
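
A "Bernstein bonus" sizes the exploration bonus with a variance-aware concentration bound rather than a worst-case (Hoeffding) one. The scalar prototype, for context: if $X_1,\dots,X_n$ are i.i.d. in $[0,1]$ with mean $\mu$ and variance $\sigma^2$, then with probability at least $1-\delta$,

```latex
\bar{X}_n - \mu \;\le\; \sqrt{\frac{2\sigma^2 \log(1/\delta)}{n}}
\;+\; \frac{2\log(1/\delta)}{3n},
```

which is tighter than Hoeffding's $\sqrt{\log(1/\delta)/(2n)}$ whenever the variance is small; the vector-valued analogue used with linear function approximation follows the paper.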
We derive a novel asymptotic problem-dependent lower bound for regret minimization in finite-horizon tabular Markov Decision Processes (MDPs). While, similar to prior work (e.g., for ergodic MDPs), the lower bound is the solution to an optimization problem, our derivation reveals the need for an additional constraint on the visitation distribution over state-action pairs that explicitly accounts for the dynamics of the MDP. We provide a characterization of our lower bound through a series of examples illustrating how different MDPs may have significantly different complexity. 1) We first consider a difficult MDP instance, where the novel constraint based on the dynamics leads to a larger lower bound (i.e., a larger regret) compared to the classical analysis. 2) We then show that our lower bound recovers results previously derived for specific MDP instances. 3) Finally, we show that, in certain simple MDPs, the lower bound is considerably smaller than in the general case and does not scale with the minimum action gap at all. We show that this last result is attainable (up to $\mathrm{poly}(H)$ terms, where $H$ is the horizon) by providing a regret upper bound based on policy gaps for an optimistic algorithm.
