Model-Augmented Q-learning

51 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Youngmin Oh

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Youngmin Oh - Jinwoo Shin - Eunho Yang

التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

In recent years, $Q$-learning has become indispensable for model-free reinforcement learning (MFRL). However, it suffers from well-known problems such as under- and overestimation bias of the value, which may adversely affect the policy learning. To resolve this issue, we propose a MFRL framework that is augmented with the components of model-based RL. Specifically, we propose to estimate not only the $Q$-values but also both the transition and the reward with a shared network. We further utilize the estimated reward from the model estimators for $Q$-learning, which promotes interaction between the estimators. We show that the proposed scheme, called Model-augmented $Q$-learning (MQL), obtains a policy-invariant solution which is identical to the solution obtained by learning with true reward. Finally, we also provide a trick to prioritize past experiences in the replay buffer by utilizing model-estimation errors. We experimentally validate MQL built upon state-of-the-art off-policy MFRL methods, and show that MQL largely improves their performance and convergence. The proposed scheme is simple to implement and does not require additional training cost.

قيم البحث

اقرأ أيضاً

Augmented Q Imitation Learning (AQIL)

184 - Xiao Lei Zhang , Anish Agarwal 2020

The study of unsupervised learning can be generally divided into two categories: imitation learning and reinforcement learning. In imitation learning the machine learns by mimicking the behavior of an expert system whereas in reinforcement learning t he machine learns via direct environment feedback. Traditional deep reinforcement learning takes a significant time before the machine starts to converge to an optimal policy. This paper proposes Augmented Q-Imitation-Learning, a method by which deep reinforcement learning convergence can be accelerated by applying Q-imitation-learning as the initial training process in traditional Deep Q-learning.

التعلم الآلي الذكاء الاصطناعي

Discriminator Augmented Model-Based Reinforcement Learning

334 - Behzad Haghgoo , Allan Zhou , Archit Sharma 2021

By planning through a learned dynamics model, model-based reinforcement learning (MBRL) offers the prospect of good performance with little environment interaction. However, it is common in practice for the learned model to be inaccurate, impairing p lanning and leading to poor performance. This paper aims to improve planning with an importance sampling framework that accounts and corrects for discrepancy between the true and learned dynamics. This framework also motivates an alternative objective for fitting the dynamics model: to minimize the variance of value estimation during planning. We derive and implement this objective, which encourages better prediction on trajectories with larger returns. We observe empirically that our approach improves the performance of current MBRL algorithms on two stochastic control problems, and provide a theoretical basis for our method.

التعلم الآلي الذكاء الاصطناعي

Entropy-Augmented Entropy-Regularized Reinforcement Learning and a Continuous Path from Policy Gradient to Q-Learning

126 - Donghoon Lee 2020

Entropy augmented to reward is known to soften the greedy argmax policy to softmax policy. Entropy augmentation is reformulated and leads to a motivation to introduce an additional entropy term to the objective function in the form of KL-divergence t o regularize optimization process. It results in a policy which monotonically improves while interpolating from the current policy to the softmax greedy policy. This policy is used to build a continuously parameterized algorithm which optimize policy and Q-function simultaneously and whose extreme limits correspond to policy gradient and Q-learning, respectively. Experiments show that there can be a performance gain using an intermediate algorithm.

التعلم الآلي التعلم الالي

Memory Augmented Optimizers for Deep Learning

113 - Paul-Aymeric McRae , Prasanna Parthasarathi , Mahmoud Assran 2021

Popular approaches for minimizing loss in data-driven learning often involve an abstraction or an explicit retention of the history of gradients for efficient parameter updates. The aggregated history of gradients nudges the parameter updates in the right direction even when the gradients at any given step are not informative. Although the history of gradients summarized in meta-parameters or explicitly stored in memory has been shown effective in theory and practice, the question of whether $all$ or only a subset of the gradients in the history are sufficient in deciding the parameter updates remains unanswered. In this paper, we propose a framework of memory-augmented gradient descent optimizers that retain a limited view of their gradient history in their internal memory. Such optimizers scale well to large real-life datasets, and our experiments show that the memory augmented extensions of standard optimizers enjoy accelerated convergence and improved performance on a majority of computer vision and language tasks that we considered. Additionally, we prove that the proposed class of optimizers with fixed-size memory converge under assumptions of strong convexity, regardless of which gradients are selected or how they are linearly combined to form the update step.

التعلم الآلي التحسين والتحكم

Q-Learning with Differential Entropy of Q-Tables

153 - Tung D. Nguyen , Kathryn E. Kasmarik , Hussein A. Abbass 2020

It is well-known that information loss can occur in the classic and simple Q-learning algorithm. Entropy-based policy search methods were introduced to replace Q-learning and to design algorithms that are more robust against information loss. We conj ecture that the reduction in performance during prolonged training sessions of Q-learning is caused by a loss of information, which is non-transparent when only examining the cumulative reward without changing the Q-learning algorithm itself. We introduce Differential Entropy of Q-tables (DE-QT) as an external information loss detector to the Q-learning algorithm. The behaviour of DE-QT over training episodes is analyzed to find an appropriate stopping criterion during training. The results reveal that DE-QT can detect the most appropriate stopping point, where a balance between a high success rate and a high efficiency is met for classic Q-Learning algorithm.

التعلم الآلي التعلم الالي