Should All Temporal Difference Learning Use Emphasis?

123 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Xiang Gu

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Xiang Gu - Sina Ghiassian - Richard S. Sutton

الذكاء الاصطناعي التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Emphatic Temporal Difference (ETD) learning has recently been proposed as a convergent off-policy learning method. ETD was proposed mainly to address convergence issues of conventional Temporal Difference (TD) learning under off-policy training but it is different from conventional TD learning even under on-policy training. A simple counterexample provided back in 2017 pointed to a potential class of problems where ETD converges but TD diverges. In this paper, we empirically show that ETD converges on a few other well-known on-policy experiments whereas TD either diverges or performs poorly. We also show that ETD outperforms TD on the mountain car prediction problem. Our results, together with a similar pattern observed under off-policy training in prior works, suggest that ETD might be a good substitute over conventional TD.

قيم البحث

111 - Harm van Seijen , A. Rupam Mahmood , Patrick M. Pilarski andn Marlos C. Machado 2015

The temporal-difference methods TD($lambda$) and Sarsa($lambda$) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, n

الذكاء الاصطناعي التعلم الآلي

Temporal Difference Uncertainties as a Signal for Exploration

195 - Sebastian Flennerhag , Jane X. Wang , Pablo Sprechmann 2020

An effective approach to exploration in reinforcement learning is to rely on an agents uncertainty over the optimal policy, which can yield near-optimal exploration strategies in tabular settings. However, in non-tabular settings that involve functio n approximators, obtaining accurate uncertainty estimates is almost as challenging a problem. In this paper, we highlight that value estimates are easily biased and temporally inconsistent. In light of this, we propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors. This exploration signal controls for state-action transitions so as to isolate uncertainty in value that is due to uncertainty over the agents parameters. Because our measure of uncertainty conditions on state-action transitions, we cannot act on this measure directly. Instead, we incorporate it as an intrinsic reward and treat exploration as a separate learning problem, induced by the agents temporal difference uncertainties. We introduce a distinct exploration policy that learns to collect data with high estimated uncertainty, which gives rise to a curriculum that smoothly changes throughout learning and vanishes in the limit of perfect value estimates. We evaluate our method on hard exploration tasks, including Deep Sea and Atari 2600 environments and find that our proposed form of exploration facilitates both diverse and deep exploration.

الذكاء الاصطناعي التعلم الآلي التعلم الالي

Effective Multi-step Temporal-Difference Learning for Non-Linear Function Approximation

204 - Harm van Seijen 2016

Multi-step temporal-difference (TD) learning, where the update targets contain information from multiple time steps ahead, is one of the most popular forms of TD learning for linear function approximation. The reason is that multi-step methods often yield substantially better performance than their single-step counter-parts, due to a lower bias of the update targets. For non-linear function approximation, however, single-step methods appear to be the norm. Part of the reason could be that on many domains the popular multi-step methods TD($lambda$) and Sarsa($lambda$) do not perform well when combined with non-linear function approximation. In particular, they are very susceptible to divergence of value estimates. In this paper, we identify the reason behind this. Furthermore, based on our analysis, we propose a new multi-step TD method for non-linear function approximation that addresses this issue. We confirm the effectiveness of our method using two benchmark tasks with neural networks as function approximation.

الذكاء الاصطناعي

Preconditioned Temporal Difference Learning

523 - Yao HengShuai 2012

This paper has been withdrawn by the author. This draft is withdrawn for its poor quality in english, unfortunately produced by the author when he was just starting his science route. Look at the ICML version instead: http://icml2008.cs.helsinki.fi/papers/111.pdf

التعلم الآلي الذكاء الاصطناعي

Preferential Temporal Difference Learning

79 - Nishanth Anand , Doina Precup 2021

Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When t he agent lands in a state, its value can be used to compute the TD-error, which is then propagated to other states. However, it may be interesting, when computing updates, to take into account other information than whether a state is visited or not. For example, some states might be more important than others (such as states which are frequently seen in a successful trajectory). Or, some states might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach to re-weighting states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.

التعلم الآلي الذكاء الاصطناعي