Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Should All Temporal Difference Learning Use Emphasis?

123 0 0.0 ( 0 )

Download Cite

Added by Xiang Gu

Publication date 2019

fields Informatics Engineering

and research's language is English

Authors Xiang Gu - Sina Ghiassian - Richard S. Sutton

Artificial Intelligence Machine Learning

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Emphatic Temporal Difference (ETD) learning has recently been proposed as a convergent off-policy learning method. ETD was proposed mainly to address convergence issues of conventional Temporal Difference (TD) learning under off-policy training but it is different from conventional TD learning even under on-policy training. A simple counterexample provided back in 2017 pointed to a potential class of problems where ETD converges but TD diverges. In this paper, we empirically show that ETD converges on a few other well-known on-policy experiments whereas TD either diverges or performs poorly. We also show that ETD outperforms TD on the mountain car prediction problem. Our results, together with a similar pattern observed under off-policy training in prior works, suggest that ETD might be a good substitute over conventional TD.

rate research

True Online Temporal-Difference Learning

111 - Harm van Seijen , A. Rupam Mahmood , Patrick M. Pilarski andn Marlos C. Machado 2015

The temporal-difference methods TD($lambda$) and Sarsa($lambda$) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, n

Artificial Intelligence Machine Learning

Temporal Difference Uncertainties as a Signal for Exploration

195 - Sebastian Flennerhag , Jane X. Wang , Pablo Sprechmann 2020

An effective approach to exploration in reinforcement learning is to rely on an agents uncertainty over the optimal policy, which can yield near-optimal exploration strategies in tabular settings. However, in non-tabular settings that involve function approximators, obtaining accurate uncertainty estimates is almost as challenging a problem. In this paper, we highlight that value estimates are easily biased and temporally inconsistent. In light of this, we propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors. This exploration signal controls for state-action transitions so as to isolate uncertainty in value that is due to uncertainty over the agents parameters. Because our measure of uncertainty conditions on state-action transitions, we cannot act on this measure directly. Instead, we incorporate it as an intrinsic reward and treat exploration as a separate learning problem, induced by the agents temporal difference uncertainties. We introduce a distinct exploration policy that learns to collect data with high estimated uncertainty, which gives rise to a curriculum that smoothly changes throughout learning and vanishes in the limit of perfect value estimates. We evaluate our method on hard exploration tasks, including Deep Sea and Atari 2600 environments and find that our proposed form of exploration facilitates both diverse and deep exploration.

Artificial Intelligence Machine Learning Machine Learning

Effective Multi-step Temporal-Difference Learning for Non-Linear Function Approximation

204 - Harm van Seijen 2016

Multi-step temporal-difference (TD) learning, where the update targets contain information from multiple time steps ahead, is one of the most popular forms of TD learning for linear function approximation. The reason is that multi-step methods often yield substantially better performance than their single-step counter-parts, due to a lower bias of the update targets. For non-linear function approximation, however, single-step methods appear to be the norm. Part of the reason could be that on many domains the popular multi-step methods TD($lambda$) and Sarsa($lambda$) do not perform well when combined with non-linear function approximation. In particular, they are very susceptible to divergence of value estimates. In this paper, we identify the reason behind this. Furthermore, based on our analysis, we propose a new multi-step TD method for non-linear function approximation that addresses this issue. We confirm the effectiveness of our method using two benchmark tasks with neural networks as function approximation.

Artificial Intelligence

Preconditioned Temporal Difference Learning

535 - Yao HengShuai 2012

This paper has been withdrawn by the author. This draft is withdrawn for its poor quality in english, unfortunately produced by the author when he was just starting his science route. Look at the ICML version instead: http://icml2008.cs.helsinki.fi/papers/111.pdf

Machine Learning Artificial Intelligence

Preferential Temporal Difference Learning

79 - Nishanth Anand , Doina Precup 2021

Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When the agent lands in a state, its value can be used to compute the TD-error, which is then propagated to other states. However, it may be interesting, when computing updates, to take into account other information than whether a state is visited or not. For example, some states might be more important than others (such as states which are frequently seen in a successful trajectory). Or, some states might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach to re-weighting states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.

Machine Learning Artificial Intelligence

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Should All Temporal Difference Learning Use Emphasis?

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions