Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

True Online Temporal-Difference Learning

112 0 0.0 ( 0 )

Download Cite

Added by Harm van Seijen

Publication date 2015

fields Informatics Engineering

and research's language is English

Authors Harm van Seijen - A. Rupam Mahmood - Patrick M. Pilarski andn Marlos C. Machado

Artificial Intelligence Machine Learning

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

The temporal-difference methods TD($lambda$) and Sarsa($lambda$) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, n

rate research

Should All Temporal Difference Learning Use Emphasis?

122 - Xiang Gu , Sina Ghiassian , Richard S. Sutton 2019

Emphatic Temporal Difference (ETD) learning has recently been proposed as a convergent off-policy learning method. ETD was proposed mainly to address convergence issues of conventional Temporal Difference (TD) learning under off-policy training but it is different from conventional TD learning even under on-policy training. A simple counterexample provided back in 2017 pointed to a potential class of problems where ETD converges but TD diverges. In this paper, we empirically show that ETD converges on a few other well-known on-policy experiments whereas TD either diverges or performs poorly. We also show that ETD outperforms TD on the mountain car prediction problem. Our results, together with a similar pattern observed under off-policy training in prior works, suggest that ETD might be a good substitute over conventional TD.

Artificial Intelligence Machine Learning

An Empirical Evaluation of True Online TD({lambda})

457 - Harm van Seijen , A. Rupam Mahmood , Patrick M. Pilarski 2015

The true online TD({lambda}) algorithm has recently been proposed (van Seijen and Sutton, 2014) as a universal replacement for the popular TD({lambda}) algorithm, in temporal-difference learning and reinforcement learning. True online TD({lambda}) has better theoretical properties than conventional TD({lambda}), and the expectation is that it also results in faster learning. In this paper, we put this hypothesis to the test. Specifically, we compare the performance of true online TD({lambda}) with that of TD({lambda}) on challenging examples, random Markov reward processes, and a real-world myoelectric prosthetic arm. We use linear function approximation with tabular, binary, and non-binary features. We assess the algorithms along three dimensions: computational cost, learning speed, and ease of use. Our results confirm the strength of true online TD({lambda}): 1) for sparse feature vectors, the computational overhead with respect to TD({lambda}) is minimal; for non-sparse features the computation time is at most twice that of TD({lambda}), 2) across all domains/representations the learning speed of true online TD({lambda}) is often better, but never worse than that of TD({lambda}), and 3) true online TD({lambda}) is easier to use, because it does not require choosing between trace types, and it is generally more stable with respect to the step-size. Overall, our results suggest that true online TD({lambda}) should be the first choice when looking for an efficient, general-purpose TD method.

Artificial Intelligence Machine Learning Machine Learning

Temporal Difference Uncertainties as a Signal for Exploration

195 - Sebastian Flennerhag , Jane X. Wang , Pablo Sprechmann 2020

An effective approach to exploration in reinforcement learning is to rely on an agents uncertainty over the optimal policy, which can yield near-optimal exploration strategies in tabular settings. However, in non-tabular settings that involve function approximators, obtaining accurate uncertainty estimates is almost as challenging a problem. In this paper, we highlight that value estimates are easily biased and temporally inconsistent. In light of this, we propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors. This exploration signal controls for state-action transitions so as to isolate uncertainty in value that is due to uncertainty over the agents parameters. Because our measure of uncertainty conditions on state-action transitions, we cannot act on this measure directly. Instead, we incorporate it as an intrinsic reward and treat exploration as a separate learning problem, induced by the agents temporal difference uncertainties. We introduce a distinct exploration policy that learns to collect data with high estimated uncertainty, which gives rise to a curriculum that smoothly changes throughout learning and vanishes in the limit of perfect value estimates. We evaluate our method on hard exploration tasks, including Deep Sea and Atari 2600 environments and find that our proposed form of exploration facilitates both diverse and deep exploration.

Artificial Intelligence Machine Learning Machine Learning

Effective Multi-step Temporal-Difference Learning for Non-Linear Function Approximation

204 - Harm van Seijen 2016

Multi-step temporal-difference (TD) learning, where the update targets contain information from multiple time steps ahead, is one of the most popular forms of TD learning for linear function approximation. The reason is that multi-step methods often yield substantially better performance than their single-step counter-parts, due to a lower bias of the update targets. For non-linear function approximation, however, single-step methods appear to be the norm. Part of the reason could be that on many domains the popular multi-step methods TD($lambda$) and Sarsa($lambda$) do not perform well when combined with non-linear function approximation. In particular, they are very susceptible to divergence of value estimates. In this paper, we identify the reason behind this. Furthermore, based on our analysis, we propose a new multi-step TD method for non-linear function approximation that addresses this issue. We confirm the effectiveness of our method using two benchmark tasks with neural networks as function approximation.

Artificial Intelligence

Preconditioned Temporal Difference Learning

498 - Yao HengShuai 2012

This paper has been withdrawn by the author. This draft is withdrawn for its poor quality in english, unfortunately produced by the author when he was just starting his science route. Look at the ICML version instead: http://icml2008.cs.helsinki.fi/papers/111.pdf

Machine Learning Artificial Intelligence

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

True Online Temporal-Difference Learning

Ask ChatGPT about the research

No Arabic abstract

The temporal-difference methods TD($lambda$) and Sarsa($lambda$) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, n

Read More

suggested questions