Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Parameter-free Gradient Temporal Difference Learning

131 0 0.0 ( 0 )

Download Cite

Added by Andrew Jacobsen

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Andrew Jacobsen - Alan Chan

Machine Learning Artificial Intelligence

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Reinforcement learning lies at the intersection of several challenges. Many applications of interest involve extremely large state spaces, requiring function approximation to enable tractable computation. In addition, the learner has only a single stream of experience with which to evaluate a large number of possible courses of action, necessitating algorithms which can learn off-policy. However, the combination of off-policy learning with function approximation leads to divergence of temporal difference methods. Recent work into gradient-based temporal difference methods has promised a path to stability, but at the cost of expensive hyperparameter tuning. In parallel, progress in online learning has provided parameter-free methods that achieve minimax optimal guarantees up to logarithmic terms, but their application in reinforcement learning has yet to be explored. In this work, we combine these two lines of attack, deriving parameter-free, gradient-based temporal difference algorithms. Our algorithms run in linear time and achieve high-probability convergence guarantees matching those of GTD2 up to $log$ factors. Our experiments demonstrate that our methods maintain high prediction performance relative to fully-tuned baselines, with no tuning whatsoever.

rate research

Gradient Temporal-Difference Learning with Regularized Corrections

76 - Sina Ghiassian , Andrew Patterson , Shivam Garg 2020

It is still common to use Q-learning and temporal difference (TD) learning-even though they have divergence issues and sound Gradient TD alternatives exist-because divergence seems rare and they typically perform well. However, recent work with large neural network learning systems reveals that instability is more common than previously thought. Practitioners face a difficult dilemma: choose an easy to use and performant TD method, or a more complex algorithm that is more sound but harder to tune and all but unexplored with non-linear function approximation or control. In this paper, we introduce a new method called TD with Regularized Corrections (TDRC), that attempts to balance ease of use, soundness, and performance. It behaves as well as TD, when TD performs well, but is sound in cases where TD diverges. We empirically investigate TDRC across a range of problems, for both prediction and control, and for both linear and non-linear function approximation, and show, potentially for the first time, that gradient TD methods could be a better alternative to TD and Q-learning.

Machine Learning Artificial Intelligence Machine Learning

Preconditioned Temporal Difference Learning

488 - Yao HengShuai 2012

This paper has been withdrawn by the author. This draft is withdrawn for its poor quality in english, unfortunately produced by the author when he was just starting his science route. Look at the ICML version instead: http://icml2008.cs.helsinki.fi/papers/111.pdf

Machine Learning Artificial Intelligence

Preferential Temporal Difference Learning

79 - Nishanth Anand , Doina Precup 2021

Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When the agent lands in a state, its value can be used to compute the TD-error, which is then propagated to other states. However, it may be interesting, when computing updates, to take into account other information than whether a state is visited or not. For example, some states might be more important than others (such as states which are frequently seen in a successful trajectory). Or, some states might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach to re-weighting states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.

Machine Learning Artificial Intelligence

Adaptive Lambda Least-Squares Temporal Difference Learning

73 - Timothy A. Mann , Hugo Penedones , Shie Mannor 2016

Temporal Difference learning or TD($lambda$) is a fundamental algorithm in the field of reinforcement learning. However, setting TDs $lambda$ parameter, which controls the timescale of TD updates, is generally left up to the practitioner. We formalize the $lambda$ selection problem as a bias-variance trade-off where the solution is the value of $lambda$ that leads to the smallest Mean Squared Value Error (MSVE). To solve this trade-off we suggest applying Leave-One-Trajectory-Out Cross-Validation (LOTO-CV) to search the space of $lambda$ values. Unfortunately, this approach is too computationally expensive for most practical applications. For Least Squares TD (LSTD) we show that LOTO-CV can be implemented efficiently to automatically tune $lambda$ and apply function optimization methods to efficiently search the space of $lambda$ values. The resulting algorithm, ALLSTD, is parameter free and our experiments demonstrate that ALLSTD is significantly computationally faster than the na{i}ve LOTO-CV implementation while achieving similar performance.

Machine Learning Artificial Intelligence Machine Learning

META-Learning Eligibility Traces for More Sample Efficient Temporal Difference Learning

96 - Mingde Zhao 2020

Temporal-Difference (TD) learning is a standard and very successful reinforcement learning approach, at the core of both algorithms that learn the value of a given policy, as well as algorithms which learn how to improve policies. TD-learning with eligibility traces provides a way to do temporal credit assignment, i.e. decide which portion of a reward should be assigned to predecessor states that occurred at different previous times, controlled by a parameter $lambda$. However, tuning this parameter can be time-consuming, and not tuning it can lead to inefficient learning. To improve the sample efficiency of TD-learning, we propose a meta-learning method for adjusting the eligibility trace parameter, in a state-dependent manner. The adaptation is achieved with the help of auxiliary learners that learn distributional information about the update targets online, incurring roughly the same computational complexity per step as the usual value learner. Our approach can be used both in on-policy and off-policy learning. We prove that, under some assumptions, the proposed method improves the overall quality of the update targets, by minimizing the overall target error. This method can be viewed as a plugin which can also be used to assist prediction with function approximation by meta-learning feature (observation)-based $lambda$ online, or even in the control case to assist policy improvement. Our empirical evaluation demonstrates significant performance improvements, as well as improved robustness of the proposed algorithm to learning rate variation.

Machine Learning Artificial Intelligence Machine Learning

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Parameter-free Gradient Temporal Difference Learning

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions