TDprop: Does Jacobi Preconditioning Help Temporal Difference Learning?

89 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Joshua Romoff

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية الاحصاء الرياضي

والبحث باللغة English

تأليف Joshua Romoff - Peter Henderson - David Kanaa

التعلم الآلي التعلم الالي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We investigate whether Jacobi preconditioning, accounting for the bootstrap term in temporal difference (TD) learning, can help boost performance of adaptive optimizers. Our method, TDprop, computes a per parameter learning rate based on the diagonal preconditioning of the TD update rule. We show how this can be used in both $n$-step returns and TD($lambda$). Our theoretical findings demonstrate that including this additional preconditioning information is, surprisingly, comparable to normal semi-gradient TD if the optimal learning rate is found for both via a hyperparameter search. In Deep RL experiments using Expected SARSA, TDprop meets or exceeds the performance of Adam in all tested games under near-optimal learning rates, but a well-tuned SGD can yield similar improvements -- matching our theory. Our findings suggest that Jacobi preconditioning may improve upon typical adaptive optimization methods in Deep RL, but despite incorporating additional information from the TD bootstrap term, may not always be better than SGD.

قيم البحث

64 - Emmanuel Bengio , Joelle Pineau , Doina Precup 2021

A common optimization tool used in deep reinforcement learning is momentum, which consists in accumulating and discounting past gradients, reapplying them at each iteration. We argue that, unlike in supervised learning, momentum in Temporal Differenc e (TD) learning accumulates gradients that become doubly stale: not only does the gradient of the loss change due to parameter updates, the loss itself changes due to bootstrapping. We first show that this phenomenon exists, and then propose a first-order correction term to momentum. We show that this correction term improves sample efficiency in policy evaluation by correcting target value drift. An important insight of this work is that deep RL methods are not always best served by directly importing techniques from the supervised setting.

التعلم الآلي التعلم الالي

When Does Preconditioning Help or Hurt Generalization?

111 - Shun-ichi Amari , Jimmy Ba , Roger Grosse 2020

While second order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization has been called into question. This work presents a more nuanced view on how the textit{implicit bias} of first- and seco nd-order methods affects the comparison of generalization properties. We provide an exact asymptotic bias-variance decomposition of the generalization error of overparameterized ridgeless regression under a general class of preconditioner $boldsymbol{P}$, and consider the inverse population Fisher information matrix (used in NGD) as a particular example. We determine the optimal $boldsymbol{P}$ for both the bias and variance, and find that the relative generalization performance of different optimizers depends on the label noise and the shape of the signal (true parameters): when the labels are noisy, the model is misspecified, or the signal is misaligned with the features, NGD can achieve lower risk; conversely, GD generalizes better than NGD under clean labels, a well-specified model, or aligned signal. Based on this analysis, we discuss several approaches to manage the bias-variance tradeoff, and the potential benefit of interpolating between GD and NGD. We then extend our analysis to regression in the reproducing kernel Hilbert space and demonstrate that preconditioned GD can decrease the population risk faster than GD. Lastly, we empirically compare the generalization error of first- and second-order optimizers in neural network experiments, and observe robust trends matching our theoretical analysis.

التعلم الالي التعلم الآلي

How Does Mixup Help With Robustness and Generalization?

210 - Linjun Zhang , Zhun Deng , Kenji Kawaguchi 2020

Mixup is a popular data augmentation technique based on taking convex combinations of pairs of examples and their labels. This simple technique has been shown to substantially improve both the robustness and the generalization of the trained model. H owever, it is not well-understood why such improvement occurs. In this paper, we provide theoretical analysis to demonstrate how using Mixup in training helps model robustness and generalization. For robustness, we show that minimizing the Mixup loss corresponds to approximately minimizing an upper bound of the adversarial loss. This explains why models obtained by Mixup training exhibits robustness to several kinds of adversarial attacks such as Fast Gradient Sign Method (FGSM). For generalization, we prove that Mixup augmentation corresponds to a specific type of data-adaptive regularization which reduces overfitting. Our analysis provides new insights and a framework to understand Mixup.

التعلم الآلي التعلم الالي

Preconditioned Temporal Difference Learning

487 - Yao HengShuai 2012

This paper has been withdrawn by the author. This draft is withdrawn for its poor quality in english, unfortunately produced by the author when he was just starting his science route. Look at the ICML version instead: http://icml2008.cs.helsinki.fi/papers/111.pdf

التعلم الآلي الذكاء الاصطناعي

Preferential Temporal Difference Learning

79 - Nishanth Anand , Doina Precup 2021

Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When t he agent lands in a state, its value can be used to compute the TD-error, which is then propagated to other states. However, it may be interesting, when computing updates, to take into account other information than whether a state is visited or not. For example, some states might be more important than others (such as states which are frequently seen in a successful trajectory). Or, some states might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach to re-weighting states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.

التعلم الآلي الذكاء الاصطناعي