ترغب بنشر مسار تعليمي؟ اضغط هنا

Effective Multi-step Temporal-Difference Learning for Non-Linear Function Approximation

205   0   0.0 ( 0 )
 نشر من قبل Harm Van Seijen
 تاريخ النشر 2016
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English
 تأليف Harm van Seijen




اسأل ChatGPT حول البحث

Multi-step temporal-difference (TD) learning, where the update targets contain information from multiple time steps ahead, is one of the most popular forms of TD learning for linear function approximation. The reason is that multi-step methods often yield substantially better performance than their single-step counter-parts, due to a lower bias of the update targets. For non-linear function approximation, however, single-step methods appear to be the norm. Part of the reason could be that on many domains the popular multi-step methods TD($lambda$) and Sarsa($lambda$) do not perform well when combined with non-linear function approximation. In particular, they are very susceptible to divergence of value estimates. In this paper, we identify the reason behind this. Furthermore, based on our analysis, we propose a new multi-step TD method for non-linear function approximation that addresses this issue. We confirm the effectiveness of our method using two benchmark tasks with neural networks as function approximation.



قيم البحث

اقرأ أيضاً

Motivated by the emerging use of multi-agent reinforcement learning (MARL) in engineering applications such as networked robotics, swarming drones, and sensor networks, we investigate the policy evaluation problem in a fully decentralized setting, us ing temporal-difference (TD) learning with linear function approximation to handle large state spaces in practice. The goal of a group of agents is to collaboratively learn the value function of a given policy from locally private rewards observed in a shared environment, through exchanging local estimates with neighbors. Despite their simplicity and widespread use, our theoretical understanding of such decentralized TD learning algorithms remains limited. Existing results were obtained based on i.i.d. data samples, or by imposing an `additional projection step to control the `gradient bias incurred by the Markovian observations. In this paper, we provide a finite-sample analysis of the fully decentralized TD(0) learning under both i.i.d. as well as Markovian samples, and prove that all local estimates converge linearly to a small neighborhood of the optimum. The resultant error bounds are the first of its type---in the sense that they hold under the most practical assumptions ---which is made possible by means of a novel multi-step Lyapunov analysis.
The temporal-difference methods TD($lambda$) and Sarsa($lambda$) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, n
Emphatic Temporal Difference (ETD) learning has recently been proposed as a convergent off-policy learning method. ETD was proposed mainly to address convergence issues of conventional Temporal Difference (TD) learning under off-policy training but i t is different from conventional TD learning even under on-policy training. A simple counterexample provided back in 2017 pointed to a potential class of problems where ETD converges but TD diverges. In this paper, we empirically show that ETD converges on a few other well-known on-policy experiments whereas TD either diverges or performs poorly. We also show that ETD outperforms TD on the mountain car prediction problem. Our results, together with a similar pattern observed under off-policy training in prior works, suggest that ETD might be a good substitute over conventional TD.
98 - Quanlin Chen 2021
Multi-agent value-based approaches recently make great progress, especially value decomposition methods. However, there are still a lot of limitations in value function factorization. In VDN, the joint action-value function is the sum of per-agent ac tion-value function while the joint action-value function of QMIX is the monotonic mixing of per-agent action-value function. To some extent, QTRAN reduces the limitation of joint action-value functions that can be represented, but it has unsatisfied performance in complex tasks. In this paper, in order to extend the class of joint value functions that can be represented, we propose a novel actor-critic method called NQMIX. NQMIX introduces an off-policy policy gradient on QMIX and modify its network architecture, which can remove the monotonicity constraint of QMIX and implement a non-monotonic value function factorization for the joint action-value function. In addition, NQMIX takes the state-value as the learning target, which overcomes the problem in QMIX that the learning target is overestimated. Furthermore, NQMIX can be extended to continuous action space settings by introducing deterministic policy gradient on itself. Finally, we evaluate our actor-critic methods on SMAC domain, and show that it has a stronger performance than COMA and QMIX on complex maps with heterogeneous agent types. In addition, our ablation results show that our modification of mixer is effective.
An effective approach to exploration in reinforcement learning is to rely on an agents uncertainty over the optimal policy, which can yield near-optimal exploration strategies in tabular settings. However, in non-tabular settings that involve functio n approximators, obtaining accurate uncertainty estimates is almost as challenging a problem. In this paper, we highlight that value estimates are easily biased and temporally inconsistent. In light of this, we propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors. This exploration signal controls for state-action transitions so as to isolate uncertainty in value that is due to uncertainty over the agents parameters. Because our measure of uncertainty conditions on state-action transitions, we cannot act on this measure directly. Instead, we incorporate it as an intrinsic reward and treat exploration as a separate learning problem, induced by the agents temporal difference uncertainties. We introduce a distinct exploration policy that learns to collect data with high estimated uncertainty, which gives rise to a curriculum that smoothly changes throughout learning and vanishes in the limit of perfect value estimates. We evaluate our method on hard exploration tasks, including Deep Sea and Atari 2600 environments and find that our proposed form of exploration facilitates both diverse and deep exploration.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا