Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning


Published by: Gal Dalal

Publication date: 2017

Research field: Informatics Engineering

Language: English

Authors: Gal Dalal, Balazs Szorenyi, Gugan Thoppe

Artificial Intelligence


Abstract

Two-timescale Stochastic Approximation (SA) algorithms are widely used in Reinforcement Learning (RL). Their iterates have two parts that are updated using distinct stepsizes. In this work, we develop a novel recipe for their finite sample analysis. Using this, we provide a concentration bound, which is the first such result for a two-timescale SA. The type of bound we obtain is known as a "lock-in probability". We also introduce a new projection scheme, in which the time between successive projections increases exponentially. This scheme allows one to elegantly transform a lock-in probability into a convergence rate result for projected two-timescale SA. From this latter result, we then extract key insights on stepsize selection. As an application, we finally obtain convergence rates for the projected two-timescale RL algorithms GTD(0), GTD2, and TDC.
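To make the idea of a sparsely projected two-timescale iteration concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the authors' algorithm: the scalar linear drift functions, the stepsize exponents, the clipping radius, and the choice of projecting at steps n = 1, 2, 4, 8, ... (so that the gap between projections doubles) are all hypothetical choices made for the example.

```python
import random


def clip(x, radius):
    """Project a scalar onto the interval [-radius, radius]."""
    return max(-radius, min(radius, x))


def run_projected_two_timescale(num_steps=20000, a=0.9, b=0.6,
                                radius=10.0, noise_std=0.1, seed=0):
    """Toy two-timescale SA with projections at exponentially spaced steps.

    Assumed (hypothetical) model: the slow iterate theta is driven by
    1 - w, the fast iterate w is driven by theta - w, so both converge
    to 1. Stepsizes are n**(-a) (slow) and n**(-b) (fast), with a > b.
    """
    rng = random.Random(seed)
    theta, w = 5.0, -5.0
    num_projections = 0
    for n in range(1, num_steps + 1):
        alpha_n = n ** (-a)  # slow stepsize
        beta_n = n ** (-b)   # fast stepsize
        theta_next = theta + alpha_n * ((1.0 - w) + rng.gauss(0.0, noise_std))
        w_next = w + beta_n * ((theta - w) + rng.gauss(0.0, noise_std))
        theta, w = theta_next, w_next
        # Project only when n is a power of two, so the time between
        # successive projections grows exponentially.
        if (n & (n - 1)) == 0:
            theta, w = clip(theta, radius), clip(w, radius)
            num_projections += 1
    return theta, w, num_projections


if __name__ == "__main__":
    print(run_projected_two_timescale())
```

With 20000 steps the projection fires only at the powers of two up to 16384, which is why the iterates run unprojected for most of the trajectory.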


76 - Gal Dalal, Balazs Szorenyi, Gugan Thoppe 2019

Policy evaluation in reinforcement learning is often conducted using two-timescale stochastic approximation, which results in various gradient temporal difference methods such as GTD(0), GTD2, and TDC. Here, we provide convergence rate bounds for this suite of algorithms. Algorithms such as these have two iterates, $\theta_n$ and $w_n$, which are updated using two distinct stepsize sequences, $\alpha_n$ and $\beta_n$, respectively. Assuming $\alpha_n = n^{-\alpha}$ and $\beta_n = n^{-\beta}$ with $1 > \alpha > \beta > 0$, we show that, with high probability, the two iterates converge to their respective solutions $\theta^*$ and $w^*$ at rates given by $\|\theta_n - \theta^*\| = \tilde{O}(n^{-\alpha/2})$ and $\|w_n - w^*\| = \tilde{O}(n^{-\beta/2})$; here, $\tilde{O}$ hides logarithmic terms. Via comparable lower bounds, we show that these bounds are, in fact, tight. To the best of our knowledge, ours is the first finite-time analysis which achieves these rates. While it was known that the two timescale components decouple asymptotically, our results depict this phenomenon more explicitly by showing that it in fact happens from some finite time onwards. Lastly, compared to existing works, our result applies to a broader family of stepsizes, including non-square summable ones.
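As a rough illustration of the two iterates and their stepsizes, the sketch below writes out one TDC update with linear features in the form commonly given in the gradient-TD literature, together with the polynomial stepsizes $n^{-\alpha}$ and $n^{-\beta}$. The function names `dot`, `tdc_step`, and `stepsizes` are hypothetical, and the abstract itself does not spell out the update, so treat this as a sketch under those assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def stepsizes(n, a, b):
    """Polynomial stepsizes alpha_n = n**(-a), beta_n = n**(-b), with 1 > a > b > 0."""
    return n ** (-a), n ** (-b)


def tdc_step(theta, w, phi, phi_next, reward, gamma, alpha_n, beta_n):
    """One TDC update (standard form, assumed here for illustration).

    theta is the slow iterate (value-function weights) and w is the
    fast auxiliary iterate; phi and phi_next are the feature vectors
    of the current and next state.
    """
    # Temporal-difference error under the current theta.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    phi_w = dot(phi, w)
    theta_new = [
        t + alpha_n * (delta * p - gamma * pn * phi_w)
        for t, p, pn in zip(theta, phi, phi_next)
    ]
    w_new = [wi + beta_n * (delta - phi_w) * p for wi, p in zip(w, phi)]
    return theta_new, w_new


if __name__ == "__main__":
    alpha_n, beta_n = stepsizes(16, 0.75, 0.5)
    print(tdc_step([0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                   1.0, 0.5, alpha_n, beta_n))
```

Because $\alpha > \beta$, the stepsize applied to `w` decays more slowly, so `w` is the faster-moving iterate.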

Machine Learning, Probability

Finite Sample Analyses for TD(0) with Function Approximation

126 - Gal Dalal, Balazs Szorenyi, Gugan Thoppe 2017

TD(0) is one of the most commonly used algorithms in reinforcement learning. Despite this, there is no existing finite sample analysis for TD(0) with function approximation, even for the linear case. Our work is the first to provide such results. Existing convergence rates for Temporal Difference (TD) methods apply only to somewhat modifi

Artificial Intelligence

Finite-Time Analysis of Decentralized Stochastic Approximation with Applications in Multi-Agent and Multi-Task Learning

330 - Sihan Zeng, Thinh T. Doan, Justin Romberg 2020

Stochastic approximation, a data-driven approach for finding the fixed point of an unknown operator, provides a unified framework for treating many problems in stochastic optimization and reinforcement learning. Motivated by a growing interest in multi-agent and multi-task learning, we consider in this paper a decentralized variant of stochastic approximation. A network of agents, each with their own unknown operator and data observations, cooperatively find the fixed point of the aggregate operator. The agents work by running a local stochastic approximation algorithm using noisy samples from their operators while averaging their iterates with their neighbors on a decentralized communication graph. Our main contribution provides a finite-time analysis of this decentralized stochastic approximation algorithm and characterizes the impacts of the underlying communication topology between agents. Our model for the data observed at each agent is that it is sampled from a Markov process; this lack of independence makes the iterates biased and (potentially) unbounded. Under mild assumptions on the Markov processes, we show that the convergence rate of the proposed methods is essentially the same as if the samples were independent, differing only by a log factor that represents the mixing time of the Markov process. We also present applications of the proposed method on a number of interesting learning problems in multi-agent systems, including a decentralized variant of Q-learning for solving multi-task reinforcement learning.
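The local-update-plus-averaging pattern described above can be sketched on a toy problem. Everything specific here is an assumption made for illustration: four agents on a ring, mixing weights of 1/2 (self) and 1/4 (each neighbor), scalar local operators $G_i(x) = 0.5x + c_i$, a stepsize of $2/(n+2)$, and i.i.d. Gaussian noise in place of the Markovian samples the paper analyzes. The averaged operator $0.5x + 2.5$ has fixed point 5.

```python
import random


def decentralized_sa(num_steps=20000, noise_std=0.1, seed=0):
    """Toy decentralized SA: neighbor averaging plus a local noisy step.

    Hypothetical setup: agent i holds the scalar operator
    G_i(x) = 0.5 * x + offsets[i]; the agents jointly seek the fixed
    point of the average operator, which is 2 * mean(offsets) = 5.
    """
    rng = random.Random(seed)
    offsets = [1.0, 2.0, 3.0, 4.0]
    num_agents = len(offsets)
    x = [0.0] * num_agents
    for n in range(num_steps):
        step = 2.0 / (n + 2)
        # Consensus step: average with the two ring neighbors.
        mixed = []
        for i in range(num_agents):
            left = x[(i - 1) % num_agents]
            right = x[(i + 1) % num_agents]
            mixed.append(0.5 * x[i] + 0.25 * left + 0.25 * right)
        # Local SA step using a noisy sample of G_i(x_i) - x_i.
        x = [
            mixed[i] + step * (0.5 * x[i] + offsets[i]
                               + rng.gauss(0.0, noise_std) - x[i])
            for i in range(num_agents)
        ]
    return x


if __name__ == "__main__":
    print(decentralized_sa())
```

No single agent can find the aggregate fixed point from its own operator alone (agent 0's local fixed point is 2, agent 3's is 8); the averaging step is what pulls all iterates toward the common value.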

Machine Learning, Optimization and Control

Finite-Sample Analysis of Decentralized Temporal-Difference Learning with Linear Function Approximation

128 - Jun Sun, Gang Wang, Georgios B. Giannakis 2019

Motivated by the emerging use of multi-agent reinforcement learning (MARL) in engineering applications such as networked robotics, swarming drones, and sensor networks, we investigate the policy evaluation problem in a fully decentralized setting, using temporal-difference (TD) learning with linear function approximation to handle large state spaces in practice. The goal of a group of agents is to collaboratively learn the value function of a given policy from locally private rewards observed in a shared environment, through exchanging local estimates with neighbors. Despite their simplicity and widespread use, our theoretical understanding of such decentralized TD learning algorithms remains limited. Existing results were obtained based on i.i.d. data samples, or by imposing an "additional" projection step to control the "gradient" bias incurred by the Markovian observations. In this paper, we provide a finite-sample analysis of the fully decentralized TD(0) learning under both i.i.d. as well as Markovian samples, and prove that all local estimates converge linearly to a small neighborhood of the optimum. The resultant error bounds are the first of their type, in the sense that they hold under the most practical assumptions, which is made possible by means of a novel multi-step Lyapunov analysis.

Machine Learning, Information Theory, Systems and Control

Finite-Time Analysis of Asynchronous Stochastic Approximation and $Q$-Learning

104 - Guannan Qu, Adam Wierman 2020

We consider a general asynchronous Stochastic Approximation (SA) scheme featuring a weighted infinity-norm contractive operator, and prove a bound on its finite-time convergence rate on a single trajectory. Additionally, we specialize the result to asynchronous $Q$-learning. The resulting bound matches the sharpest available bound for synchronous $Q$-learning, and improves over previously known bounds for asynchronous $Q$-learning.
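For readers unfamiliar with the asynchronous setting, the sketch below runs tabular $Q$-learning along a single trajectory, updating only the state-action pair that was just visited. The two-state MDP, the uniform behavior policy, the reward noise, and the per-pair stepsize $k^{-0.6}$ (with $k$ the visit count) are all hypothetical choices for this example and are not taken from the paper.

```python
import random


def async_q_learning(num_steps=20000, gamma=0.5, noise_std=0.1, seed=0):
    """Asynchronous tabular Q-learning on a toy 2-state, 2-action MDP.

    Hypothetical MDP: action 0 stays in the current state, action 1
    switches state; the mean reward is 1 when the next state is 1 and
    0 otherwise. With gamma = 0.5 the optimal Q-table is
    [[1, 2], [2, 1]].
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]
    visits = [[0, 0], [0, 0]]
    state = 0
    for _ in range(num_steps):
        action = rng.randrange(2)  # uniform exploration
        next_state = state if action == 0 else 1 - state
        reward = (1.0 if next_state == 1 else 0.0) + rng.gauss(0.0, noise_std)
        # Asynchronous update: only the visited (state, action) entry changes.
        visits[state][action] += 1
        step = visits[state][action] ** -0.6
        target = reward + gamma * max(q[next_state])
        q[state][action] += step * (target - q[state][action])
        state = next_state
    return q


if __name__ == "__main__":
    print(async_q_learning())
```

A synchronous variant would instead update every entry of the table at each iteration using a fresh sample for each pair; here the trajectory decides which entry gets updated.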

Optimization and Control, Machine Learning