ترغب بنشر مسار تعليمي؟ اضغط هنا

Finite-Time Analysis for Double Q-learning

141   0   0.0 ( 0 )
 نشر من قبل Huaqing Xiong
 تاريخ النشر 2020
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Although Q-learning is one of the most successful algorithms for finding the best action-value function (and thus the optimal policy) in reinforcement learning, its implementation often suffers from large overestimation of Q-function values incurred by random sampling. The double Q-learning algorithm proposed in~citet{hasselt2010double} overcomes such an overestimation issue by randomly switching the update between two Q-estimators, and has thus gained significant popularity in practice. However, the theoretical understanding of double Q-learning is rather limited. So far only the asymptotic convergence has been established, which does not characterize how fast the algorithm converges. In this paper, we provide the first non-asymptotic (i.e., finite-time) analysis for double Q-learning. We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $epsilon$-accurate neighborhood of the global optimum by taking $tilde{Omega}left(left( frac{1}{(1-gamma)^6epsilon^2}right)^{frac{1}{omega}} +left(frac{1}{1-gamma}right)^{frac{1}{1-omega}}right)$ iterations, where $omegain(0,1)$ is the decay parameter of the learning rate, and $gamma$ is the discount factor. Our analysis develops novel techniques to derive finite-time bounds on the difference between two inter-connected stochastic processes, which is new to the literature of stochastic approximation.



قيم البحث

اقرأ أيضاً

128 - Haotian Gu , Xin Guo , Xiaoli Wei 2020
Multi-agent reinforcement learning (MARL), despite its popularity and empirical success, suffers from the curse of dimensionality. This paper builds the mathematical framework to approximate cooperative MARL by a mean-field control (MFC) approach, an d shows that the approximation error is of $mathcal{O}(frac{1}{sqrt{N}})$. By establishing an appropriate form of the dynamic programming principle for both the value function and the Q function, it proposes a model-free kernel-based Q-learning algorithm (MFC-K-Q), which is shown to have a linear convergence rate for the MFC problem, the first of its kind in the MARL literature. It further establishes that the convergence rate and the sample complexity of MFC-K-Q are independent of the number of agents $N$, which provides an $mathcal{O}(frac{1}{sqrt{N}})$ approximation to the MARL problem with $N$ agents in the learning environment. Empirical studies for the network traffic congestion problem demonstrate that MFC-K-Q outperforms existing MARL algorithms when $N$ is large, for instance when $N>50$.
We introduce the lookahead-bounded Q-learning (LBQL) algorithm, a new, provably convergent variant of Q-learning that seeks to improve the performance of standard Q-learning in stochastic environments through the use of ``lookahead upper and lower bo unds. To do this, LBQL employs previously collected experience and each iterations state-action values as dual feasible penalties to construct a sequence of sampled information relaxation problems. The solutions to these problems provide estimated upper and lower bounds on the optimal value, which we track via stochastic approximation. These quantities are then used to constrain the iterates to stay within the bounds at every iteration. Numerical experiments on benchmark problems show that LBQL exhibits faster convergence and more robustness to hyperparameters when compared to standard Q-learning and several related techniques. Our approach is particularly appealing in problems that require expensive simulations or real-world interactions.
Stochastic approximation, a data-driven approach for finding the fixed point of an unknown operator, provides a unified framework for treating many problems in stochastic optimization and reinforcement learning. Motivated by a growing interest in mul ti-agent and multi-task learning, we consider in this paper a decentralized variant of stochastic approximation. A network of agents, each with their own unknown operator and data observations, cooperatively find the fixed point of the aggregate operator. The agents work by running a local stochastic approximation algorithm using noisy samples from their operators while averaging their iterates with their neighbors on a decentralized communication graph. Our main contribution provides a finite-time analysis of this decentralized stochastic approximation algorithm and characterizes the impacts of the underlying communication topology between agents. Our model for the data observed at each agent is that it is sampled from a Markov processes; this lack of independence makes the iterates biased and (potentially) unbounded. Under mild assumptions on the Markov processes, we show that the convergence rate of the proposed methods is essentially the same as if the samples were independent, differing only by a log factor that represents the mixing time of the Markov process. We also present applications of the proposed method on a number of interesting learning problems in multi-agent systems, including a decentralized variant of Q-learning for solving multi-task reinforcement learning.
102 - Kaiyi Ji 2021
Bilevel optimization has become a powerful framework in various machine learning applications including meta-learning, hyperparameter optimization, and network architecture search. There are generally two classes of bilevel optimization formulations for machine learning: 1) problem-based bilevel optimization, whose inner-level problem is formulated as finding a minimizer of a given loss function; and 2) algorithm-based bilevel optimization, whose inner-level solution is an output of a fixed algorithm. For the first class, two popular types of gradient-based algorithms have been proposed for hypergradient estimation via approximate implicit differentiation (AID) and iterative differentiation (ITD). Algorithms for the second class include the popular model-agnostic meta-learning (MAML) and almost no inner loop (ANIL). However, the convergence rate and fundamental limitations of bilevel optimization algorithms have not been well explored. This thesis provides a comprehensive convergence rate analysis for bilevel algorithms in the aforementioned two classes. We further propose principled algorithm designs for bilevel optimization with higher efficiency and scalability. For the problem-based formulation, we provide a convergence rate analysis for AID- and ITD-based bilevel algorithms. We then develop acceleration bilevel algorithms, for which we provide shaper convergence analysis with relaxed assumptions. We also provide the first lower bounds for bilevel optimization, and establish the optimality by providing matching upper bounds under certain conditions. We finally propose new stochastic bilevel optimization algorithms with lower complexity and higher efficiency in practice. For the algorithm-based formulation, we develop a theoretical convergence for general multi-step MAML and ANIL, and characterize the impact of parameter selections and loss geometries on the their complexities.
104 - Guannan Qu , Adam Wierman 2020
We consider a general asynchronous Stochastic Approximation (SA) scheme featuring a weighted infinity-norm contractive operator, and prove a bound on its finite-time convergence rate on a single trajectory. Additionally, we specialize the result to a synchronous $Q$-learning. The resulting bound matches the sharpest available bound for synchronous $Q$-learning, and improves over previous known bounds for asynchronous $Q$-learning.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا