Although Q-learning is one of the most successful algorithms for finding the best action-value function (and thus the optimal policy) in reinforcement learning, its implementation often suffers from large overestimation of Q-function values incurred by random sampling. The double Q-learning algorithm proposed in~\citet{hasselt2010double} overcomes such an overestimation issue by randomly switching the update between two Q-estimators, and has thus gained significant popularity in practice. However, the theoretical understanding of double Q-learning is rather limited. So far only the asymptotic convergence has been established, which does not characterize how fast the algorithm converges. In this paper, we provide the first non-asymptotic (i.e., finite-time) analysis for double Q-learning. We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum by taking $\tilde{\Omega}\left(\left(\frac{1}{(1-\gamma)^6\epsilon^2}\right)^{\frac{1}{\omega}} + \left(\frac{1}{1-\gamma}\right)^{\frac{1}{1-\omega}}\right)$ iterations, where $\omega\in(0,1)$ is the decay parameter of the learning rate, and $\gamma$ is the discount factor. Our analysis develops novel techniques to derive finite-time bounds on the difference between two inter-connected stochastic processes, which is new to the literature of stochastic approximation.
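For concreteness (this update rule is the standard tabular formulation of double Q-learning and is not quoted verbatim from the paper), the algorithm of~\citet{hasselt2010double} maintains two estimators $Q^A$ and $Q^B$ and, at each step, randomly selects one of them to update, evaluating its own greedy action with the other estimator. With a learning rate $\alpha_t$ (e.g., the polynomial decay $\alpha_t = 1/t^{\omega}$ suggested by the rate above), the update of $Q^A$ reads
\[
Q^A(s_t,a_t) \leftarrow Q^A(s_t,a_t) + \alpha_t\Big(r_t + \gamma\, Q^B\big(s_{t+1}, \arg\max_{a'} Q^A(s_{t+1},a')\big) - Q^A(s_t,a_t)\Big),
\]
and symmetrically for $Q^B$ when the random switch selects it. Using $Q^B$ to evaluate the action that is greedy with respect to $Q^A$ decouples action selection from value estimation, which is what mitigates the overestimation bias of standard Q-learning.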