Direct policy search is one of the workhorses of modern reinforcement learning (RL), and its application to continuous control tasks has recently attracted increasing attention. In this work, we investigate the convergence theory of policy gradient (PG) methods for learning linear risk-sensitive and robust controllers. In particular, we develop PG methods that can be implemented in a derivative-free fashion by sampling system trajectories, and we establish both global convergence and sample complexity results for two fundamental settings in risk-sensitive and robust control: the finite-horizon linear exponential quadratic Gaussian problem and the finite-horizon linear-quadratic disturbance attenuation problem. As a by-product, our results also provide the first sample complexity guarantee for the global convergence of PG methods on zero-sum linear-quadratic dynamic games, a nonconvex-nonconcave minimax optimization problem that serves as a baseline setting in multi-agent reinforcement learning (MARL) with continuous spaces. One feature of our algorithms is that, during the learning phase, a certain level of robustness/risk-sensitivity of the controller is preserved. We term this the implicit regularization property; it is an essential requirement in safety-critical control systems.
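To make the phrase "derivative-free fashion by sampling system trajectories" concrete, the following is a minimal sketch of a two-point zeroth-order gradient estimate driving plain gradient descent over time-varying feedback gains. It is not the algorithm analyzed in the paper: it assumes a scalar finite-horizon linear-quadratic problem with no risk-sensitive objective and no adversarial disturbance, and the names `rollout_cost`, `zo_gradient`, `policy_search`, and all numerical values are hypothetical choices made for illustration only.

```python
import math
import random


def rollout_cost(gains, a, b, q, r, x0, noise):
    """Quadratic cost of one trajectory of x_{t+1} = a*x_t + b*u_t + w_t
    under the linear feedback u_t = -gains[t] * x_t (scalar system, assumed)."""
    x, cost = x0, 0.0
    for k, w in zip(gains, noise):
        u = -k * x
        cost += q * x * x + r * u * u
        x = a * x + b * u + w
    return cost + q * x * x  # terminal state cost


def zo_gradient(gains, cost_fn, radius, num_samples, rng):
    """Two-point zeroth-order gradient estimate: query the cost at gains +/- radius*v
    for random unit directions v, using cost values only (no derivatives)."""
    d = len(gains)
    grad = [0.0] * d
    for _ in range(num_samples):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        plus = cost_fn([k + radius * c for k, c in zip(gains, v)])
        minus = cost_fn([k - radius * c for k, c in zip(gains, v)])
        scale = d * (plus - minus) / (2.0 * radius * num_samples)
        grad = [g + scale * c for g, c in zip(grad, v)]
    return grad


def policy_search(gains, cost_fn, step, radius, num_samples, iters, rng):
    """Gradient descent on the gains using the zeroth-order estimate above."""
    for _ in range(iters):
        grad = zo_gradient(gains, cost_fn, radius, num_samples, rng)
        gains = [k - step * g for k, g in zip(gains, grad)]
    return gains


if __name__ == "__main__":
    rng = random.Random(0)
    horizon = 5
    # A fixed batch of sampled initial states and noise sequences (illustrative values),
    # reused for every cost query so that the two-point differences are low-variance.
    batch = [(rng.gauss(0.0, 1.0), [0.1 * rng.gauss(0.0, 1.0) for _ in range(horizon)])
             for _ in range(20)]

    def avg_cost(gains):
        return sum(rollout_cost(gains, 1.0, 0.5, 1.0, 0.1, x0, w)
                   for x0, w in batch) / len(batch)

    start = [0.0] * horizon
    learned = policy_search(start, avg_cost, step=0.005, radius=0.05,
                            num_samples=10, iters=200, rng=rng)
    print("initial cost:", avg_cost(start))
    print("cost after derivative-free policy search:", avg_cost(learned))
```

The estimator only ever evaluates trajectory costs, which is what lets this kind of method run without a model of the dynamics. The step size, smoothing radius, and sample counts here are untuned guesses; the sample complexity results described above concern how such quantities must scale, which this sketch does not attempt to reproduce.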
Model-free reinforcement learning attempts to find an optimal control action for an unknown dynamical system by directly searching over the parameter space of controllers. The convergence behavior and statistical properties of these approaches are of
In this article we consider the ergodic risk-sensitive control problem for a large class of multidimensional controlled diffusions on the whole space. We study the minimization and maximization problems under either a blanket stability hypothesis, or
This paper considers a distributed reinforcement learning problem for decentralized linear quadratic control with partial state observations and local costs. We propose a Zero-Order Distributed Policy Optimization algorithm (ZODPO) that learns linear
We derive equivalent linear and dynamic programs for infinite horizon risk-sensitive control for minimization of the asymptotic growth rate of the cumulative cost.
This paper studies a class of partially observed Linear Quadratic Gaussian (LQG) problems with unknown dynamics. We establish an end-to-end sample complexity bound on learning a robust LQG controller for open-loop stable plants. This is achieved usin