Derivative-Free Policy Optimization for Linear Risk-Sensitive and Robust Control Design: Implicit Regularization and Sample Complexity

Published by Xiangyuan Zhang

Publication date: 2021

Research field: Informatics Engineering

Language: English

Authors: Kaiqing Zhang - Xiangyuan Zhang - Bin Hu

Optimization and Control, Artificial Intelligence, Machine Learning

Direct policy search serves as one of the workhorses in modern reinforcement learning (RL), and its applications in continuous control tasks have recently attracted increasing attention. In this work, we investigate the convergence theory of policy gradient (PG) methods for learning the linear risk-sensitive and robust controller. In particular, we develop PG methods that can be implemented in a derivative-free fashion by sampling system trajectories, and establish both global convergence and sample complexity results for two fundamental settings in risk-sensitive and robust control: the finite-horizon linear exponential quadratic Gaussian problem and the finite-horizon linear-quadratic disturbance attenuation problem. As a by-product, our results also provide the first sample complexity for the global convergence of PG methods on solving zero-sum linear-quadratic dynamic games, a nonconvex-nonconcave minimax optimization problem that serves as a baseline setting in multi-agent reinforcement learning (MARL) with continuous spaces. One feature of our algorithms is that during the learning phase, a certain level of robustness/risk-sensitivity of the controller is preserved, which we term the implicit regularization property and which is an essential requirement in safety-critical control systems.
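For orientation, the finite-horizon linear exponential quadratic Gaussian (LEQG) objective named in the abstract is commonly written in the textbook form below. The notation ($x_t$, $u_t$, $A_t$, $B_t$, $Q_t$, $R_t$, $W$, the risk parameter $\beta$, and the horizon $N$) is an assumption made here for illustration and is not taken from the paper itself.

```latex
\mathcal{J}_\beta
  = \frac{2}{\beta}\,\log \mathbb{E}\,
    \exp\!\left[\frac{\beta}{2}\left(
      \sum_{t=0}^{N-1}\bigl(x_t^\top Q_t x_t + u_t^\top R_t u_t\bigr)
      + x_N^\top Q_N x_N\right)\right],
\qquad
x_{t+1} = A_t x_t + B_t u_t + w_t,\quad w_t \sim \mathcal{N}(0, W).
```

In this form, $\beta > 0$ penalizes large cost realizations more heavily (risk aversion), and as $\beta \to 0$ the objective reduces to the usual expected quadratic (LQG) cost.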

Hesameddin Mohammadi, Armin Zare, Mahdi Soltanolkotabi, 2019

Model-free reinforcement learning attempts to find an optimal control action for an unknown dynamical system by directly searching over the parameter space of controllers. The convergence behavior and statistical properties of these approaches are often poorly understood because of the nonconvex nature of the underlying optimization problems and the lack of exact gradient computation. In this paper, we take a step towards demystifying the performance and efficiency of such methods by focusing on the standard infinite-horizon linear quadratic regulator problem for continuous-time systems with unknown state-space parameters. We establish exponential stability for the ordinary differential equation (ODE) that governs the gradient-flow dynamics over the set of stabilizing feedback gains and show that a similar result holds for the gradient descent method that arises from the forward Euler discretization of the corresponding ODE. We also provide theoretical bounds on the convergence rate and sample complexity of the random search method with two-point gradient estimates. We prove that the required simulation time for achieving $\epsilon$-accuracy in the model-free setup and the total number of function evaluations both scale as $\log(1/\epsilon)$.
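The idea of random search with two-point gradient estimates can be sketched as follows. This is a minimal illustration on a toy quadratic, not the authors' algorithm: the function names, the `toy_cost` objective, and all step-size and smoothing-radius values are assumptions chosen for the example. In the control setting, the cost would instead be evaluated by simulating the closed-loop system under a perturbed feedback gain.

```python
import math
import random


def sample_unit_sphere(dim, rng):
    """Draw a direction uniformly from the unit sphere (normalized Gaussian)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def two_point_gradient(f, x, radius, num_samples, rng):
    """Average of two-point zeroth-order gradient estimates of f at x.

    Each sample uses only two function evaluations, f(x + r*u) and
    f(x - r*u), along a random unit direction u.
    """
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(num_samples):
        u = sample_unit_sphere(dim, rng)
        x_plus = [xi + radius * ui for xi, ui in zip(x, u)]
        x_minus = [xi - radius * ui for xi, ui in zip(x, u)]
        scale = dim * (f(x_plus) - f(x_minus)) / (2.0 * radius)
        for i in range(dim):
            grad[i] += scale * u[i] / num_samples
    return grad


def toy_cost(x):
    """Hypothetical smooth cost standing in for a control objective."""
    return x[0] ** 2 + 2.0 * x[1] ** 2


def random_search(f, x0, step, radius, num_samples, iters, seed=0):
    """Gradient descent driven only by function evaluations of f."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        g = two_point_gradient(f, x, radius, num_samples, rng)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    x_final = random_search(toy_cost, [1.0, -1.0], step=0.1,
                            radius=0.01, num_samples=20, iters=200)
    print(x_final, toy_cost(x_final))
```

The descent step is the forward Euler discretization of gradient flow, with the exact gradient replaced by its sampled estimate. Using two symmetric evaluations per direction cancels the even-order terms of the cost, which keeps the estimate's variance from growing as the smoothing radius shrinks.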

Optimization and Control, Artificial Intelligence, Machine Learning

On the policy improvement algorithm for ergodic risk-sensitive control

Ari Arapostathis, Anup Biswas, 2019

In this article we consider the ergodic risk-sensitive control problem for a large class of multidimensional controlled diffusions on the whole space. We study the minimization and maximization problems under either a blanket stability hypothesis, or a near-monotone assumption on the running cost. We establish the convergence of the policy improvement algorithm for these models. We also present a more general result concerning the region of attraction of the equilibrium of the algorithm.

Optimization and Control, Probability

Distributed Reinforcement Learning for Decentralized Linear Quadratic Control: A Derivative-Free Policy Optimization Approach

Yingying Li, Yujie Tang, Runyu Zhang, 2019

This paper considers a distributed reinforcement learning problem for decentralized linear quadratic control with partial state observations and local costs. We propose a Zero-Order Distributed Policy Optimization algorithm (ZODPO) that learns linear local controllers in a distributed fashion, leveraging the ideas of policy gradient, zero-order optimization and consensus algorithms. In ZODPO, each agent estimates the global cost by consensus, and then conducts local policy gradient in parallel based on zero-order gradient estimation. ZODPO only requires limited communication and storage even in large-scale systems. Further, we investigate the nonasymptotic performance of ZODPO and show that the sample complexity to approach a stationary point is polynomial in the inverse of the error tolerance and in the problem dimensions, demonstrating the scalability of ZODPO. We also show that the controllers generated throughout ZODPO are stabilizing controllers with high probability. Lastly, we numerically test ZODPO on multi-zone HVAC systems.

Systems and Control, Machine Learning

Linear and dynamic programs for risk-sensitive cost minimization

Ari Arapostathis, Vivek S. Borkar, 2021

We derive equivalent linear and dynamic programs for infinite horizon risk-sensitive control for minimization of the asymptotic growth rate of the cumulative cost.

Optimization and Control

Sample Complexity of Linear Quadratic Gaussian (LQG) Control for Output Feedback Systems

Yang Zheng, Luca Furieri, Maryam Kamgarpour, 2020

This paper studies a class of partially observed Linear Quadratic Gaussian (LQG) problems with unknown dynamics. We establish an end-to-end sample complexity bound on learning a robust LQG controller for open-loop stable plants. This is achieved using a robust synthesis procedure, where we first estimate a model from a single input-output trajectory of finite length, identify an H-infinity bound on the estimation error, and then design a robust controller using the estimated model and its quantified uncertainty. Our synthesis procedure leverages a recent control tool called Input-Output Parameterization (IOP) that enables robust controller design using convex optimization. For open-loop stable systems, we prove that the LQG performance degrades linearly with respect to the model estimation error using the proposed synthesis procedure. Despite the hidden states in the LQG problem, the achieved scaling matches previous results on learning Linear Quadratic Regulator (LQR) controllers with full state observations.

Optimization and Control