Policy Gradient Methods for the Noisy Linear Quadratic Regulator over a Finite Horizon

179 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Renyuan Xu

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية مالية

والبحث باللغة English

تأليف Ben Hambly - Renyuan Xu - Huining Yang

التعلم الآلي التمويل الإحصائي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We explore reinforcement learning methods for finding the optimal policy in the linear quadratic regulator (LQR) problem. In particular, we consider the convergence of policy gradient methods in the setting of known and unknown parameters. We are able to produce a global linear convergence guarantee for this approach in the setting of finite time horizon and stochastic state dynamics under weak assumptions. The convergence of a projected policy gradient method is also established in order to handle problems with constraints. We illustrate the performance of the algorithm with two examples. The first example is the optimal liquidation of a holding in an asset. We show results for the case where we assume a model for the underlying dynamics and where we apply the method to the data directly. The empirical evidence suggests that the policy gradient method can learn the global optimal solution for a larger class of stochastic systems containing the LQR framework and that it is more robust with respect to model mis-specification when compared to a model-based approach. The second example is an LQR system in a higher dimensional setting with synthetic data.

قيم البحث

72 - Hesameddin Mohammadi , Armin Zare , Mahdi Soltanolkotabi 2019

Model-free reinforcement learning attempts to find an optimal control action for an unknown dynamical system by directly searching over the parameter space of controllers. The convergence behavior and statistical properties of these approaches are of ten poorly understood because of the nonconvex nature of the underlying optimization problems and the lack of exact gradient computation. In this paper, we take a step towards demystifying the performance and efficiency of such methods by focusing on the standard infinite-horizon linear quadratic regulator problem for continuous-time systems with unknown state-space parameters. We establish exponential stability for the ordinary differential equation (ODE) that governs the gradient-flow dynamics over the set of stabilizing feedback gains and show that a similar result holds for the gradient descent method that arises from the forward Euler discretization of the corresponding ODE. We also provide theoretical bounds on the convergence rate and sample complexity of the random search method with two-point gradient estimates. We prove that the required simulation time for achieving $epsilon$-accuracy in the model-free setup and the total number of function evaluations both scale as $log , (1/epsilon)$.

التحسين والتحكم الذكاء الاصطناعي التعلم الآلي

Learning the Linear Quadratic Regulator from Nonlinear Observations

225 - Zakaria Mhammedi , Dylan J. Foster , Max Simchowitz 2020

We introduce a new problem setting for continuous control called the LQR with Rich Observations, or RichLQR. In our setting, the environment is summarized by a low-dimensional continuous latent state with linear dynamics and quadratic costs, but the agent operates on high-dimensional, nonlinear observations such as images from a camera. To enable sample-efficient learning, we assume that the learner has access to a class of decoder functions (e.g., neural networks) that is flexible enough to capture the mapping from observations to latent states. We introduce a new algorithm, RichID, which learns a near-optimal policy for the RichLQR with sample complexity scaling only with the dimension of the latent state space and the capacity of the decoder function class. RichID is oracle-efficient and accesses the decoder class only through calls to a least-squares regression oracle. Our results constitute the first provable sample complexity guarantee for continuous control with an unknown nonlinearity in the system model and general function approximation.

التعلم الآلي التحسين والتحكم نظرية الإحصاء

Infinite-horizon Risk-constrained Linear Quadratic Regulator with Average Cost

366 - Feiran Zhao , Keyou You , Tamer Basar 2021

The behaviour of a stochastic dynamical system may be largely influenced by those low-probability, yet extreme events. To address such occurrences, this paper proposes an infinite-horizon risk-constrained Linear Quadratic Regulator (LQR) framework wi th time-average cost. In addition to the standard LQR objective, the average one-stage predictive variance of the state penalty is constrained to lie within a user-specified level. By leveraging the duality, its optimal solution is first shown to be stationary and affine in the state, i.e., $u(x,lambda^*) = -K(lambda^*)x + l(lambda^*)$, where $lambda^*$ is an optimal multiplier, used to address the risk constraint. Then, we establish the stability of the resulting closed-loop system. Furthermore, we propose a primal-dual method with sublinear convergence rate to find an optimal policy $u(x,lambda^*)$. Finally, a numerical example is provided to demonstrate the effectiveness of the proposed framework and the primal-dual method.

التحسين والتحكم أنظمة وتحكم أنظمة وتحكم

On the Analysis of Model-free Methods for the Linear Quadratic Regulator

152 - Zeyu Jin , Johann Michael Schmitt , Zaiwen Wen 2020

Many reinforcement learning methods achieve great success in practice but lack theoretical foundation. In this paper, we study the convergence analysis on the problem of the Linear Quadratic Regulator (LQR). The global linear convergence properties a nd sample complexities are established for several popular algorithms such as the policy gradient algorithm, TD-learning and the actor-critic (AC) algorithm. Our results show that the actor-critic algorithm can reduce the sample complexity compared with the policy gradient algorithm. Although our analysis is still preliminary, it explains the benefit of AC algorithm in a certain sense.

التحسين والتحكم

Online Policy Gradient for Model Free Learning of Linear Quadratic Regulators with $sqrt{T}$ Regret

216 - Asaf Cassel 2021

We consider the task of learning to control a linear dynamical system under fixed quadratic costs, known as the Linear Quadratic Regulator (LQR) problem. While model-free approaches are often favorable in practice, thus far only model-based methods, which rely on costly system identification, have been shown to achieve regret that scales with the optimal dependence on the time horizon T. We present the first model-free algorithm that achieves similar regret guarantees. Our method relies on an efficient policy gradient scheme, and a novel and tighter analysis of the cost of exploration in policy space in this setting.

التعلم الآلي التعلم الالي