To better understand the underlying mechanisms of various reinforcement learning (RL) algorithms, and to make further progress in RL by drawing on optimization theory, many researchers have begun to revisit the linear-quadratic regulator (LQR) problem, whose setting is simple yet captures key characteristics of RL. Inspired by this, this work is concerned with the model-free design of stochastic LQR controllers for linear systems subject to Gaussian noise, from the perspectives of both RL and primal-dual optimization. From the RL perspective, we first develop a new model-free off-policy policy iteration (MF-OPPI) algorithm, in which sampled data are reused across policy updates to alleviate the data-hunger problem to some extent. We then provide a rigorous convergence analysis by showing that the involved iterations are equivalent to those of the classical policy iteration (PI) algorithm. From the optimization perspective, we first reformulate the stochastic LQR problem at hand as a constrained non-convex optimization problem, which is shown to enjoy strong duality. Then, to solve this non-convex optimization problem, we propose a model-based primal-dual (MB-PD) algorithm based on the properties of the resulting Karush-Kuhn-Tucker (KKT) conditions. We also give a model-free implementation of the MB-PD algorithm by solving a transformed dual feasibility condition. More importantly, we show that the dual and primal update steps in the MB-PD algorithm can be interpreted as the policy evaluation and policy improvement steps of the PI algorithm, respectively. Finally, we provide a simulation example to illustrate the performance of the proposed algorithms.
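For reference, here is a minimal sketch of the classical model-based policy iteration (PI) to which the MF-OPPI iterations are shown to be equivalent; the matrices A, B, Q, R and the initial stabilizing gain K0 are assumed inputs, and this is the model-based baseline, not the paper's model-free implementation.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_policy_iteration(A, B, Q, R, K0, n_iter=50, tol=1e-9):
    """Classical model-based policy iteration for discrete-time LQR.

    K0 must be a stabilizing gain for u = -K x. Each iteration performs
    policy evaluation (a Lyapunov equation) and policy improvement.
    """
    K = K0
    for _ in range(n_iter):
        Acl = A - B @ K
        # Policy evaluation: P solves Acl^T P Acl - P + Q + K^T R K = 0.
        P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
        # Policy improvement: greedy gain with respect to the evaluated P.
        K_new = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        if np.linalg.norm(K_new - K) < tol:
            return K_new, P
        K = K_new
    return K, P
```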
We show that, given a desired closed-loop response for a system, there exists an affine subspace of controllers that achieve this response. By leveraging the existence of this subspace, we are able to separate controller design from closed-loop design: we first synthesize the desired closed-loop response and then synthesize a controller that achieves it. This is a useful extension to the recently introduced System Level Synthesis (SLS) framework, in which the controller and closed-loop response are jointly synthesized and controller-specific constraints cannot be enforced without subjecting the closed-loop map to the same constraints. We demonstrate the importance of separating controller design from closed-loop design with an example in which communication delay and locality constraints render standard SLS infeasible. Using our new two-step procedure, we are able to synthesize a controller that obeys the constraints while incurring only a 3% increase in LQR cost compared to the optimal LQR controller.
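As a reminder of the standard state-feedback SLS parameterization that this two-step procedure builds on (notation assumed here, not taken from the abstract), the closed-loop responses $\Phi_x$ and $\Phi_u$ from the disturbance $w$ to the state and input must satisfy an affine achievability constraint, after which a controller realizing them can be recovered:

$$
\begin{bmatrix} zI - A & -B \end{bmatrix}
\begin{bmatrix} \Phi_x \\ \Phi_u \end{bmatrix} = I,
\qquad \Phi_x,\ \Phi_u \in \tfrac{1}{z}\mathcal{RH}_\infty,
\qquad K = \Phi_u \Phi_x^{-1},
$$

with $x = \Phi_x w$ and $u = \Phi_u w$.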
In this paper, we study the learning of safe policies in the setting of reinforcement learning. That is, we aim to control a Markov decision process (MDP) whose transition probabilities are unknown, but for which we have access to sample trajectories gathered through experience. We define safety as the agent remaining within a desired safe set with high probability during the operation time. We therefore consider a constrained MDP in which the constraints are probabilistic. Since there is no straightforward way to optimize the policy with respect to a probabilistic constraint in a reinforcement learning framework, we propose an ergodic relaxation of the problem. The advantages of the proposed relaxation are threefold. (i) The safety guarantees are maintained for episodic tasks and hold up to a given time horizon for continuing tasks. (ii) Despite its non-convexity, the constrained optimization problem has an arbitrarily small duality gap if the policy parametrization is rich enough. (iii) The gradients of the Lagrangian associated with the safe-learning problem can be computed easily using standard policy-gradient results and stochastic approximation tools. Leveraging these advantages, we establish that primal-dual algorithms are able to find policies that are both safe and optimal. We test the proposed approach on a navigation task in a continuous domain. The numerical results show that our algorithm is capable of dynamically adapting the policy to the environment and the required safety levels.
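A minimal sketch of the kind of primal-dual update described above, assuming stochastic estimators grad_reward, grad_safety, and safety_value built from sampled trajectories (hypothetical callables, not the paper's exact estimators) and a required safety level c:

```python
import numpy as np

def primal_dual_safe_rl(grad_reward, grad_safety, safety_value, c,
                        theta0, lam0=0.0, eta_theta=1e-2, eta_lam=1e-2,
                        n_iter=1000):
    """Generic primal-dual iteration for max_theta V_r(theta)
    subject to V_s(theta) >= c, using the Lagrangian
    L(theta, lam) = V_r(theta) + lam * (V_s(theta) - c), lam >= 0.
    """
    theta, lam = np.asarray(theta0, dtype=float), float(lam0)
    for _ in range(n_iter):
        # Primal step: stochastic gradient ascent on the Lagrangian in theta.
        theta = theta + eta_theta * (grad_reward(theta) + lam * grad_safety(theta))
        # Dual step: projected gradient descent on the multiplier.
        lam = max(0.0, lam - eta_lam * (safety_value(theta) - c))
    return theta, lam
```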
Risk-aware control, though promising for handling unexpected events, requires an exact, known dynamical model. In this work, we propose a model-free framework to learn a risk-aware controller, with a focus on linear systems. We formulate the problem as a discrete-time infinite-horizon LQR problem with a constraint on the predictive variance of the state. To solve it, we parameterize the policy by a feedback gain pair and leverage primal-dual methods to optimize it using data alone. We first study the optimization landscape of the Lagrangian function and establish strong duality despite its non-convex nature. In addition, we find that the Lagrangian function enjoys an important local gradient dominance property, which is then exploited to develop a convergent random search algorithm for learning the dual function. Furthermore, we propose a primal-dual algorithm with global convergence guarantees for learning the optimal policy-multiplier pair. Finally, we validate our results via simulations.
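A minimal sketch of a two-point random-search (zeroth-order) estimate of the gradient of the Lagrangian with respect to the gain, using only sampled cost evaluations; the callable `lagrangian`, the smoothing radius, and the sample count are illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np

def random_search_gradient(lagrangian, K, n_samples=20, radius=0.05, rng=None):
    """Two-point zeroth-order estimate of grad_K L(K) from cost evaluations.

    `lagrangian` is a hypothetical callable returning a sampled value of the
    Lagrangian (LQR cost plus multiplier-weighted variance constraint) at K.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(K)
    d = K.size
    for _ in range(n_samples):
        U = rng.standard_normal(K.shape)
        U /= np.linalg.norm(U)                    # direction on the unit sphere
        delta = lagrangian(K + radius * U) - lagrangian(K - radius * U)
        grad += (d / (2.0 * radius * n_samples)) * delta * U
    return grad
```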
In this paper, a cooperative linear quadratic regulator (LQR) problem is investigated for multi-input systems in which each input is generated by an agent in a network. The input matrices differ across agents and are held locally by the corresponding agents; they can be regarded as the different channels through which the agents actuate the multi-input system. By embedding a fully distributed information fusion strategy, a novel cooperative LQR-based controller is proposed. Each agent only needs to communicate with its neighbors, rather than sharing information globally across the network. Moreover, only joint controllability is required, so the multi-input system may be uncontrollable from any single agent, or even from an agent together with all of its neighbors. In particular, only one round of information exchange is needed at each control step, which significantly reduces the communication burden. It is proved that the boundedness (convergence) of the controller gains is guaranteed for time-varying (time-invariant) systems. Furthermore, the control performance of the entire system is ensured. Overall, the proposed controller achieves a better trade-off between control performance and communication overhead than existing centralized, decentralized, and consensus-based LQR controllers. Finally, the effectiveness of the theoretical results is illustrated by several comparative numerical examples.
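For context, a sketch of the centralized finite-horizon LQR baseline (the standard Riccati recursion) against which such cooperative controllers are typically compared; stacking all agents' input matrices into B_seq[t] is an assumption used here only to contrast with the distributed setting, and this is not the paper's distributed algorithm:

```python
import numpy as np

def centralized_lqr_gains(A_seq, B_seq, Q, R, QT):
    """Finite-horizon Riccati recursion for the centralized LQR baseline,
    with possibly time-varying A_seq[t] and B_seq[t]."""
    T = len(A_seq)
    P = QT
    gains = [None] * T
    for t in reversed(range(T)):
        A, B = A_seq[t], B_seq[t]
        # Gain K_t = (R + B^T P B)^{-1} B^T P A, then Riccati update of P.
        gains[t] = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ gains[t])
    return gains
```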
We present a framework for systematically combining data from an unknown linear time-invariant system with prior knowledge of the system matrices or of the uncertainty for robust controller design. Our approach leads to linear matrix inequality (LMI) based feasibility criteria which guarantee stability and performance robustly for all closed-loop systems consistent with the prior knowledge and the available data. The design procedures rely on a combination of multipliers inferred from prior knowledge and learnt from measured data, where for the latter a novel and unifying disturbance description is employed. While large parts of the paper focus on linear systems and input-state measurements, we also provide extensions to robust output-feedback design based on noisy input-output data and against nonlinear uncertainties. We illustrate through numerical examples that our approach provides a flexible framework for simultaneously leveraging prior knowledge and data, thereby reducing conservatism and improving performance significantly compared to black-box approaches to data-driven control.
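As a generic illustration of LMI-based feasibility certificates of this kind (not the paper's specific multiplier-based criteria), the following sketch checks a common quadratic Lyapunov certificate over a finite set of closed-loop matrices consistent with prior knowledge and data, assuming cvxpy with an SDP solver is available:

```python
import cvxpy as cp
import numpy as np

def common_lyapunov_feasibility(Acl_list, eps=1e-6):
    """Discrete-time quadratic-stability LMI: find P > 0 such that
    Acl^T P Acl - P < 0 for every closed-loop matrix in Acl_list."""
    n = Acl_list[0].shape[0]
    P = cp.Variable((n, n), symmetric=True)
    constraints = [P >> eps * np.eye(n)]
    constraints += [Acl.T @ P @ Acl - P << -eps * np.eye(n) for Acl in Acl_list]
    prob = cp.Problem(cp.Minimize(0), constraints)
    prob.solve(solver=cp.SCS)
    feasible = prob.status in ("optimal", "optimal_inaccurate")
    return feasible, P.value
```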