In path integral control problems, a representation of the optimally controlled dynamical system can be formally computed and serve as a guidepost for learning a parametrized policy. The Path Integral Cross-Entropy (PICE) method tries to exploit this, but is hampered by poor sample efficiency. We propose a model-free algorithm called ASPIC (Adaptive Smoothing of Path Integral Control) that applies an inf-convolution to the cost function to speed up convergence of policy optimization. We identify PICE as the infinite-smoothing limit of this technique and show that the sample efficiency problems PICE suffers from disappear for finite levels of smoothing. For zero smoothing the method reduces to a greedy optimization of the cost, which is the standard approach in current reinforcement learning. We show analytically and empirically that intermediate levels of smoothing are optimal, which renders the new method superior to both PICE and direct cost optimization.
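As a concrete, simplified illustration of the interpolation described above, the sketch below estimates a policy gradient from sampled rollouts using self-normalized exponential weights exp(-eta * S) on the path costs S. The 1D linear-Gaussian toy problem, the parameterization, and names such as eta and rollout are illustrative assumptions rather than the paper's exact inf-convolution construction: as eta tends to zero the update recovers an ordinary (greedy) policy-gradient step on the expected cost, while large eta yields a PICE-style exponentially weighted update.

import numpy as np

rng = np.random.default_rng(0)
dt, T, sigma = 0.1, 20, 0.5          # step size, horizon, control noise std

def rollout(K, n=256):
    """Sample n trajectories under the linear-Gaussian policy u = K*x + noise."""
    x = np.ones(n)                   # all rollouts start at x0 = 1
    S = np.zeros(n)                  # accumulated path cost per rollout
    score = np.zeros(n)              # d/dK log-likelihood of the sampled actions
    for _ in range(T):
        u = K * x + sigma * rng.standard_normal(n)
        score += (u - K * x) * x / sigma**2      # score of N(u; K x, sigma^2)
        S += 0.5 * (x**2 + 0.1 * u**2) * dt      # quadratic state/control cost
        x = x + u * dt                           # simple integrator dynamics
    S += 0.5 * x**2                              # terminal cost
    return S, score

def smoothed_gradient(K, eta):
    """Gradient estimate for the soft-min cost -(1/eta) log E[exp(-eta S)] w.r.t. K."""
    S, score = rollout(K)
    w = np.exp(-eta * (S - S.min()))
    w /= w.sum()                                 # self-normalized weights
    # Subtracting the uniform weight 1/n acts as a baseline; as eta -> 0 the
    # estimator tends to the usual baselined policy gradient of the expected
    # cost, while for large eta it tends to a PICE-style weighted update.
    return -(1.0 / eta) * np.sum((w - 1.0 / len(w)) * score)

K, lr, eta = 0.0, 0.05, 1.0
for it in range(200):
    K -= lr * smoothed_gradient(K, eta)
print("learned feedback gain K:", K)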
In this paper we investigate the problem of controlling a partially observed stochastic dynamical system such that its state is difficult to infer using a (fixed-interval) Bayesian smoother. This problem arises naturally in applications in which it is desirable to keep the entire state trajectory of a system concealed. We pose smoothing-averse control as the problem of maximising the (joint) entropy of smoother state estimates (i.e., the joint conditional entropy of the state trajectory given the history of measurements and controls). We show that the entropy of Bayesian smoother estimates for general nonlinear state-space models can be expressed as the sum of entropies of marginal state estimates given by Bayesian filters. This novel additive form allows us to reformulate the smoothing-averse control problem as a fully observed stochastic optimal control problem in terms of the usual concept of the information (or belief) state, and to solve the resulting problem via dynamic programming. We illustrate the applicability of smoothing-averse control to privacy in cloud-based control and to covert robotic navigation.
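To make the additive structure concrete, the short numerical check below uses a small, hypothetical 3-state HMM (transition, observation, and prior values invented for illustration). It accumulates, from the forward (filter) recursion alone, the per-stage terms H(x_k | x_{k+1}, y_{0:k}) plus the final filter entropy, and compares the total against brute-force enumeration of the joint posterior over state trajectories. This is one way to express a filter-based additive decomposition of the smoother entropy and is not necessarily the paper's exact formulation.

import itertools
import numpy as np

# A hypothetical 3-state HMM with 2 observation symbols (illustrative values).
A  = np.array([[0.8, 0.1, 0.1],
               [0.2, 0.6, 0.2],
               [0.1, 0.3, 0.6]])          # A[i, j] = P(x_{k+1} = j | x_k = i)
B  = np.array([[0.9, 0.1],
               [0.5, 0.5],
               [0.2, 0.8]])               # B[i, o] = P(y_k = o | x_k = i)
p0 = np.array([1/3, 1/3, 1/3])
y  = [0, 1, 1, 0, 1]                      # a fixed observation record, k = 0..T
T  = len(y) - 1

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Forward (filter) recursion with the additive smoother-entropy accumulation.
alpha = p0 * B[:, y[0]]
alpha /= alpha.sum()
H_sum = 0.0
for k in range(T):
    joint = alpha[:, None] * A            # p(x_k, x_{k+1} | y_{0:k})
    p_next = joint.sum(axis=0)            # p(x_{k+1} | y_{0:k})
    # per-stage term H(x_k | x_{k+1}, y_{0:k}), computed from the filter alone
    H_sum += sum(p_next[j] * H(joint[:, j] / p_next[j])
                 for j in range(len(p_next)) if p_next[j] > 0)
    alpha = p_next * B[:, y[k + 1]]       # predict-update step of the filter
    alpha /= alpha.sum()
H_additive = H_sum + H(alpha)             # add the final filter entropy H(x_T | y_{0:T})

# Brute-force joint smoother entropy for comparison.
probs = []
for traj in itertools.product(range(3), repeat=T + 1):
    p = p0[traj[0]] * B[traj[0], y[0]]
    for k in range(T):
        p *= A[traj[k], traj[k + 1]] * B[traj[k + 1], y[k + 1]]
    probs.append(p)
probs = np.array(probs) / np.sum(probs)
print("additive (filter-based) smoother entropy:", H_additive)
print("brute-force smoother entropy            :", H(probs))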
Particle smoothing methods are used for inference of stochastic processes from noisy observations. Typically, estimating the marginal posterior distribution given all observations is cumbersome and computationally intensive. In this paper, we propose a simple algorithm based on path integral control theory to estimate the smoothing distribution of continuous-time diffusion processes with partial observations. In particular, we use an adaptive importance sampling method to improve the effective sample size of the posterior over processes given the observations and the reliability of the estimation of the marginals. This is achieved by estimating a feedback controller to sample efficiently from the joint smoothing distribution. We compare the results with estimates obtained from the standard Forward Filter/Backward Simulator (FFBSi) for two diffusion processes of different complexity. We show that the proposed method gives more reliable estimates than standard FFBSi when the smoothing distribution is poorly represented by the filter distribution.
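The sketch below illustrates the controlled importance-sampling step on a hypothetical 1D Ornstein-Uhlenbeck process with Gaussian observations at every grid point; the model parameters and the simple linear feedback controller are assumptions made for illustration only. Each particle is propagated under the controlled dynamics, its log-weight accumulates the Girsanov correction for the added control drift plus the observation log-likelihoods, and the effective sample size (ESS) indicates how well the controller matches the smoothing distribution. In the full adaptive method the controller would itself be refitted iteratively from the weighted samples.

import numpy as np

rng = np.random.default_rng(2)
dt, T, sig = 0.05, 100, 0.5            # time step, number of steps, diffusion std
obs_sig = 0.3                          # observation noise std
f = lambda x: -x                       # uncontrolled drift (Ornstein-Uhlenbeck)

# Simulate a "true" latent path and noisy observations on the same grid.
x_true = np.zeros(T + 1)
for k in range(T):
    x_true[k + 1] = x_true[k] + f(x_true[k]) * dt + sig * np.sqrt(dt) * rng.standard_normal()
y = x_true[1:] + obs_sig * rng.standard_normal(T)

def smooth(controller, n=2000):
    """Controlled importance sampling of the smoothing distribution; returns the ESS."""
    x = np.zeros(n)
    logw = np.zeros(n)
    for k in range(T):
        u  = controller(x, k)                          # feedback control per particle
        dW = sig * np.sqrt(dt) * rng.standard_normal(n)
        x  = x + (f(x) + u) * dt + dW
        # Girsanov correction: log p(path) - log q(path) for the added control drift u
        logw += -(u * dW) / sig**2 - 0.5 * u**2 * dt / sig**2
        # observation log-likelihood at this grid point
        logw += -0.5 * ((y[k] - x) / obs_sig) ** 2
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)                        # effective sample size

ess_uncontrolled = smooth(lambda x, k: np.zeros_like(x))
ess_controlled   = smooth(lambda x, k: 2.0 * (y[k] - x))   # crude pull toward the data
print(f"ESS without control     : {ess_uncontrolled:.1f} / 2000")
print(f"ESS with feedback control: {ess_controlled:.1f} / 2000")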
This paper presents a constrained deep adaptive dynamic programming (CDADP) algorithm to solve general nonlinear optimal control problems with known dynamics. Unlike previous ADP algorithms, it can directly deal with problems with state constraints. Both the policy and the value function are approximated by deep neural networks (NNs), which directly map the system state to the control action and the value, respectively, without the need for hand-crafted basis functions. The proposed algorithm handles the state constraints by transforming the policy improvement process into a constrained optimization problem. Meanwhile, a trust-region constraint is added to prevent excessive policy updates. We first linearize this constrained optimization problem locally into a quadratically-constrained quadratic programming (QCQP) problem, and then obtain the optimal update of the policy network parameters by solving its dual problem. We also propose a series of recovery rules to update the policy in case the primal problem is infeasible. In addition, parallel learners are employed to explore different parts of the state space, which stabilizes and accelerates learning. A vehicle control problem in a path-tracking task is used to demonstrate the effectiveness of the proposed method.
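The core of such a policy-improvement step is a QCQP subproblem solved through its dual. The toy sketch below fixes an arbitrary quadratic model of the objective in the parameter update d, a quadratic trust-region constraint, and a single linearized state constraint, and maximizes the dual over the two multipliers numerically. The matrices and constants are random illustrative values, and the constraint forms and recovery logic of the actual CDADP algorithm are richer than this sketch.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Hypothetical local model of the policy-improvement subproblem:
#   minimize_d   g^T d + 0.5 d^T H d          (quadratic model of the objective)
#   subject to   0.5 d^T H d <= delta         (trust region on the policy update)
#                a^T d + c   <= 0             (linearized state constraint)
n = 5
M = rng.standard_normal((n, n))
H = M @ M.T + np.eye(n)                        # positive-definite curvature
g = rng.standard_normal(n)
a = rng.standard_normal(n)
c, delta = 0.05, 0.1

Hinv = np.linalg.inv(H)

def d_of(lam, nu):
    """Minimizer of the Lagrangian for fixed multipliers (lam: trust region, nu: state constraint)."""
    return -Hinv @ (g + nu * a) / (1.0 + lam)

def neg_dual(mult):
    lam, nu = mult
    d = d_of(lam, nu)
    L = g @ d + 0.5 * (1.0 + lam) * d @ H @ d + nu * (a @ d + c) - lam * delta
    return -L                                  # maximize the dual = minimize its negative

res = minimize(neg_dual, x0=[1.0, 1.0], bounds=[(0, None), (0, None)])
lam, nu = res.x
d = d_of(lam, nu)
print("parameter update d :", d)
print("trust region value :", 0.5 * d @ H @ d, "<=", delta)
print("state constraint   :", a @ d + c, "<= 0 (holds when the primal is feasible)")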
We apply the meta reinforcement learning framework to optimize an integrated and adaptive guidance and flight control system for an air-to-air missile, implementing the system as a deep neural network (the policy). The policy maps observations directly to commanded rates of change of the missile's control surface deflections, with the observations derived with minimal processing from the computationally stabilized line-of-sight unit vector measured by a strapdown seeker, the estimated rotational velocity from rate gyros, and the control surface deflection angles. The system induces intercept trajectories against a maneuvering target that satisfy control constraints on fin deflection angles and path constraints on look angle and load. We test the optimized system in a six-degree-of-freedom (6-DOF) simulator that includes a non-linear radome model and a strapdown seeker model. Through extensive simulation, we demonstrate that the system can adapt to a large flight envelope and to off-nominal flight conditions that include perturbation of aerodynamic coefficient parameters and center-of-pressure locations. Moreover, we find that the system is robust to the parasitic attitude loop induced by radome refraction, imperfect seeker stabilization, and sensor scale-factor errors. Finally, we compare our system's performance to two benchmarks: a proportional navigation guidance system in a simplified 3-DOF environment, which we take as an upper bound on the performance attainable with separate guidance and flight control systems, and a longitudinal model of proportional navigation coupled with a three-loop autopilot. We find that our system moderately outperforms the former, and outperforms the latter by a large margin.
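For orientation only, the fragment below shows the kind of observation-to-command mapping such a policy implements: a small multilayer perceptron from a stacked observation (stabilized line-of-sight unit vector, body rotational rates, current fin deflections) to rate-limited fin deflection-rate commands. The layer sizes, the 10-dimensional observation, the four fins, and the 20 deg/s rate limit are placeholder assumptions, not values from the paper, and the random weights stand in for trained parameters.

import numpy as np

rng = np.random.default_rng(4)
OBS_DIM, ACT_DIM, HIDDEN = 10, 4, 64          # assumed: 3 LOS + 3 rates + 4 fins -> 4 fin-rate commands
RATE_LIMIT = np.deg2rad(20.0)                 # assumed fin deflection-rate limit (rad/s)

# Randomly initialized weights stand in for the trained policy parameters.
W1 = rng.standard_normal((HIDDEN, OBS_DIM)) * 0.1
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((ACT_DIM, HIDDEN)) * 0.1
b2 = np.zeros(ACT_DIM)

def policy(los_unit, body_rates, fin_deflections):
    """Map raw observations to commanded fin deflection rates (saturated by tanh)."""
    obs = np.concatenate([los_unit, body_rates, fin_deflections])
    h = np.tanh(W1 @ obs + b1)
    return RATE_LIMIT * np.tanh(W2 @ h + b2)

# Example call with placeholder observation values.
cmd = policy(np.array([1.0, 0.0, 0.0]),       # stabilized line-of-sight unit vector
             np.zeros(3),                      # rate-gyro measurements (rad/s)
             np.zeros(4))                      # current fin deflection angles (rad)
print("commanded fin deflection rates:", cmd)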
Motivated by the lack of systematic tools for obtaining safe control laws for hybrid systems, we propose an optimization-based framework for learning certifiably safe control laws from data. In particular, we assume a setting in which the system dynamics are known and in which data exhibiting safe system behavior are available. We propose hybrid control barrier functions for hybrid systems as a means to synthesize safe control inputs. Based on this notion, we present an optimization-based framework to learn such hybrid control barrier functions from data. Importantly, we identify sufficient conditions on the data such that feasibility of the optimization problem ensures correctness of the learned hybrid control barrier functions, and hence the safety of the system. We illustrate our findings in two simulation studies, including a compass gait walker.
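A minimal, non-hybrid sketch of this kind of optimization is given below: a candidate barrier function h(x) = theta^T phi(x) with quadratic features is fit to safe state-input data for a known single-integrator system by a linear program that enforces h >= margin on the safe samples and the decrease condition grad h(x)^T (f(x) + g(x) u) + alpha h(x) >= margin along the demonstrated inputs, while requiring h <= -margin on an assumed ring of unsafe states and maximizing the margin. The dynamics, feature map, data, alpha, and unsafe samples are all invented for illustration; the paper's hybrid formulation and its sufficient conditions on the data are substantially more involved.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)

# Known dynamics (single integrator): x_dot = u, i.e. f(x) = 0, g(x) = I.
f = lambda x: np.zeros(2)
g = lambda x: np.eye(2)
alpha = 1.0                                        # class-K gain in the barrier condition

# Assumed demonstration data: safe states in a box with inputs pointing inward,
# plus a ring of unsafe states where the barrier function must be negative.
X_safe = rng.uniform(-0.6, 0.6, size=(60, 2))
U_safe = -X_safe
angles = np.linspace(0, 2 * np.pi, 20, endpoint=False)
X_unsafe = 1.5 * np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Linearly parameterized candidate h(x) = theta^T phi(x) with quadratic features.
phi  = lambda x: np.array([1.0, x[0], x[1], x[0]**2, x[1]**2, x[0]*x[1]])
dphi = lambda x: np.array([[0, 0], [1, 0], [0, 1],
                           [2*x[0], 0], [0, 2*x[1]], [x[1], x[0]]])   # rows are grad phi_j

# Decision variables z = [theta (6), margin]; maximize the margin subject to
#   h(x) >= margin and grad h(x)^T (f(x) + g(x) u) + alpha h(x) >= margin on safe data,
#   h(x) <= -margin on unsafe data.
A_ub, b_ub = [], []
for x, u in zip(X_safe, U_safe):
    A_ub.append(np.append(-phi(x), 1.0))                       # margin - h(x) <= 0
    grad_coeff = dphi(x) @ (f(x) + g(x) @ u)                   # grad h^T (f + g u), linear in theta
    A_ub.append(np.append(-(grad_coeff + alpha * phi(x)), 1.0))
    b_ub += [0.0, 0.0]
for x in X_unsafe:
    A_ub.append(np.append(phi(x), 1.0))                        # h(x) + margin <= 0
    b_ub.append(0.0)

cost = np.zeros(7); cost[-1] = -1.0                             # maximize the margin
bounds = [(-1, 1)] * 6 + [(0, None)]                            # keep theta bounded
res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
theta, margin = res.x[:6], res.x[-1]
print("learned theta:", np.round(theta, 3), " margin:", round(margin, 3))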