No Arabic abstract
Recent literature established that neural networks can represent good policies across a range of stochastic dynamic models in supply chain and logistics. We incorporate variance reduction techniques in a newly proposed algorithm, to overcome limitations of the model-free algorithms typically employed to learn such neural network policies. For the classical lost sales inventory model, the algorithm learns neural network policies that are superior to those learned using model-free algorithms, while outperforming the best heuristic benchmarks by an order of magnitude. The algorithm is an interesting candidate to apply to other stochastic dynamic problems in supply chain and logistics, because the ideas in its development are generic.
We consider the canonical periodic review lost sales inventory system with positive lead-times and stochastic i.i.d. demand under the average cost criterion. We introduce a new policy that places orders such that the expected inventory level at the time of arrival of an order is at a fixed level and call it the Projected Inventory Level (PIL) policy. We prove that this policy has a cost-rate superior to the equivalent system where excess demand is back-ordered instead of lost and is therefore asymptotically optimal as the cost of losing a sale approaches infinity under mild distributional assumptions. We further show that this policy dominates the constant order policy for any finite lead-time and is therefore asymptotically optimal as the lead-time approaches infinity for the case of exponentially distributed demand per period. Numerical results show this policy also performs superior relative to other policies.
Reinforcement learning algorithms rely on exploration to discover new behaviors, which is typically achieved by following a stochastic policy. In continuous control tasks, policies with a Gaussian distribution have been widely adopted. Gaussian exploration however does not result in smooth trajectories that generally correspond to safe and rewarding behaviors in practical tasks. In addition, Gaussian policies do not result in an effective exploration of an environment and become increasingly inefficient as the action rate increases. This contributes to a low sample efficiency often observed in learning continuous control tasks. We introduce a family of stationary autoregressive (AR) stochastic processes to facilitate exploration in continuous control domains. We show that proposed processes possess two desirable features: subsequent process observations are temporally coherent with continuously adjustable degree of coherence, and the process stationary distribution is standard normal. We derive an autoregressive policy (ARP) that implements such processes maintaining the standard agent-environment interface. We show how ARPs can be easily used with the existing off-the-shelf learning algorithms. Empirically we demonstrate that using ARPs results in improved exploration and sample efficiency in both simulated and real world domains, and, furthermore, provides smooth exploration trajectories that enable safe operation of robotic hardware.
Learning competitive behaviors in multi-agent settings such as racing requires long-term reasoning about potential adversarial interactions. This paper presents Deep Latent Competition (DLC), a novel reinforcement learning algorithm that learns competitive visual control policies through self-play in imagination. The DLC agent imagines multi-agent interaction sequences in the compact latent space of a learned world model that combines a joint transition function with opponent viewpoint prediction. Imagined self-play reduces costly sample generation in the real world, while the latent representation enables planning to scale gracefully with observation dimensionality. We demonstrate the effectiveness of our algorithm in learning competitive behaviors on a novel multi-agent racing benchmark that requires planning from image observations. Code and videos available at https://sites.google.com/view/deep-latent-competition.
Model predictive control (MPC) is an effective method for controlling robotic systems, particularly autonomous aerial vehicles such as quadcopters. However, application of MPC can be computationally demanding, and typically requires estimating the state of the system, which can be challenging in complex, unstructured environments. Reinforcement learning can in principle forego the need for explicit state estimation and acquire a policy that directly maps sensor readings to actions, but is difficult to apply to unstable systems that are liable to fail catastrophically during training before an effective policy has been found. We propose to combine MPC with reinforcement learning in the framework of guided policy search, where MPC is used to generate data at training time, under full state observations provided by an instrumented training environment. This data is used to train a deep neural network policy, which is allowed to access only the raw observations from the vehicles onboard sensors. After training, the neural network policy can successfully control the robot without knowledge of the full state, and at a fraction of the computational cost of MPC. We evaluate our method by learning obstacle avoidance policies for a simulated quadrotor, using simulated onboard sensors and no explicit state estimation at test time.
We propose a method for learning expressive energy-based policies for continuous states and actions, which has been feasible only in tabular domains before. We apply our method to learning maximum entropy policies, resulting into a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution. We use the recently proposed amortized Stein variational gradient descent to learn a stochastic sampling network that approximates samples from this distribution. The benefits of the proposed algorithm include improved exploration and compositionality that allows transferring skills between tasks, which we confirm in simulated experiments with swimming and walking robots. We also draw a connection to actor-critic methods, which can be viewed performing approximate inference on the corresponding energy-based model.