No Arabic abstract
We introduce a novel method to train agents of reinforcement learning (RL) by sharing knowledge in a way similar to the concept of using a book. The recorded information in the form of a book is the main means by which humans learn knowledge. Nevertheless, the conventional deep RL methods have mainly focused either on experiential learning where the agent learns through interactions with the environment from the start or on imitation learning that tries to mimic the teacher. Contrary to these, our proposed book learning shares key information among different agents in a book-like manner by delving into the following two characteristic features: (1) By defining the linguistic function, input states can be clustered semantically into a relatively small number of core clusters, which are forwarded to other RL agents in a prescribed manner. (2) By defining state priorities and the contents for recording, core experiences can be selected and stored in a small container. We call this container as `BOOK. Our method learns hundreds to thousand times faster than the conventional methods by learning only a handful of core cluster information, which shows that deep RL agents can effectively learn through the shared knowledge from other agents.
The growth in online goods delivery is causing a dramatic surge in urban vehicle traffic from last-mile deliveries. On the other hand, ride-sharing has been on the rise with the success of ride-sharing platforms and increased research on using autonomous vehicle technologies for routing and matching. The future of urban mobility for passengers and goods relies on leveraging new methods that minimize operational costs and environmental footprints of transportation systems. This paper considers combining passenger transportation with goods delivery to improve vehicle-based transportation. Even though the problem has been studied with a defined dynamics model of the transportation system environment, this paper considers a model-free approach that has been demonstrated to be adaptable to new or erratic environment dynamics. We propose FlexPool, a distributed model-free deep reinforcement learning algorithm that jointly serves passengers & goods workloads by learning optimal dispatch policies from its interaction with the environment. The proposed algorithm pools passengers for a ride-sharing service and delivers goods using a multi-hop transit method. These flexibilities decrease the fleets operational cost and environmental footprint while maintaining service levels for passengers and goods. Through simulations on a realistic multi-agent urban mobility platform, we demonstrate that FlexPool outperforms other model-free settings in serving the demands from passengers & goods. FlexPool achieves 30% higher fleet utilization and 35% higher fuel efficiency in comparison to (i) model-free approaches where vehicles transport a combination of passengers & goods without the use of multi-hop transit, and (ii) model-free approaches where vehicles exclusively transport either passengers or goods.
Sepsis is a leading cause of mortality in intensive care units and costs hospitals billions annually. Treating a septic patient is highly challenging, because individual patients respond very differently to medical interventions and there is no universally agreed-upon treatment for sepsis. In this work, we propose an approach to deduce treatment policies for septic patients by using continuous state-space models and deep reinforcement learning. Our model learns clinically interpretable treatment policies, similar in important aspects to the treatment policies of physicians. The learned policies could be used to aid intensive care clinicians in medical decision making and improve the likelihood of patient survival.
Deep reinforcement learning (DRL) methods such as the Deep Q-Network (DQN) have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This success is mainly attributed to the power of deep neural networks to learn rich domain representations for approximating the value function or policy. Batch reinforcement learning methods with linear representations, on the other hand, are more stable and require less hyper parameter tuning. Yet, substantial feature engineering is necessary to achieve good results. In this work we propose a hybrid approach -- the Least Squares Deep Q-Network (LS-DQN), which combines rich feature representations learned by a DRL algorithm with the stability of a linear least squares method. We do this by periodically re-training the last hidden layer of a DRL network with a batch least squares update. Key to our approach is a Bayesian regularization term for the least squares update, which prevents over-fitting to the more recent data. We tested LS-DQN on five Atari games and demonstrate significant improvement over vanilla DQN and Double-DQN. We also investigated the reasons for the superior performance of our method. Interestingly, we found that the performance improvement can be attributed to the large batch size used by the LS method when optimizing the last layer.
This is a brief technical note to clarify some of the issues with applying the application of the algorithm posterior sampling for reinforcement learning (PSRL) in environments without fixed episodes. In particular, this paper aims to: - Review some of results which have been proven for finite horizon MDPs (Osband et al 2013, 2014a, 2014b, 2016) and also for MDPs with finite ergodic structure (Gopalan et al 2014). - Review similar results for optimistic algorithms in infinite horizon problems (Jaksch et al 2010, Bartlett and Tewari 2009, Abbasi-Yadkori and Szepesvari 2011), with particular attention to the dynamic episode growth. - Highlight the delicate technical issue which has led to a fault in the proof of the lazy-PSRL algorithm (Abbasi-Yadkori and Szepesvari 2015). We present an explicit counterexample to this style of argument. Therefore, we suggest that the Theorem 2 in (Abbasi-Yadkori and Szepesvari 2015) be instead considered a conjecture, as it has no rigorous proof. - Present pragmatic approaches to apply PSRL in infinite horizon problems. We conjecture that, under some additional assumptions, it will be possible to obtain bounds $O( sqrt{T} )$ even without episodic reset. We hope that this note serves to clarify existing results in the field of reinforcement learning and provides interesting motivation for future work.
Although different learning systems are coordinated to afford complex behavior, little is known about how this occurs. This article describes a theoretical framework that specifies how complex behaviors that might be thought to require error-driven learning might instead be acquired through simple reinforcement. This framework includes specific assumptions about the mechanisms that contribute to the evolution of (artificial) neural networks to generate topologies that allow the networks to learn large-scale complex problems using only information about the quality of their performance. The practical and theoretical implications of the framework are discussed, as are possible biological analogs of the approach.