Policy Distillation

148 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Andrei Rusu

تاريخ النشر 2015

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Andrei A. Rusu - Sergio Gomez Colmenarejo - Caglar Gulcehre

التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Policies for complex visual tasks have been successfully learned with deep reinforcement learning, using an approach called deep Q-networks (DQN), but relatively large (task-specific) networks and extensive training are needed to achieve good performance. In this work, we present a novel method called policy distillation that can be used to extract the policy of a reinforcement learning agent and train a new network that performs at the expert level while being dramatically smaller and more efficient. Furthermore, the same method can be used to consolidate multiple task-specific policies into a single policy. We demonstrate these claims using the Atari domain and show that the multi-task distilled agent outperforms the single-task teachers as well as a jointly-trained DQN agent.

قيم البحث

اقرأ أيضاً

Distilling Policy Distillation

83 - Wojciech Marian Czarnecki , Razvan Pascanu , Simon Osindero 2019

The transfer of knowledge from one policy to another is an important tool in Deep Reinforcement Learning. This process, referred to as distillation, has been used to great success, for example, by enhancing the optimisation of agents, leading to stro nger performance faster, on harder domains [26, 32, 5, 8]. Despite the widespread use and conceptual simplicity of distillation, many different formulations are used in practice, and the subtle variations between them can often drastically change the performance and the resulting objective that is being optimised. In this work, we rigorously explore the entire landscape of policy distillation, comparing the motivations and strengths of each variant through theoretical and empirical analysis. Our results point to three distillation techniques, that are preferred depending on specifics of the task. Specifically a newly proposed expected entropy regularised distillation allows for quicker learning in a wide range of situations, while still guaranteeing convergence.

التعلم الآلي الذكاء الاصطناعي التعلم الالي

Dual Policy Distillation

87 - Kwei-Herng Lai , Daochen Zha , Yuening Li 2020

Policy distillation, which transfers a teacher policy to a student policy has achieved great success in challenging tasks of deep reinforcement learning. This teacher-student framework requires a well-trained teacher model which is computationally ex pensive. Moreover, the performance of the student model could be limited by the teacher model if the teacher model is not optimal. In the light of collaborative learning, we study the feasibility of involving joint intellectual efforts from diverse perspectives of student models. In this work, we introduce dual policy distillation(DPD), a student-student framework in which two learners operate on the same environment to explore different perspectives of the environment and extract knowledge from each other to enhance their learning. The key challenge in developing this dual learning framework is to identify the beneficial knowledge from the peer learner for contemporary learning-based reinforcement learning algorithms, since it is unclear whether the knowledge distilled from an imperfect and noisy peer learner would be helpful. To address the challenge, we theoretically justify that distilling knowledge from a peer learner will lead to policy improvement and propose a disadvantageous distillation strategy based on the theoretical results. The conducted experiments on several continuous control tasks show that the proposed framework achieves superior performance with a learning-based agent and function approximation without the use of expensive teacher models.

التعلم الآلي الذكاء الاصطناعي

Neural-to-Tree Policy Distillation with Policy Improvement Criterion

81 - Zhao-Hua Li , Yang Yu , Yingfeng Chen 2021

While deep reinforcement learning has achieved promising results in challenging decision-making tasks, the main bones of its success --- deep neural networks are mostly black-boxes. A feasible way to gain insight into a black-box model is to distill it into an interpretable model such as a decision tree, which consists of if-then rules and is easy to grasp and be verified. However, the traditional model distillation is usually a supervised learning task under a stationary data distribution assumption, which is violated in reinforcement learning. Therefore, a typical policy distillation that clones model behaviors with even a small error could bring a data distribution shift, resulting in an unsatisfied distilled policy model with low fidelity or low performance. In this paper, we propose to address this issue by changing the distillation objective from behavior cloning to maximizing an advantage evaluation. The novel distillation objective maximizes an approximated cumulative reward and focuses more on disastrous behaviors in critical states, which controls the data shift effect. We evaluate our method on several Gym tasks, a commercial fight game, and a self-driving car simulator. The empirical results show that the proposed method can preserve a higher cumulative reward than behavior cloning and learn a more consistent policy to the original one. Moreover, by examining the extracted rules from the distilled decision trees, we demonstrate that the proposed method delivers reasonable and robust decisions.

التعلم الآلي

Policy Distillation and Value Matching in Multiagent Reinforcement Learning

122 - Samir Wadhwania , Dong-Ki Kim , Shayegan Omidshafiei 2019

Multiagent reinforcement learning algorithms (MARL) have been demonstrated on complex tasks that require the coordination of a team of multiple agents to complete. Existing works have focused on sharing information between agents via centralized crit ics to stabilize learning or through communication to increase performance, but do not generally look at how information can be shared between agents to address the curse of dimensionality in MARL. We posit that a multiagent problem can be decomposed into a multi-task problem where each agent explores a subset of the state space instead of exploring the entire state space. This paper introduces a multiagent actor-critic algorithm and method for combining knowledge from homogeneous agents through distillation and value-matching that outperforms policy distillation alone and allows further learning in both discrete and continuous action spaces.

التعلم الآلي الذكاء الاصطناعي التعلم الالي

Hierarchical Decomposition of Nonlinear Dynamics and Control for System Identification and Policy Distillation

109 - Hany Abdulsamad , Jan Peters 2020

The control of nonlinear dynamical systems remains a major challenge for autonomous agents. Current trends in reinforcement learning (RL) focus on complex representations of dynamics and policies, which have yielded impressive results in solving a va riety of hard control tasks. However, this new sophistication and extremely over-parameterized models have come with the cost of an overall reduction in our ability to interpret the resulting policies. In this paper, we take inspiration from the control community and apply the principles of hybrid switching systems in order to break down complex dynamics into simpler components. We exploit the rich representational power of probabilistic graphical models and derive an expectation-maximization (EM) algorithm for learning a sequence model to capture the temporal structure of the data and automatically decompose nonlinear dynamics into stochastic switching linear dynamical systems. Moreover, we show how this framework of switching models enables extracting hierarchies of Markovian and auto-regressive locally linear controllers from nonlinear experts in an imitation learning scenario.

التعلم الآلي التعلم الالي