
Geometry of Policy Improvement

Posted by Johannes Rauh
Publication date: 2017
Research field: Informatics
Paper language: English





We investigate the geometry of optimal memoryless, time-independent decision making in relation to the amount of information that the acting agent has about the state of the system. We show that the expected long-term reward, discounted or per time step, is maximized by policies that randomize among at most $k$ actions whenever at most $k$ world states are consistent with the agent's observation. Moreover, we show that the expected reward per time step can be studied in terms of the expected discounted reward. Our main tool is a geometric version of the policy improvement lemma, which identifies a polyhedral cone of policy changes in which the state value function increases for all states.
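For reference, the classical policy improvement lemma that the paper generalizes can be stated as follows; the notation here is ours and is not taken from the abstract. Let $V^\pi(s)$ denote the expected discounted reward of a memoryless policy $\pi$ started in state $s$, and let $Q^\pi(s,a)$ denote the value of taking action $a$ in $s$ and following $\pi$ thereafter. If a policy $\pi'$ satisfies $\sum_a \pi'(a \mid s)\, Q^\pi(s,a) \ge V^\pi(s)$ for every state $s$, then $V^{\pi'}(s) \ge V^\pi(s)$ for every state $s$. The geometric version described in the abstract characterizes a polyhedral cone of policy changes $\pi' - \pi$ along which this state-wise inequality is guaranteed to hold.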




Read also

This paper investigates a type of instability that is linked to greedy policy improvement in approximate reinforcement learning. We show empirically that non-deterministic policy improvement can stabilize methods like LSPI by controlling the improvement's stochasticity. Additionally, we show that a suitable representation of the value function also stabilizes the solution to some degree. The presented approach is simple and should also be easily transferable to more sophisticated algorithms like deep reinforcement learning.
While deep reinforcement learning has achieved promising results in challenging decision-making tasks, the backbone of its success, deep neural networks, is mostly a black box. A feasible way to gain insight into a black-box model is to distill it into an interpretable model such as a decision tree, which consists of if-then rules and is easy to grasp and verify. However, traditional model distillation is usually a supervised learning task under a stationary data distribution assumption, which is violated in reinforcement learning. Therefore, a typical policy distillation that clones model behaviors with even a small error can cause a data distribution shift, resulting in an unsatisfactory distilled policy model with low fidelity or low performance. In this paper, we propose to address this issue by changing the distillation objective from behavior cloning to maximizing an advantage evaluation. The novel distillation objective maximizes an approximated cumulative reward and focuses more on disastrous behaviors in critical states, which controls the data shift effect. We evaluate our method on several Gym tasks, a commercial fighting game, and a self-driving car simulator. The empirical results show that the proposed method can preserve a higher cumulative reward than behavior cloning and learns a policy more consistent with the original one. Moreover, by examining the rules extracted from the distilled decision trees, we demonstrate that the proposed method delivers reasonable and robust decisions.
We propose an imitation learning system for autonomous driving in urban traffic with interactions. We train a Behavioral Cloning (BC) policy to imitate driving behavior collected from real urban traffic, and apply the data aggregation algorithm to improve its performance iteratively. Applying data aggregation in this setting comes with two challenges. The first challenge is that it is expensive and dangerous to collect online rollout data in real urban traffic, and creating similar traffic scenarios in a simulator like CARLA for online rollout collection can also be difficult. Instead, we propose to create a weak simulator from the training dataset, in which all the surrounding vehicles follow the trajectories provided by the dataset. We find that the online data collected in such a simulator can still be used to improve the BC policy's performance. The second challenge is the tedious and time-consuming human labelling process during online rollouts. To solve this problem, we use an A$^*$ planner as a pseudo-expert to provide expert-like demonstrations. We validate our proposed imitation learning system in real urban traffic scenarios. The experimental results show that our system can significantly improve the performance of the baseline BC policy.
Nicolas Le Roux, 2016
We tackle the issue of finding a good policy when the number of policy updates is limited. This is done by approximating the expected policy reward as a sequence of concave lower bounds which can be efficiently maximized, drastically reducing the number of policy updates required to achieve good performance. We also extend existing methods to negative rewards, enabling the use of control variates.
We propose a policy improvement algorithm for Reinforcement Learning (RL) called Rerouted Behavior Improvement (RBI). RBI is designed to take into account the evaluation errors of the Q-function. Such errors are common in RL when learning the $Q$-value from finite past experience data. Greedy policies, or even constrained policy optimization algorithms that ignore these errors, may suffer from an improvement penalty (i.e. a negative policy improvement). To minimize the improvement penalty, the RBI idea is to attenuate rapid policy changes of low-probability actions which were less frequently sampled. This approach is shown to avoid catastrophic performance degradation and reduce regret when learning from a batch of past experience. Through a two-armed bandit example with Gaussian-distributed rewards, we show that it also increases data efficiency when the optimal action has a high variance. We evaluate RBI on two tasks in the Atari Learning Environment: (1) learning from observations of multiple behavior policies and (2) iterative RL. Our results demonstrate the advantage of RBI over greedy policies and other constrained policy optimization algorithms as a safe learning approach and as a generally data-efficient learning algorithm. An anonymous Github repository of our RBI implementation can be found at https://github.com/eladsar/rbi.
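As a rough illustration of the attenuation idea, the sketch below caps how much probability a rarely sampled action may gain relative to the behavior policy before renormalizing. The sketch, including the function name and the max_ratio and min_count parameters, is ours and only approximates the spirit of RBI; the paper's exact rule is in the repository linked above.

import numpy as np

def attenuated_policy(q_values, behavior_probs, counts, max_ratio=2.0, min_count=10):
    # Start from the greedy policy over the (noisy) Q-value estimates.
    new_probs = np.zeros_like(behavior_probs)
    new_probs[np.argmax(q_values)] = 1.0
    # For rarely sampled actions the Q estimate is unreliable, so cap the
    # probability they may receive at max_ratio times their behavior probability.
    cap = np.where(counts < min_count, max_ratio * behavior_probs, 1.0)
    new_probs = np.minimum(new_probs, cap)
    # Return the mass removed by the cap to the behavior policy and renormalize.
    new_probs = new_probs + (1.0 - new_probs.sum()) * behavior_probs
    return new_probs / new_probs.sum()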
