Policy optimization methods have shown great promise in solving complex reinforcement and imitation learning tasks. While model-free methods are broadly applicable, they often require many samples to optimize complex policies. Model-based methods greatly improve sample efficiency, but at the cost of poor generalization: they require a carefully handcrafted model of the system dynamics for each task. Recently, hybrid methods have been successful in trading off applicability for improved sample complexity. However, these have been limited to continuous action spaces. In this work, we present a new hybrid method based on an approximation of the dynamics as an expectation over the next state under the current policy. This relaxation allows us to derive a novel hybrid policy gradient estimator, combining score function and pathwise derivative estimators, that is applicable to discrete action spaces. We show significant gains in sample complexity, ranging between $1.7\times$ and $25\times$, when learning parameterized policies on Cart Pole, Acrobot, Mountain Car and Hand Mass. Our method is applicable to both discrete and continuous action spaces, whereas competing pathwise methods are limited to the latter.
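To make the combination concrete, here is a minimal sketch (in PyTorch, assumed since the abstract names no framework) of the two gradient pathways the abstract describes: a score-function term computed on a sampled discrete action, and a pathwise term obtained by taking the dynamics in expectation over actions under the current policy. The `policy`, the dynamics stand-in `model`, and `reward` are hypothetical toy definitions, and the naive sum of the two terms is illustrative only; the paper's actual estimator combines them differently.

import torch

torch.manual_seed(0)
n_states, n_actions = 4, 3
policy = torch.nn.Linear(n_states, n_actions)       # logits over discrete actions
action_effects = torch.randn(n_actions, n_states)   # hypothetical dynamics parameters

def model(s, a_vec):
    # Differentiable stand-in for the dynamics; a_vec is either a one-hot
    # action or, in the relaxed case, the policy's action distribution.
    return s + 0.1 * a_vec @ action_effects

def reward(s):
    return -(s ** 2).sum(-1)

s = torch.randn(1, n_states)
probs = torch.softmax(policy(s), dim=-1)
dist = torch.distributions.Categorical(probs)

# Score-function (REINFORCE) term: weight the sampled return by
# grad log pi(a|s); applicable to discrete actions but high-variance.
a = dist.sample()
a_onehot = torch.nn.functional.one_hot(a, n_actions).float()
score_term = reward(model(s, a_onehot)).detach() * dist.log_prob(a)

# Pathwise term via the abstract's relaxation: evaluate the dynamics at the
# expectation over actions under the current policy, so the gradient flows
# through the action probabilities themselves despite the discrete actions.
pathwise_term = reward(model(s, probs))

# Naive combination for illustration; backward() propagates both pathways.
loss = -(score_term + pathwise_term).mean()
loss.backward()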
Being able to reason in an environment with a large number of discrete actions is essential to bringing reinforcement learning to a larger class of problems. Recommender systems, industrial plants and language models are only some of the many real-world domains with large discrete action spaces.
Deterministic-policy actor-critic algorithms for continuous control improve the actor by plugging its actions into the critic and ascending the action-value gradient, which is obtained by chaining the actor's Jacobian matrix with the gradient of the critic with respect to actions.
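The chained gradient described here is what automatic differentiation computes when the critic is evaluated at the actor's output. A minimal sketch in PyTorch, with hypothetical toy networks standing in for the actor and critic:

import torch

state_dim, action_dim = 3, 2
actor = torch.nn.Linear(state_dim, action_dim)               # deterministic policy mu(s)
critic = torch.nn.Linear(state_dim + action_dim, 1)          # action-value Q(s, a)

s = torch.randn(8, state_dim)
a = actor(s)                                  # actions plugged into the critic
q = critic(torch.cat([s, a], dim=-1))         # Q(s, mu(s))

# Ascending the action-value gradient = minimizing -Q; backward() applies
# the chain rule dQ/dtheta = (da/dtheta)^T dQ/da, i.e. the actor's Jacobian
# chained with the critic's gradient with respect to actions.
opt = torch.optim.SGD(actor.parameters(), lr=1e-3)
opt.zero_grad()
(-q.mean()).backward()
opt.step()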
Maintaining the stability of the modern power grid is becoming increasingly difficult due to fluctuating power consumption, unstable power supply from renewable energies, and unpredictable accidents such as man-made and natural disasters.
We tackle the issue of finding a good policy when the number of policy updates is limited. This is done by approximating the expected policy reward as a sequence of concave lower bounds which can be efficiently maximized, drastically reducing the number of policy updates required.
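In symbols, this is a minorize-maximize scheme (written generically here; the abstract does not specify the exact bound): each surrogate $L_{\theta_k}$ is concave, lower-bounds the expected reward $J$, and is tight at the current iterate, so maximizing it can only improve $J$:
\[
L_{\theta_k}(\theta) \le J(\theta) \quad \forall \theta, \qquad L_{\theta_k}(\theta_k) = J(\theta_k), \qquad \theta_{k+1} = \arg\max_{\theta} L_{\theta_k}(\theta),
\]
which guarantees $J(\theta_{k+1}) \ge L_{\theta_k}(\theta_{k+1}) \ge L_{\theta_k}(\theta_k) = J(\theta_k)$.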
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension, making it challenging to apply existing on-policy, gradient-based deep RL algorithms.