The ability to discover approximately optimal policies in domains with sparse rewards is crucial to applying reinforcement learning (RL) in many real-world scenarios. Approaches such as neural density models and continuous exploration (e.g., Go-Explore) have been proposed to maintain the high exploration rate necessary to find high-performing and generalizable policies. Soft actor-critic (SAC) is another method for improving exploration that aims to combine efficient learning via off-policy updates with maximization of the policy entropy. In this work, we extend SAC to a richer class of probability distributions (e.g., multimodal) through normalizing flows (NF) and show that this significantly improves performance by accelerating the discovery of good policies while using much smaller policy representations. Our approach, which we call SAC-NF, is a simple, efficient, and easy-to-implement modification to SAC that improves performance on continuous control baselines such as the MuJoCo and PyBullet Roboschool domains. Finally, SAC-NF is significantly more parameter-efficient, using as few as 5.5% of the parameters of an equivalent SAC model.
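To make the construction concrete, below is a minimal, hypothetical sketch of a normalizing-flow policy head: a state-conditioned Gaussian base sample pushed through a small stack of planar flows. The class names (PlanarFlow, NFPolicy), layer sizes, the choice of planar flows, and the omission of tanh action squashing are illustrative assumptions, not the authors' SAC-NF implementation.

```python
# Illustrative sketch only: a state-conditioned Gaussian base pushed through
# planar flows, usable as a multimodal policy head in a SAC-style agent.
# (Hypothetical names and sizes; not the authors' SAC-NF code.)
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class PlanarFlow(nn.Module):
    """One planar flow layer: f(z) = z + u * tanh(w^T z + b)."""

    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(0.01 * torch.randn(dim))
        self.w = nn.Parameter(0.01 * torch.randn(dim))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        # Constrain u so that w^T u >= -1, which keeps the layer invertible.
        wu = (self.w * self.u).sum()
        u_hat = self.u + (F.softplus(wu) - 1.0 - wu) * self.w / (self.w.norm() ** 2 + 1e-8)
        lin = z @ self.w + self.b                          # shape (batch,)
        f_z = z + torch.tanh(lin).unsqueeze(-1) * u_hat    # shape (batch, dim)
        psi = (1.0 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w
        log_det = torch.log(torch.abs(1.0 + psi @ u_hat) + 1e-8)
        return f_z, log_det


class NFPolicy(nn.Module):
    """Gaussian base conditioned on the state, followed by K planar flows."""

    def __init__(self, state_dim, action_dim, hidden=64, n_flows=4):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)
        self.flows = nn.ModuleList([PlanarFlow(action_dim) for _ in range(n_flows)])

    def forward(self, state):
        h = self.trunk(state)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-5.0, 2.0)
        eps = torch.randn_like(mu)
        z = mu + log_std.exp() * eps                       # reparameterized base sample
        log_prob = (-0.5 * eps ** 2 - log_std - 0.5 * math.log(2 * math.pi)).sum(-1)
        for flow in self.flows:
            z, log_det = flow(z)
            log_prob = log_prob - log_det                  # change of variables
        return z, log_prob                                 # action and its log-density
```

In a SAC-style actor update, the returned action and its log-density would simply replace those of the usual Gaussian policy; only the policy network changes.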
For many applications, such as computing expected values of quantities of interest, sampling from a known probability density function (the target density) is crucial but often intractable via the inverse transform. In these cases, rejection and importance sampling
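As a reminder of how the latter works in its simplest self-normalized form, here is a generic sketch with an arbitrary example target (not code from the paper above):

```python
# Generic importance-sampling sketch: estimate E_p[h(X)] for a target density p
# that we can evaluate but not sample from, using a proposal q that we can both
# evaluate and sample from. The mixture target and Gaussian proposal are
# illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def target_pdf(x):
    # Example target: equal mixture of N(2, 1) and N(-2, 1).
    return 0.5 * np.exp(-0.5 * (x - 2) ** 2) / np.sqrt(2 * np.pi) \
         + 0.5 * np.exp(-0.5 * (x + 2) ** 2) / np.sqrt(2 * np.pi)

def proposal_sample(n):
    # Wide Gaussian proposal covering the target's support.
    return rng.normal(loc=0.0, scale=4.0, size=n)

def proposal_pdf(x):
    return np.exp(-0.5 * (x / 4.0) ** 2) / (4.0 * np.sqrt(2 * np.pi))

def importance_estimate(h, n=100_000):
    x = proposal_sample(n)
    w = target_pdf(x) / proposal_pdf(x)          # importance weights
    return np.sum(w * h(x)) / np.sum(w)          # self-normalized estimator

print(importance_estimate(lambda x: x ** 2))     # E_p[X^2] = 5 for this mixture
```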
When performing imitation learning from expert demonstrations, distribution matching is a popular approach, in which one alternates between estimating distribution ratios and using these ratios as rewards in a standard reinforcement learning (RL)
In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. Typically, the performance of deep RL algorithms is evaluated via on-policy interactions with the target environment. However, c
A normalizing flow is an invertible mapping between an arbitrary probability distribution and a standard normal distribution; it can be used for density estimation and statistical inference. Computing the flow follows the change of variables formula
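In the usual notation (restated here for convenience rather than quoted from the abstract), if $f$ is the invertible map sending a sample $x$ with unknown density $p_X$ to the standard normal base variable $z = f(x)$ with density $p_Z$, the change of variables formula reads

\[
\log p_X(x) = \log p_Z\big(f(x)\big) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|,
\]

so density estimation requires only the forward map and its Jacobian determinant, while sampling uses the inverse map $x = f^{-1}(z)$.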
Off-policy reinforcement learning aims to leverage experience collected from prior policies for sample-efficient learning. However, in practice, commonly used off-policy approximate dynamic programming methods based on Q-learning and actor-critic methods