ترغب بنشر مسار تعليمي؟ اضغط هنا

Adaptive Experience Selection for Policy Gradient

352   0   0.0 ( 0 )
 نشر من قبل Saad Mohamad
 تاريخ النشر 2020
والبحث باللغة English




اسأل ChatGPT حول البحث

Policy gradient reinforcement learning (RL) algorithms have achieved impressive performance in challenging learning tasks such as continuous control, but suffer from high sample complexity. Experience replay is a commonly used approach to improve sample efficiency, but gradient estimators using past trajectories typically have high variance. Existing sampling strategies for experience replay like uniform sampling or prioritised experience replay do not explicitly try to control the variance of the gradient estimates. In this paper, we propose an online learning algorithm, adaptive experience selection (AES), to adaptively learn an experience sampling distribution that explicitly minimises this variance. Using a regret minimisation approach, AES iteratively updates the experience sampling distribution to match the performance of a competitor distribution assumed to have optimal variance. Sample non-stationarity is addressed by proposing a dynamic (i.e. time changing) competitor distribution for which a closed-form solution is proposed. We demonstrate that AES is a low-regret algorithm with reasonable sample complexity. Empirically, AES has been implemented for deep deterministic policy gradient and soft actor critic algorithms, and tested on 8 continuous control tasks from the OpenAI Gym library. Ours results show that AES leads to significantly improved performance compared to currently available experience sampling strategies for policy gradient.

قيم البحث

اقرأ أيضاً

We study reinforcement learning of chatbots with recurrent neural network architectures when the rewards are noisy and expensive to obtain. For instance, a chatbot used in automated customer service support can be scored by quality assurance agents, but this process can be expensive, time consuming and noisy. Previous reinforcement learning work for natural language processing uses on-policy updates and/or is designed for on-line learning settings. We demonstrate empirically that such strategies are not appropriate for this setting and develop an off-policy batch policy gradient method (BPG). We demonstrate the efficacy of our method via a series of synthetic experiments and an Amazon Mechanical Turk experiment on a restaurant recommendations dataset.
In many real-world applications of reinforcement learning (RL), interactions with the environment are limited due to cost or feasibility. This presents a challenge to traditional RL algorithms since the max-return objective involves an expectation ov er on-policy samples. We introduce a new formulation of max-return optimization that allows the problem to be re-expressed by an expectation over an arbitrary behavior-agnostic and off-policy data distribution. We first derive this result by considering a regularized version of the dual max-return objective before extending our findings to unregularized objectives through the use of a Lagrangian formulation of the linear programming characterization of Q-values. We show that, if auxiliary dual variables of the objective are optimized, then the gradient of the off-policy objective is exactly the on-policy policy gradient, without any use of importance weighting. In addition to revealing the appealing theoretical properties of this approach, we also show that it delivers good practical performance.
Gaussian Processes (textbf{GPs}) are flexible non-parametric models with strong probabilistic interpretation. While being a standard choice for performing inference on time series, GPs have few techniques to work in a streaming setting. cite{bui2017s treaming} developed an efficient variational approach to train online GPs by using sparsity techniques: The whole set of observations is approximated by a smaller set of inducing points (textbf{IPs}) and moved around with new data. Both the number and the locations of the IPs will affect greatly the performance of the algorithm. In addition to optimizing their locations, we propose to adaptively add new points, based on the properties of the GP and the structure of the data.
Financial markets are complex environments that produce enormous amounts of noisy and non-stationary data. One fundamental problem is online portfolio selection, the goal of which is to exploit this data to sequentially select portfolios of assets to achieve positive investment outcomes while managing risks. Various algorithms have been proposed for solving this problem in fields such as finance, statistics and machine learning, among others. Most of the methods have parameters that are estimated from backtests for good performance. Since these algorithms operate on non-stationary data that reflects the complexity of financial markets, we posit that adaptively tuning these parameters in an intelligent manner is a remedy for dealing with this complexity. In this paper, we model the mapping between the parameter space and the space of performance metrics using a Gaussian process prior. We then propose an oracle based on adaptive Bayesian optimization for automatically and adaptively configuring online portfolio selection methods. We test the efficacy of our solution on algorithms operating on equity and index data from various markets.
We consider the problem of model selection for two popular stochastic linear bandit settings, and propose algorithms that adapts to the unknown problem complexity. In the first setting, we consider the $K$ armed mixture bandits, where the mean reward of arm $i in [K]$, is $mu_i+ langle alpha_{i,t},theta^* rangle $, with $alpha_{i,t} in mathbb{R}^d$ being the known context vector and $mu_i in [-1,1]$ and $theta^*$ are unknown parameters. We define $|theta^*|$ as the problem complexity and consider a sequence of nested hypothesis classes, each positing a different upper bound on $|theta^*|$. Exploiting this, we propose Adaptive Linear Bandit (ALB), a novel phase based algorithm that adapts to the true problem complexity, $|theta^*|$. We show that ALB achieves regret scaling of $O(|theta^*|sqrt{T})$, where $|theta^*|$ is apriori unknown. As a corollary, when $theta^*=0$, ALB recovers the minimax regret for the simple bandit algorithm without such knowledge of $theta^*$. ALB is the first algorithm that uses parameter norm as model section criteria for linear bandits. Prior state of art algorithms cite{osom} achieve a regret of $O(Lsqrt{T})$, where $L$ is the upper bound on $|theta^*|$, fed as an input to the problem. In the second setting, we consider the standard linear bandit problem (with possibly an infinite number of arms) where the sparsity of $theta^*$, denoted by $d^* leq d$, is unknown to the algorithm. Defining $d^*$ as the problem complexity, we show that ALB achieves $O(d^*sqrt{T})$ regret, matching that of an oracle who knew the true sparsity level. This methodology is then extended to the case of finitely many arms and similar results are proven. This is the first algorithm that achieves such model selection guarantees. We further verify our results via synthetic and real-data experiments.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا