ترغب بنشر مسار تعليمي؟ اضغط هنا

Reward Biased Maximum Likelihood Estimation for Reinforcement Learning

71   0   0.0 ( 0 )
 نشر من قبل Akshay Mete
 تاريخ النشر 2020
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

The Reward-Biased Maximum Likelihood Estimate (RBMLE) for adaptive control of Markov chains was proposed to overcome the central obstacle of what is variously called the fundamental closed-identifiability problem of adaptive control, the dual control problem, or, contemporaneously, the exploration vs. exploitation problem. It exploited the key observation that since the maximum likelihood parameter estimator can asymptotically identify the closed-transition probabilities under a certainty equivalent approach, the limiting parameter estimates must necessarily have an optimal reward that is less than the optimal reward attainable for the true but unknown system. Hence it proposed a counteracting reverse bias in favor of parameters with larger optimal rewards, providing a solution to the fundamental problem alluded to above. It thereby proposed an optimistic approach of favoring parameters with larger optimal rewards, now known as optimism in the face of uncertainty. The RBMLE approach has been proved to be long-term average reward optimal in a variety of contexts. However, modern attention is focused on the much finer notion of regret, or finite-time performance. Recent analysis of RBMLE for multi-armed stochastic bandits and linear contextual bandits has shown that it not only has state-of-the-art regret, but it also exhibits empirical performance comparable to or better than the best current contenders, and leads to strikingly simple index policies. Motivated by this, we examine the finite-time performance of RBMLE for reinforcement learning tasks that involve the general problem of optimal control of unknown Markov Decision Processes. We show that it has a regret of $mathcal{O}( log T)$ over a time horizon of $T$ steps, similar to state-of-the-art algorithms. Simulation studies show that RBMLE outperforms other algorithms such as UCRL2 and Thompson Sampling.



قيم البحث

اقرأ أيضاً

Modifying the reward-biased maximum likelihood method originally proposed in the adaptive control literature, we propose novel learning algorithms to handle the explore-exploit trade-off in linear bandits problems as well as generalized linear bandit s problems. We develop novel index policies that we prove achieve order-optimality, and show that they achieve empirical performance competitive with the state-of-the-art benchmark methods in extensive experiments. The new policies achieve this with low computation time per pull for linear bandits, and thereby resulting in both favorable regret as well as computational efficiency.
Inspired by the Reward-Biased Maximum Likelihood Estimate method of adaptive control, we propose RBMLE -- a novel family of learning algorithms for stochastic multi-armed bandits (SMABs). For a broad range of SMABs including both the parametric Expon ential Family as well as the non-parametric sub-Gaussian/Exponential family, we show that RBMLE yields an index policy. To choose the bias-growth rate $alpha(t)$ in RBMLE, we reveal the nontrivial interplay between $alpha(t)$ and the regret bound that generally applies in both the Exponential Family as well as the sub-Gaussian/Exponential family bandits. To quantify the finite-time performance, we prove that RBMLE attains order-optimality by adaptively estimating the unknown constants in the expression of $alpha(t)$ for Gaussian and sub-Gaussian bandits. Extensive experiments demonstrate that the proposed RBMLE achieves empirical regret performance competitive with the state-of-the-art methods, while being more computationally efficient and scalable in comparison to the best-performing ones among them.
Reinforcement learning (RL) algorithms typically deal with maximizing the expected cumulative return (discounted or undiscounted, finite or infinite horizon). However, several crucial applications in the real world, such as drug discovery, do not fit within this framework because an RL agent only needs to identify states (molecules) that achieve the highest reward within a trajectory and does not need to optimize for the expected cumulative return. In this work, we formulate an objective function to maximize the expected maximum reward along a trajectory, derive a novel functional form of the Bellman equation, introduce the corresponding Bellman operators, and provide a proof of convergence. Using this formulation, we achieve state-of-the-art results on the task of molecule generation that mimics a real-world drug discovery pipeline.
Multimodal learning has achieved great successes in many scenarios. Compared with unimodal learning, it can effectively combine the information from different modalities to improve the performance of learning tasks. In reality, the multimodal data ma y have missing modalities due to various reasons, such as sensor failure and data transmission error. In previous works, the information of the modality-missing data has not been well exploited. To address this problem, we propose an efficient approach based on maximum likelihood estimation to incorporate the knowledge in the modality-missing data. Specifically, we design a likelihood function to characterize the conditional distribution of the modality-complete data and the modality-missing data, which is theoretically optimal. Moreover, we develop a generalized form of the softmax function to effectively implement maximum likelihood estimation in an end-to-end manner. Such training strategy guarantees the computability of our algorithm capably. Finally, we conduct a series of experiments on real-world multimodal datasets. Our results demonstrate the effectiveness of the proposed approach, even when 95% of the training data has missing modality.
Maximum likelihood constraint inference is a powerful technique for identifying unmodeled constraints that affect the behavior of a demonstrator acting under a known objective function. However, it was originally formulated only for discrete state-ac tion spaces. Continuous dynamics are more useful for modeling many real-world systems of interest, including the movements of humans and robots. We present a method to generate a tabular state-action space that approximates continuous dynamics and can be used for constraint inference on demonstrations that obey the true system dynamics. We then demonstrate accurate constraint inference on nonlinear pendulum systems with 2- and 4-dimensional state spaces, and show that performance is robust to a range of hyperparameters. The demonstrations are not required to be fully optimal with respect to the objective, and the most likely constraints can be identified even when demonstrations cover only a small portion of the state space. For these reasons, the proposed approach may be especially useful for inferring constraints on human demonstrators, which has important applications in human-robot interaction and biomechanical medicine.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا