Evaluation Function Approximation for Scrabble

332 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Rishabh Agarwal

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Rishabh Agarwal

الذكاء الاصطناعي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The current state-of-the-art Scrabble agents are not learning-based but depend on truncated Monte Carlo simulations and the quality of such agents is contingent upon the time available for running the simulations. This thesis takes steps towards building a learning-based Scrabble agent using self-play. Specifically, we try to find a better function approximation for the static evaluation function used in Scrabble which determines the move goodness at a given board configuration. In this work, we experimented with evolutionary algorithms and Bayesian Optimization to learn the weights for an approximate feature-based evaluation function. However, these optimization methods were not quite effective, which lead us to explore the given problem from an Imitation Learning point of view. We also tried to imitate the ranking of moves produced by the Quackle simulation agent using supervised learning with a neural network function approximator which takes the raw representation of the Scrabble board as the input instead of using only a fixed number of handcrafted features.

قيم البحث

126 - Gal Dalal , Balazs Szorenyi , Gugan Thoppe 2017

TD(0) is one of the most commonly used algorithms in reinforcement learning. Despite this, there is no existing finite sample analysis for TD(0) with function approximation, even for the linear case. Our work is the first to provide such results. Exi sting convergence rates for Temporal Difference (TD) methods apply only to somewhat modifi

الذكاء الاصطناعي

Effective Multi-step Temporal-Difference Learning for Non-Linear Function Approximation

204 - Harm van Seijen 2016

Multi-step temporal-difference (TD) learning, where the update targets contain information from multiple time steps ahead, is one of the most popular forms of TD learning for linear function approximation. The reason is that multi-step methods often yield substantially better performance than their single-step counter-parts, due to a lower bias of the update targets. For non-linear function approximation, however, single-step methods appear to be the norm. Part of the reason could be that on many domains the popular multi-step methods TD($lambda$) and Sarsa($lambda$) do not perform well when combined with non-linear function approximation. In particular, they are very susceptible to divergence of value estimates. In this paper, we identify the reason behind this. Furthermore, based on our analysis, we propose a new multi-step TD method for non-linear function approximation that addresses this issue. We confirm the effectiveness of our method using two benchmark tasks with neural networks as function approximation.

الذكاء الاصطناعي

Average-Reward Off-Policy Policy Evaluation with Function Approximation

140 - Shangtong Zhang , Yi Wan , Richard S. Sutton 2021

We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off -policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.

التعلم الآلي الذكاء الاصطناعي

Variance-Aware Off-Policy Evaluation with Linear Function Approximation

175 - Yifei Min , Tianhao Wang , Dongruo Zhou 2021

We study the off-policy evaluation (OPE) problem in reinforcement learning with linear function approximation, which aims to estimate the value function of a target policy based on the offline data collected by a behavior policy. We propose to incorp orate the variance information of the value function to improve the sample efficiency of OPE. More specifically, for time-inhomogeneous episodic linear Markov decision processes (MDPs), we propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a tighter error bound than the best-known result. We also provide a fine-grained characterization of the distribution shift between the behavior policy and the target policy. Extensive numerical experiments corroborate our theory.

التعلم الآلي التحسين والتحكم التعلم الالي

Alternative Function Approximation Parameterizations for Solving Games: An Analysis of $f$-Regression Counterfactual Regret Minimization

170 - Ryan DOrazio , Dustin Morrill , James R. Wright 2019

Function approximation is a powerful approach for structuring large decision problems that has facilitated great achievements in the areas of reinforcement learning and game playing. Regression counterfactual regret minimization (RCFR) is a simple al gorithm for approximately solving imperfect information games with normalized rectified linear unit (ReLU) parameterized policies. In contrast, the more conventional softmax parameterization is standard in the field of reinforcement learning and yields a regret bound with a better dependence on the number of actions. We derive approximation error-aware regret bounds for $(Phi, f)$-regret matching, which applies to a general class of link functions and regret objectives. These bounds recover a tighter bound for RCFR and provide a theoretical justification for RCFR implementations with alternative policy parameterizations ($f$-RCFR), including softmax. We provide exploitability bounds for $f$-RCFR with the polynomial and exponential link functions in zero-sum imperfect information games and examine empirically how the link function interacts with the severity of the approximation. We find that the previously studied ReLU parameterization performs better when the approximation error is small while the softmax parameterization can perform better when the approximation error is large.

الذكاء الاصطناعي علوم الكمبيوتر ونظرية الألعاب التعلم الآلي