Measurable Monte Carlo Search Error Bounds

187 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل John Mern

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف John Mern - Mykel J. Kochenderfer

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Monte Carlo planners can often return sub-optimal actions, even if they are guaranteed to converge in the limit of infinite samples. Known asymptotic regret bounds do not provide any way to measure confidence of a recommended action at the conclusion of search. In this work, we prove bounds on the sub-optimality of Monte Carlo estimates for non-stationary bandits and Markov decision processes. These bounds can be directly computed at the conclusion of the search and do not require knowledge of the true action-value. The presented bound holds for general Monte Carlo solvers meeting mild convergence conditions. We empirically test the tightness of the bounds through experiments on a multi-armed bandit and a discrete Markov decision process for both a simple solver and Monte Carlo tree search.

قيم البحث

اقرأ أيضاً

Quasi-Monte Carlo Software

103 - Sou-Cheng T. Choi , Fred J. Hickernell , R. Jagadeeswaran 2021

Practitioners wishing to experience the efficiency gains from using low discrepancy sequences need correct, well-written software. This article, based on our MCQMC 2020 tutorial, describes some of the better quasi-Monte Carlo (QMC) software available . We highlight the key software components required to approximate multivariate integrals or expectations of functions of vector random variables by QMC. We have combined these components in QMCPy, a Python open source library, which we hope will draw the support of the QMC community. Here we introduce QMCPy.

البرمجيات الرياضية التحليل العددي التحليل العددي

Multiple Policy Value Monte Carlo Tree Search

168 - Li-Cheng Lan , Wei Li , Ting-Han Wei 2019

Many of the strongest game playing programs use a combination of Monte Carlo tree search (MCTS) and deep neural networks (DNN), where the DNNs are used as policy or value evaluators. Given a limited budget, such as online playing or during the self-p lay phase of AlphaZero (AZ) training, a balance needs to be reached between accurate state estimation and more MCTS simulations, both of which are critical for a strong game playing agent. Typically, larger DNNs are better at generalization and accurate evaluation, while smaller DNNs are less costly, and therefore can lead to more MCTS simulations and bigger search trees with the same budget. This paper introduces a new method called the multiple policy value MCTS (MPV-MCTS), which combines multiple policy value neural networks (PV-NNs) of various sizes to retain advantages of each network, where two PV-NNs f_S and f_L are used in this paper. We show through experiments on the game NoGo that a combined f_S and f_L MPV-MCTS outperforms single PV-NN with policy value MCTS, called PV-MCTS. Additionally, MPV-MCTS also outperforms PV-MCTS for AZ training.

الذكاء الاصطناعي

Monte Carlo Simulation of SDEs using GANs

116 - Jorino van Rhijn , Cornelis W. Oosterlee , Lech A. Grzelak 2021

Generative adversarial networks (GANs) have shown promising results when applied on partial differential equations and financial time series generation. We investigate if GANs can also be used to approximate one-dimensional Ito stochastic differentia l equations (SDEs). We propose a scheme that approximates the path-wise conditional distribution of SDEs for large time steps. Standard GANs are only able to approximate processes in distribution, yielding a weak approximation to the SDE. A conditional GAN architecture is proposed that enables strong approximation. We inform the discriminator of this GAN with the map between the prior input to the generator and the corresponding output samples, i.e. we introduce a `supervised GAN. We compare the input-output map obtained with the standard GAN and supervised GAN and show experimentally that the standard GAN may fail to provide a path-wise approximation. The GAN is trained on a dataset obtained with exact simulation. The architecture was tested on geometric Brownian motion (GBM) and the Cox-Ingersoll-Ross (CIR) process. The supervised GAN outperformed the Euler and Milstein schemes in strong error on a discretisation with large time steps. It also outperformed the standard conditional GAN when approximating the conditional distribution. We also demonstrate how standard GANs may give rise to non-parsimonious input-output maps that are sensitive to perturbations, which motivates the need for constraints and regularisation on GAN generators.

التعلم الآلي التحليل العددي التحليل العددي

Error in Monte Carlo, quasi-error in Quasi-Monte Carlo

155 - R. H. Kleiss , A. Lazopoulos 2005

While the Quasi-Monte Carlo method of numerical integration achieves smaller integration error than standard Monte Carlo, its use in particle physics phenomenology has been hindered by the abscence of a reliable way to estimate that error. The standa rd Monte Carlo error estimator relies on the assumption that the points are generated independently of each other and, therefore, fails to account for the error improvement advertised by the Quasi-Monte Carlo method. We advocate the construction of an estimator of stochastic nature, based on the ensemble of pointsets with a particular discrepancy value. We investigate the consequences of this choice and give some first empirical results on the suggested estimators.

فيزياء الطاقة العالية - الظواهر

Generalized Mean Estimation in Monte-Carlo Tree Search

84 - Tuan Dam , Pascal Klink , Carlo DEramo 2019

We consider Monte-Carlo Tree Search (MCTS) applied to Markov Decision Processes (MDPs) and Partially Observable MDPs (POMDPs), and the well-known Upper Confidence bound for Trees (UCT) algorithm. In UCT, a tree with nodes (states) and edges (actions) is incrementally built by the expansion of nodes, and the values of nodes are updated through a backup strategy based on the average value of child nodes. However, it has been shown that with enough samples the maximum operator yields more accurate node value estimates than averaging. Instead of settling for one of these value estimates, we go a step further proposing a novel backup strategy which uses the power mean operator, which computes a value between the average and maximum value. We call our new approach Power-UCT, and argue how the use of the power mean operator helps to speed up the learning in MCTS. We theoretically analyze our method providing guarantees of convergence to the optimum. Finally, we empirically demonstrate the effectiveness of our method in well-known MDP and POMDP benchmarks, showing significant improvement in performance and convergence speed w.r.t. state of the art algorithms.

الذكاء الاصطناعي