This paper addresses the problem of optimal control using search trees. We start by considering multi-armed bandit problems with continuous action spaces and propose LD-HOO, a limited-depth variant of the hierarchical optimistic optimization (HOO) algorithm. We provide a regret analysis for LD-HOO and show that, asymptotically, our algorithm exhibits the same cumulative regret as the original HOO while being faster and more memory efficient. We then propose a Monte Carlo tree search algorithm based on LD-HOO for optimal control problems and illustrate the resulting approach's application to several optimal control problems.
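To make the depth-limited idea concrete, the following Python sketch runs HOO on the action interval [0, 1] with binary splits and simply stops splitting cells at a hypothetical cap `max_depth`, so depth-capped leaves are re-sampled rather than refined. The function names (`ld_hoo`, `update_b`), the smoothness parameters `nu` and `rho`, uniform sampling inside a cell, and the toy reward are assumptions for illustration; the paper's exact LD-HOO rules may differ.

```python
import math
import random


class Node:
    """A cell [lo, hi) of the action interval at a given tree depth."""

    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.count = 0        # number of pulls routed through this cell
        self.mean = 0.0       # empirical mean reward of those pulls
        self.b = math.inf     # optimistic B-value (+inf while unvisited)
        self.children = None  # created lazily when the cell is split


def update_b(node, t, nu, rho):
    """Backward pass: recompute B = min(U, max child B) over the whole subtree."""
    if node.count == 0:
        node.b = math.inf
        return node.b
    u = node.mean + math.sqrt(2.0 * math.log(t) / node.count) + nu * rho ** node.depth
    if node.children is None:
        node.b = u  # leaf, including cells that reached the depth cap
    else:
        node.b = min(u, max(update_b(c, t, nu, rho) for c in node.children))
    return node.b


def ld_hoo(f, n_rounds, max_depth, nu=1.0, rho=0.5, seed=0):
    """Depth-limited HOO sketch; returns the root, the best sampled arm and its reward."""
    rng = random.Random(seed)
    root = Node(0.0, 1.0, 0)
    best_x, best_r = None, -math.inf
    for t in range(1, n_rounds + 1):
        update_b(root, t, nu, rho)
        node, path = root, [root]
        # Follow the most optimistic child until reaching a new cell or the depth cap.
        while node.count > 0 and node.depth < max_depth:
            if node.children is None:
                mid = 0.5 * (node.lo + node.hi)
                node.children = [Node(node.lo, mid, node.depth + 1),
                                 Node(mid, node.hi, node.depth + 1)]
            top = max(c.b for c in node.children)
            node = rng.choice([c for c in node.children if c.b == top])
            path.append(node)
        x = rng.uniform(node.lo, node.hi)  # pull an arm inside the selected cell
        r = f(x)
        for v in path:  # update statistics along the traversed path
            v.count += 1
            v.mean += (r - v.mean) / v.count
        if r > best_r:
            best_x, best_r = x, r
    return root, best_x, best_r
```

For example, `ld_hoo(lambda a: 1 - abs(a - 0.3), n_rounds=500, max_depth=8)` concentrates its samples near 0.3. Because no cell below `max_depth` is ever created, the tree in this sketch never holds more than 2^(max_depth+1) - 1 nodes however many rounds are played, which is one way to see where a depth limit saves memory.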
The celebrated Monte Carlo method estimates an expensive-to-compute quantity by random sampling. Bandit-based Monte Carlo optimization is a general technique for computing the minimum of many such expensive-to-compute quantities by adaptive random sampling. The technique converts an optimization problem into a statistical estimation problem, which is then solved via multi-armed bandits. We apply this technique to the problem of high-dimensional $k$-nearest neighbors, developing an algorithm that we prove identifies exact nearest neighbors with high probability. We show that, under regularity assumptions on a dataset of $n$ points in $d$-dimensional space, the complexity of our algorithm scales logarithmically with the dimension of the data as $O\left((n+d)\log^2\left(\frac{nd}{\delta}\right)\right)$ for error probability $\delta$, rather than linearly as in exact computation, which requires $O(nd)$. We corroborate our theoretical results with numerical simulations, showing that our algorithm outperforms both exact computation and state-of-the-art algorithms such as kGraph, NGT, and LSH on real datasets.
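The bandit reduction for a single query can be sketched in Python: each candidate point is an arm, pulling an arm samples one coordinate uniformly at random and observes that coordinate's squared difference, and arms are eliminated once their confidence intervals rule them out. This is a successive-elimination sketch, not necessarily the paper's exact algorithm; the function name `bandit_knn`, the batch size, the Hoeffding-style radius (valid when coordinates lie in [0, 1]), and the fall-back to exact computation are assumptions.

```python
import math
import random


def bandit_knn(query, points, k, delta=0.01, batch=16, sigma=0.5, seed=0):
    """Return indices of the k points nearest to `query` (squared Euclidean).

    Each point's distance is d times the mean of (q_j - x_j)^2 over coordinates,
    estimated from uniformly sampled coordinates. sigma = 0.5 is the
    sub-Gaussian parameter of a value bounded in [0, 1] (Hoeffding's lemma).
    """
    rng = random.Random(seed)
    n, d = len(points), len(query)
    total = [0.0] * n    # sum of sampled squared differences per point
    pulls = [0] * n      # number of sampled coordinates per point
    exact = [None] * n   # exact per-coordinate mean, once computed
    log_term = math.log(2 * n * d / delta)  # union bound over arms and pulls

    def est(i):
        return exact[i] if exact[i] is not None else total[i] / pulls[i]

    def rad(i):
        return 0.0 if exact[i] is not None else sigma * math.sqrt(2 * log_term / pulls[i])

    alive = list(range(n))
    while True:
        for i in alive:
            if exact[i] is not None:
                continue
            if pulls[i] + batch >= d:
                # Sampling would cost about as much as a full scan: compute exactly.
                exact[i] = sum((q - x) ** 2 for q, x in zip(query, points[i])) / d
            else:
                for _ in range(batch):
                    j = rng.randrange(d)
                    total[i] += (query[j] - points[i][j]) ** 2
                pulls[i] += batch
        # Keep every arm whose lower bound is below the k-th smallest upper bound.
        kth_ucb = sorted(est(i) + rad(i) for i in alive)[k - 1]
        alive = [i for i in alive if est(i) - rad(i) <= kth_ucb]
        if len(alive) <= k or all(exact[i] is not None for i in alive):
            return sorted(alive, key=est)[:k]
```

The k arms with the smallest upper bounds always survive an elimination step, so at least k candidates remain; the saving over an O(nd) scan comes from arms eliminated before all d coordinates are read.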
Sample-based planning is a powerful family of algorithms for generating intelligent behavior from a model of the environment. Generating good candidate actions is critical to the success of sample-based planners, particularly in continuous or large action spaces. Typically, candidate action generation exhaustively enumerates the action space, uses domain knowledge, or, more recently, involves learning a stochastic policy to provide such search guidance. In this paper we explore explicitly learning a candidate action generator by optimizing a novel objective, marginal utility. The marginal utility of an action generator measures the increase in value of an action over previously generated actions. We validate our approach in both curling, a challenging stochastic domain with continuous state and action spaces, and a location game with a discrete but large action space. We show that a generator trained with the marginal utility objective outperforms hand-coded schemes built on substantial domain knowledge, trained stochastic policies, and other natural objectives for generating actions for sample-based planners.
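As a worked illustration of marginal utility, the Python sketch below scores a sequence of generated actions by how much each one raises the best action value found so far. The single state, the known action values `q_values`, the baseline of 0 before any action is generated, and the name `marginal_utilities` are assumptions for illustration; the paper's objective is defined for learned generators and may differ in detail.

```python
def marginal_utilities(q_values, baseline=0.0):
    """Gain of each generated action over the best previously generated one.

    q_values: action values Q(s, a_1), ..., Q(s, a_m) in generation order.
    The gains telescope, so their sum equals max(q_values) - baseline
    whenever max(q_values) >= baseline.
    """
    gains, best = [], baseline
    for q in q_values:
        gains.append(max(0.0, q - best))  # no credit for actions that do not improve
        best = max(best, q)
    return gains


print(marginal_utilities([0.25, 0.5, 0.375, 1.0]))  # [0.25, 0.25, 0.0, 0.5]
```

The third action earns nothing because a better action was already generated, which is the property that rewards a generator for proposing diverse, complementary candidates rather than repeating a good one.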
Urban traffic scenarios often require a high degree of cooperation between traffic participants to ensure safety and efficiency. By observing the behavior of others, humans infer whether or not others are cooperating. This work aims to extend the capabilities of automated vehicles, enabling them to cooperate implicitly in heterogeneous environments. Continuous actions allow for arbitrary trajectories and hence are applicable to a much wider class of problems than existing cooperative approaches with discrete action spaces. Based on cooperative modeling of other agents, Monte Carlo Tree Search (MCTS) in conjunction with Decoupled-UCT evaluates the action-values of each agent in a cooperative and decentralized way, respecting the interdependence of actions among traffic participants. The extension to continuous action spaces is addressed by incorporating novel MCTS-specific enhancements for efficient search space exploration. The proposed algorithm is evaluated in different scenarios, showing that it achieves effective cooperative planning and generates solutions that egocentric planning fails to identify.
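The decoupled selection step of Decoupled-UCT can be sketched in Python: at a node, each agent keeps its own visit counts and value estimates, picks its action with UCB1 independently, and the chosen actions are combined into a joint action. This sketch assumes a small discrete action set per agent and one reward per agent per simulation; the class name `DecoupledNode` is hypothetical, and the paper's continuous-action enhancements and cooperative reward model are not shown.

```python
import math


class DecoupledNode:
    """One MCTS node with per-agent UCB1 statistics (Decoupled-UCT)."""

    def __init__(self, n_actions_per_agent, c=math.sqrt(2)):
        self.c = c
        self.visits = 0
        self.counts = [[0] * m for m in n_actions_per_agent]
        self.values = [[0.0] * m for m in n_actions_per_agent]

    def select(self):
        """Each agent chooses independently; the result is a joint action tuple."""
        joint = []
        for counts, values in zip(self.counts, self.values):
            untried = [a for a, n in enumerate(counts) if n == 0]
            if untried:
                joint.append(untried[0])  # try every action once first
                continue
            ucb = [v + self.c * math.sqrt(math.log(self.visits) / n)
                   for v, n in zip(values, counts)]
            joint.append(max(range(len(ucb)), key=ucb.__getitem__))
        return tuple(joint)

    def update(self, joint, rewards):
        """Back up one simulation: agent i's reward updates only its own action."""
        self.visits += 1
        for i, (a, r) in enumerate(zip(joint, rewards)):
            self.counts[i][a] += 1
            self.values[i][a] += (r - self.values[i][a]) / self.counts[i][a]
```

In a full search, one such node would sit at every tree state, the joint action returned by `select` would be applied to a simulator, and `update` would be called on each node along the visited path.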
Online solvers for partially observable Markov decision processes (POMDPs) have difficulty scaling to problems with large action spaces. Monte Carlo tree search with progressive widening attempts to improve scaling by sampling from the action space to construct a policy search tree. The performance of progressive widening search depends on the action sampling policy, often requiring problem-specific samplers. In this work, we present a general method for efficient action sampling based on Bayesian optimization. The proposed method uses a Gaussian process to model a belief over the action-value function and selects the action that will maximize the expected improvement in the optimal action value. We implement the proposed approach in a new online tree search algorithm called Bayesian Optimized Monte Carlo Planning (BOMCP). Several experiments show that BOMCP scales better to POMDPs with large action spaces than existing state-of-the-art tree search solvers.
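A rough Python sketch of the two ingredients follows: a progressive widening test that allows a new child action only while the number of children is at most k·N^α, and an action sampler that fits a Gaussian process to the (action, value estimate) pairs already in a node and proposes the candidate with the largest expected improvement over the current best estimate. The scalar actions, the zero-mean RBF kernel and its length scale, the noise level, the candidate grid, the constants `k` and `alpha`, and all function names are assumptions; BOMCP's actual model and optimizer may differ.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for the EI pdf/cdf


def should_widen(n_visits, n_children, k=1.0, alpha=0.5):
    """Progressive widening: allow a new child action while |children| <= k * N^alpha."""
    return n_children <= k * n_visits ** alpha


def rbf(a, b, length=0.2):
    """Squared-exponential kernel on scalar actions."""
    return math.exp(-((a - b) ** 2) / (2 * length ** 2))


def cholesky(m):
    """Lower-triangular L with L L^T = m, for symmetric positive definite m."""
    n = len(m)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(L[i][p] * L[j][p] for p in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Forward substitution: solve L x = b."""
    x = []
    for i, bi in enumerate(b):
        x.append((bi - sum(L[i][p] * x[p] for p in range(i))) / L[i][i])
    return x


def fit_gp(xs, ys, noise=1e-6):
    """Zero-mean GP over (action, value estimate) pairs; returns x -> (mean, std)."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    w = solve_lower(L, ys)  # L^{-1} y

    def posterior(x):
        v = solve_lower(L, [rbf(a, x) for a in xs])  # L^{-1} k(x)
        mean = sum(p * q for p, q in zip(v, w))      # k^T K^{-1} y
        var = max(rbf(x, x) - sum(p * p for p in v), 1e-12)
        return mean, math.sqrt(var)

    return posterior


def expected_improvement(mean, std, best):
    """EI = E[max(Y - best, 0)] for Y ~ N(mean, std^2)."""
    if std <= 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * _STD.cdf(z) + std * _STD.pdf(z)


def propose_action(xs, ys, candidates):
    """Candidate action with the largest EI over the best current value estimate."""
    post, best = fit_gp(xs, ys), max(ys)
    return max(candidates, key=lambda x: expected_improvement(*post(x), best))
```

In a tree search, `should_widen` would be checked on each visit to a node, and only when it returns true would `propose_action` be called to add a new child; EI trades off actions whose predicted value is high against actions whose value is still uncertain.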
Continuous-time random disturbances from renewable generation have a significant impact on power system dynamic behavior. To evaluate this impact accurately, the disturbances must be modeled as continuous-time random processes rather than as random variables that do not vary with time. Monte Carlo simulation (MCs) is a nonintrusive method for evaluating such impact; it can be performed with commercial power system simulation software and is easy for power utilities to use, but it is computationally cumbersome. Fast sampling methods such as Latin hypercube sampling (LHS) have been introduced to speed up the sampling of random variables, but they cannot be applied to sample continuous disturbances. To overcome this limitation, this paper proposes a fast MCs method that enables LHS to speed up the sampling of continuous disturbances, based on the Itô process model of the disturbances and the approximation of the Itô process by functions of independent normal random variables. A case study on the IEEE 39-Bus System shows that the proposed method converges 47.6 and 6.7 times faster than traditional MCs when evaluating the expectation and the variance of the system dynamic response, respectively.
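The core idea, writing a discretized Itô process as a function of independent standard normals so that LHS can stratify those normals, can be sketched in Python. The Ornstein-Uhlenbeck drift and diffusion, the Euler-Maruyama discretization, the function names `lhs_uniform` and `ou_paths`, and all parameter values are assumptions chosen for illustration; the paper's disturbance model and approximation scheme may differ.

```python
import random
from statistics import NormalDist


def lhs_uniform(n_samples, n_dims, rng):
    """Latin hypercube: in every dimension, exactly one sample per stratum [i/N, (i+1)/N)."""
    cols = []
    for _ in range(n_dims):
        perm = list(range(n_samples))
        rng.shuffle(perm)
        # Jitter strictly inside each stratum so inv_cdf never sees 0 or 1.
        cols.append([(p + rng.uniform(1e-12, 1 - 1e-12)) / n_samples for p in perm])
    return [[cols[d][s] for d in range(n_dims)] for s in range(n_samples)]


def ou_paths(n_samples, n_steps, dt=0.01, theta=1.0, mu=0.0, sigma=0.3, x0=1.0, seed=0):
    """Euler-Maruyama paths of dX = theta (mu - X) dt + sigma dW.

    Each path is a deterministic function of n_steps independent N(0, 1)
    variables, so LHS over those variables stratifies the continuous disturbance.
    """
    rng = random.Random(seed)
    inv = NormalDist().inv_cdf
    paths = []
    for u_row in lhs_uniform(n_samples, n_steps, rng):
        x, path = x0, [x0]
        for u in u_row:
            z = inv(u)  # stratified standard normal increment for this step
            x = x + theta * (mu - x) * dt + sigma * (dt ** 0.5) * z
            path.append(x)
        paths.append(path)
    return paths
```

Each simulated path would then drive one run of the power system model; because every time step's normal variable is stratified across runs, sample statistics of the response settle with fewer runs than with independent random draws.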