Strong Polynomiality of the Value Iteration Algorithm for Computing Nearly Optimal Policies for Discounted Dynamic Programming

95 0 0.0 ( 0 )

Download Cite

Added by Eugene Feinberg

Publication date 2020

fields

and research's language is English

Authors Eugene A. Feinberg - Gaojin He

Optimization and Control

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

This note provides upper bounds on the number of operations required to compute by value iterations a nearly optimal policy for an infinite-horizon discounted Markov decision process with a finite number of states and actions. For a given discount factor, magnitude of the reward function, and desired closeness to optimality, these upper bounds are strongly polynomial in the number of state-action pairs, and one of the provided upper bounds has the property that it is a non-decreasing function of the value of the discount factor.

rate research

Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs

142 - Jiafan He , Dongruo Zhou , Quanquan Gu 2020

We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$gamma$, which is based on the emph{optimism in the face of uncertainty principle} and the Bernstein-type bonus. We show that UCBVI-$gamma$ achieves an $tilde{O}big({sqrt{SAT}}/{(1-gamma)^{1.5}}big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $tilde{Omega}big({sqrt{SAT}}/{(1-gamma)^{1.5}}big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$gamma$ is nearly minimax optimal for discounted MDPs.

Machine Learning Optimization and Control Machine Learning

Parallel Branch and Bound Algorithm for Computing Maximal Structured Singular Value

350 - Xinjia Chen , Kemin Zhou 2008

In this paper, we have developed a parallel branch and bound algorithm which computes the maximal structured singular value $mu$ without tightly bounding $mu$ for each frequency and thus significantly reduce the computational complexity.

Optimization and Control Dynamical Systems

An Accelerated Fitted Value Iteration Algorithm for MDPs with Finite and Vector-Valued Action Space

101 - Sixiang Zhao , William B. Haskell , Michel-Alexandre Cardin 2019

This paper studies an accelerated fitted value iteration (FVI) algorithm to solve high-dimensional Markov decision processes (MDPs). FVI is an approximate dynamic programming algorithm that has desirable theoretical properties. However, it can be intractable when the action space is finite but vector-valued. To solve such MDPs via FVI, we first approximate the value functions by a two-layer neural network (NN) with rectified linear units (ReLU) being activation functions. We then verify that such approximators are strong enough for the MDP. To speed up the FVI, we recast the action selection problem as a two-stage stochastic programming problem, where the resulting recourse function comes from the two-layer NN. Then, the action selection problem is solved with a specialized multi-cut decomposition algorithm. More specifically, we design valid cuts by exploiting the structure of the approximated value functions to update the actions. We prove that the decomposition can find the global optimal solution in a finite number of iterations and the overall accelerated FVI is consistent. Finally, we verify the performance of the FVI algorithm via a multi-facility capacity investment problem (MCIP). A comprehensive numerical study is implemented, where the results show that the FVI is significantly accelerated without sacrificing too much in precision.

Optimization and Control

Dynamic Programming for General Linear Quadratic Optimal Stochastic Control with Random Coefficients

301 - Shanjian Tang 2014

We are concerned with the linear-quadratic optimal stochastic control problem with random coefficients. Under suitable conditions, we prove that the value field $V(t,x,omega), (t,x,omega)in [0,T]times R^ntimes Omega$, is quadratic in $x$, and has the following form: $V(t,x)=langle K_tx, xrangle$ where $K$ is an essentially bounded nonnegative symmetric matrix-valued adapted processes. Using the dynamic programming principle (DPP), we prove that $K$ is a continuous semi-martingale of the form $$K_t=K_0+int_0^t , dk_s+sum_{i=1}^dint_0^tL_s^i, dW_s^i, quad tin [0,T]$$ with $k$ being a continuous process of bounded variation and $$Eleft[left(int_0^T|L_s|^2, dsright)^pright] <infty, quad forall pge 2; $$ and that $(K, L)$ with $L:=(L^1, cdots, L^d)$ is a solution to the associated backward stochastic Riccati equation (BSRE), whose generator is highly nonlinear in the unknown pair of processes. The uniqueness is also proved via a localized completion of squares in a self-contained manner for a general BSRE. The existence and uniqueness of adapted solution to a general BSRE was initially proposed by the French mathematician J. M. Bismut (1976, 1978). It had been solved by the author (2003) via the stochastic maximum principle with a viewpoint of stochastic flow for the associated stochastic Hamiltonian system. The present paper is its companion, and gives the {it second but more comprehensive} adapted solution to a general BSRE via the DDP. Further extensions to the jump-diffusion control system and to the general nonlinear control system are possible.

Optimization and Control

Dynamic Programming and Linear Programming for Odds Problem

208 - Sachika Kurokawa , Tomomi Matsui 2021

This paper discusses the odds problem, proposed by Bruss in 2000, and its variants. A recurrence relation called a dynamic programming (DP) equation is used to find an optimal stopping policy of the odds problem and its variants. In 2013, Buchbinder, Jain, and Singh proposed a linear programming (LP) formulation for finding an optimal stopping policy of the classical secretary problem, which is a special case of the odds problem. The proposed linear programming problem, which maximizes the probability of a win, differs from the DP equations known for long time periods. This paper shows that an ordinary DP equation is a modification of the dual problem of linear programming including the LP formulation proposed by Buchbinder, Jain, and Singh.

Optimization and Control Applications