ترغب بنشر مسار تعليمي؟ اضغط هنا

Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis

101   0   0.0 ( 0 )
 نشر من قبل Yuejie Chi
 تاريخ النشر 2021
والبحث باللغة English




اسأل ChatGPT حول البحث

Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. When it comes to the synchronous setting (such that independent samples for all state-action pairs are drawn from a generative model in each iteration), substantial progress has been made recently towards understanding the sample efficiency of Q-learning. Take a $gamma$-discounted infinite-horizon MDP with state space $mathcal{S}$ and action space $mathcal{A}$: to yield an entrywise $varepsilon$-accurate estimate of the optimal Q-function, state-of-the-art theory for Q-learning proves that a sample size on the order of $frac{|mathcal{S}||mathcal{A}|}{(1-gamma)^5varepsilon^{2}}$ is sufficient, which, however, fails to match with the existing minimax lower bound. This gives rise to natural questions: what is the sharp sample complexity of Q-learning? Is Q-learning provably sub-optimal? In this work, we settle these questions by (1) demonstrating that the sample complexity of Q-learning is at most on the order of $frac{|mathcal{S}||mathcal{A}|}{(1-gamma)^4varepsilon^2}$ (up to some log factor) for any $0<varepsilon <1$, and (2) developing a matching lower bound to confirm the sharpness of our result. Our findings unveil both the effectiveness and limitation of Q-learning: its sample complexity matches that of speedy Q-learning without requiring extra computation and storage, albeit still being considerably higher than the minimax lower bound.



قيم البحث

اقرأ أيضاً

107 - Farzan Farnia , David Tse 2016
Given a task of predicting $Y$ from $X$, a loss function $L$, and a set of probability distributions $Gamma$ on $(X,Y)$, what is the optimal decision rule minimizing the worst-case expected loss over $Gamma$? In this paper, we address this question b y introducing a generalization of the principle of maximum entropy. Applying this principle to sets of distributions with marginal on $X$ constrained to be the empirical marginal from the data, we develop a general minimax approach for supervised learning problems. While for some loss functions such as squared-error and log loss, the minimax approach rederives well-knwon regression models, for the 0-1 loss it results in a new linear classifier which we call the maximum entropy machine. The maximum entropy machine minimizes the worst-case 0-1 loss over the structured set of distribution, and by our numerical experiments can outperform other well-known linear classifiers such as SVM. We also prove a bound on the generalization worst-case error in the minimax approach.
$ ewcommand{eps}{varepsilon} $In learning theory, the VC dimension of a concept class $C$ is the most common way to measure its richness. In the PAC model $$ ThetaBig(frac{d}{eps} + frac{log(1/delta)}{eps}Big) $$ examples are necessary and sufficien t for a learner to output, with probability $1-delta$, a hypothesis $h$ that is $eps$-close to the target concept $c$. In the related agnostic model, where the samples need not come from a $cin C$, we know that $$ ThetaBig(frac{d}{eps^2} + frac{log(1/delta)}{eps^2}Big) $$ examples are necessary and sufficient to output an hypothesis $hin C$ whose error is at most $eps$ worse than the best concept in $C$. Here we analyze quantum sample complexity, where each example is a coherent quantum state. This model was introduced by Bshouty and Jackson, who showed that quantum examples are more powerful than classical examples in some fixed-distribution settings. However, Atici and Servedio, improved by Zhang, showed that in the PAC setting, quantum examples cannot be much more powerful: the required number of quantum examples is $$ OmegaBig(frac{d^{1-eta}}{eps} + d + frac{log(1/delta)}{eps}Big)mbox{ for all }eta> 0. $$ Our main result is that quantum and classical sample complexity are in fact equal up to constant factors in both the PAC and agnostic models. We give two approaches. The first is a fairly simple information-theoretic argument that yields the above two classical bounds and yields the same bounds for quantum sample complexity up to a $log(d/eps)$ factor. We then give a second approach that avoids the log-factor loss, based on analyzing the behavior of the Pretty Good Measurement on the quantum state identification problems that correspond to learning. This shows classical and quantum sample complexity are equal up to constant factors.
Here we propose a general theoretical method for analyzing the risk bound in the presence of adversaries. Specifically, we try to fit the adversarial learning problem into the minimax framework. We first show that the original adversarial learning pr oblem can be reduced to a minimax statistical learning problem by introducing a transport map between distributions. Then, we prove a new risk bound for this minimax problem in terms of covering numbers under a weak version of Lipschitz condition. Our method can be applied to multi-class classification problems and commonly used loss functions such as the hinge and ramp losses. As some illustrative examples, we derive the adversarial risk bounds for SVMs, deep neural networks, and PCA, and our bounds have two data-dependent terms, which can be optimized for achieving adversarial robustness.
We provide high probability finite sample complexity guarantees for hidden non-parametric structure learning of tree-shaped graphical models, whose hidden and observable nodes are discrete random variables with either finite or countable alphabets. W e study a fundamental quantity called the (noisy) information threshold, which arises naturally from the error analysis of the Chow-Liu algorithm and, as we discuss, provides explicit necessary and sufficient conditions on sample complexity, by effectively summarizing the difficulty of the tree-structure learning problem. Specifically, we show that the finite sample complexity of the Chow-Liu algorithm for ensuring exact structure recovery from noisy data is inversely proportional to the information threshold squared (provided it is positive), and scales almost logarithmically relative to the number of nodes over a given probability of failure. Conversely, we show that, if the number of samples is less than an absolute constant times the inverse of information threshold squared, then no algorithm can recover the hidden tree structure with probability greater than one half. As a consequence, our upper and lower bounds match with respect to the information threshold, indicating that it is a fundamental quantity for the problem of learning hidden tree-structured models. Further, the Chow-Liu algorithm with noisy data as input achieves the optimal rate with respect to the information threshold. Lastly, as a byproduct of our analysis, we resolve the problem of tree structure learning in the presence of non-identically distributed observation noise, providing conditions for convergence of the Chow-Liu algorithm under this setting, as well.
We address the problem of model selection for the finite horizon episodic Reinforcement Learning (RL) problem where the transition kernel $P^*$ belongs to a family of models $mathcal{P}^*$ with finite metric entropy. In the model selection framework, instead of $mathcal{P}^*$, we are given $M$ nested families of transition kernels $cP_1 subset cP_2 subset ldots subset cP_M$. We propose and analyze a novel algorithm, namely emph{Adaptive Reinforcement Learning (General)} (texttt{ARL-GEN}) that adapts to the smallest such family where the true transition kernel $P^*$ lies. texttt{ARL-GEN} uses the Upper Confidence Reinforcement Learning (texttt{UCRL}) algorithm with value targeted regression as a blackbox and puts a model selection module at the beginning of each epoch. Under a mild separability assumption on the model classes, we show that texttt{ARL-GEN} obtains a regret of $Tilde{mathcal{O}}(d_{mathcal{E}}^*H^2+sqrt{d_{mathcal{E}}^* mathbb{M}^* H^2 T})$, with high probability, where $H$ is the horizon length, $T$ is the total number of steps, $d_{mathcal{E}}^*$ is the Eluder dimension and $mathbb{M}^*$ is the metric entropy corresponding to $mathcal{P}^*$. Note that this regret scaling matches that of an oracle that knows $mathcal{P}^*$ in advance. We show that the cost of model selection for texttt{ARL-GEN} is an additive term in the regret having a weak dependence on $T$. Subsequently, we remove the separability assumption and consider the setup of linear mixture MDPs, where the transition kernel $P^*$ has a linear function approximation. With this low rank structure, we propose novel adaptive algorithms for model selection, and obtain (order-wise) regret identical to that of an oracle with knowledge of the true model class.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا