Efficient Online-Bandit Strategies for Minimax Learning Problems

Published by Christophe Roux

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Authors: Christophe Roux, Elias Wirth, Sebastian Pokutta

Abstract

Several learning problems involve solving min-max problems, e.g., empirical distributional robust learning or learning with non-standard aggregated losses. More specifically, these problems are convex-linear problems where the minimization is carried out over the model parameters $w \in \mathcal{W}$ and the maximization over the empirical distribution $p \in \mathcal{K}$ of the training set indexes, where $\mathcal{K}$ is the simplex or a subset of it. To design efficient methods, we let an online learning algorithm play against a (combinatorial) bandit algorithm. We argue that the efficiency of such approaches critically depends on the structure of $\mathcal{K}$ and propose two properties of $\mathcal{K}$ that facilitate designing efficient algorithms. We focus on a specific family of sets $\mathcal{S}_{n,k}$ encompassing various learning applications and provide high-probability convergence guarantees to the minimax values.
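
To make the "online learner versus bandit" idea concrete, the following is a minimal sketch and not the authors' algorithm. It assumes the objective has the form $\min_{w \in \mathcal{W}} \max_{p \in \mathcal{K}} \sum_i p_i \ell_i(w)$ with $\mathcal{K}$ the full simplex, and it swaps in two textbook methods: projected online gradient descent for $w$ and EXP3 for $p$. The squared loss, the unit-ball constraint, the step sizes, and the function name `minimax_ogd_vs_exp3` are all illustrative assumptions; the combinatorial sets $\mathcal{S}_{n,k}$ studied in the paper are not handled here.

```python
import numpy as np

def minimax_ogd_vs_exp3(X, y, T=2000, eta_w=0.05, eta_p=0.01, gamma=0.05, seed=0):
    """Sketch: projected OGD (min player) against EXP3 (max player).

    Assumes rows of X have norm <= 1, |y_i| <= 1 and w is kept in the
    unit L2 ball, so each loss 0.25 * (x_i @ w - y_i)**2 lies in [0, 1].
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    log_weights = np.zeros(n)          # EXP3 weights, kept in log space
    w_sum, p_sum = np.zeros(d), np.zeros(n)

    for _ in range(T):
        # EXP3 distribution over training indices, mixed with uniform exploration.
        z = np.exp(log_weights - log_weights.max())
        p = (1.0 - gamma) * z / z.sum() + gamma / n
        i = rng.choice(n, p=p)

        # Both players only see the sampled example (bandit feedback for p).
        residual = X[i] @ w - y[i]
        loss = 0.25 * residual ** 2
        grad = 0.5 * residual * X[i]

        w_sum += w
        p_sum += p

        # Max player: importance-weighted gain estimate for the sampled index.
        log_weights[i] += eta_p * loss / p[i]

        # Min player: gradient step, then projection onto the unit ball.
        w = w - eta_w * grad
        norm = np.linalg.norm(w)
        if norm > 1.0:
            w /= norm

    # Averaged iterates are the usual candidates for an approximate saddle point.
    return w_sum / T, p_sum / T
```

Each round costs one example and one gradient, which is the appeal of pairing a full-information learner with a bandit: the max player never needs the losses of all $n$ training points.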

113 - Zhehui Chen, Lin F. Yang, Chris J. Li 2017

Multiview representation learning is very popular for latent factor analysis. It naturally arises in many data analysis, machine learning, and information retrieval applications to model dependent structures among multiple data sources. For computational convenience, existing approaches usually formulate the multiview representation learning as convex optimization problems, where global optima can be obtained by certain algorithms in polynomial time. However, many pieces of evidence have corroborated that heuristic nonconvex approaches also have good empirical computational performance and convergence to the global optima, although there is a lack of theoretical justification. Such a gap between theory and practice motivates us to study a nonconvex formulation for multiview representation learning, which can be efficiently solved by a simple stochastic gradient descent (SGD) algorithm. We first illustrate the geometry of the nonconvex formulation; then, we establish asymptotic global rates of convergence to the global optima by diffusion approximations. Numerical experiments are provided to support our theory.

Machine Learning, Optimization and Control

Minimax Regret for Bandit Convex Optimisation of Ridge Functions

68 - Tor Lattimore 2021

We analyse adversarial bandit convex optimisation with an adversary that is restricted to playing functions of the form $f_t(x) = g_t(\langle x, \theta \rangle)$ for convex $g_t : \mathbb{R} \to \mathbb{R}$ and unknown $\theta \in \mathbb{R}^d$ that is homogeneous over time. We provide a short information-theoretic proof that the minimax regret is at most $O(d \sqrt{n} \log(n \operatorname{diam}(\mathcal{K})))$ where $n$ is the number of interactions, $d$ the dimension and $\operatorname{diam}(\mathcal{K})$ is the diameter of the constraint set.

Machine Learning, Optimization and Control

Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs

142 - Jiafan He, Dongruo Zhou, Quanquan Gu 2020

We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$\gamma$, which is based on the optimism in the face of uncertainty principle and the Bernstein-type bonus. We show that UCBVI-$\gamma$ achieves an $\tilde{O}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is nearly minimax optimal for discounted MDPs.

Machine Learning, Optimization and Control

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

82 - Dongruo Zhou, Quanquan Gu, Csaba Szepesvari 2020

We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs in the episodic undiscounted setting. We show that $\text{UCRL-VTR}^{+}$ attains an $\tilde{O}(dH\sqrt{T})$ regret where $d$ is the dimension of feature mapping, $H$ is the length of the episode and $T$ is the number of interactions with the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ for this setting, which shows that $\text{UCRL-VTR}^{+}$ is minimax optimal up to logarithmic factors. In addition, we propose the $\text{UCLK}^{+}$ algorithm for the same family of MDPs under discounting and show that it attains an $\tilde{O}(d\sqrt{T}/(1-\gamma)^{1.5})$ regret, where $\gamma \in [0,1)$ is the discount factor. Our upper bound matches the lower bound $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ proved by Zhou et al. (2020) up to logarithmic factors, suggesting that $\text{UCLK}^{+}$ is nearly minimax optimal. To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.

Machine Learning, Optimization and Control

Continuous Online Learning and New Insights to Online Imitation Learning

105 - Jonathan Lee, Ching-An Cheng, Ken Goldberg 2019

Online learning is a powerful tool for analyzing iterative algorithms. However, the classic adversarial setup sometimes fails to capture certain regularity in online problems in practice. Motivated by this, we establish a new setup, called Continuous Online Learning (COL), where the gradient of the online loss function changes continuously across rounds with respect to the learner's decisions. We show that COL covers and more appropriately describes many interesting applications, from general equilibrium problems (EPs) to optimization in episodic MDPs. Using this new setup, we revisit the difficulty of achieving sublinear dynamic regret. We prove that there is a fundamental equivalence between achieving sublinear dynamic regret in COL and solving certain EPs, and we present a reduction from dynamic regret to both static regret and convergence rate of the associated EP. Finally, we specialize these new insights to online imitation learning and show improved understanding of its learning stability.

Machine Learning, Optimization and Control