ترغب بنشر مسار تعليمي؟ اضغط هنا

Efficient Online-Bandit Strategies for Minimax Learning Problems

90   0   0.0 ( 0 )
 نشر من قبل Christophe Roux
 تاريخ النشر 2021
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Several learning problems involve solving min-max problems, e.g., empirical distributional robust learning or learning with non-standard aggregated losses. More specifically, these problems are convex-linear problems where the minimization is carried out over the model parameters $winmathcal{W}$ and the maximization over the empirical distribution $pinmathcal{K}$ of the training set indexes, where $mathcal{K}$ is the simplex or a subset of it. To design efficient methods, we let an online learning algorithm play against a (combinatorial) bandit algorithm. We argue that the efficiency of such approaches critically depends on the structure of $mathcal{K}$ and propose two properties of $mathcal{K}$ that facilitate designing efficient algorithms. We focus on a specific family of sets $mathcal{S}_{n,k}$ encompassing various learning applications and provide high-probability convergence guarantees to the minimax values.



قيم البحث

اقرأ أيضاً

Multiview representation learning is very popular for latent factor analysis. It naturally arises in many data analysis, machine learning, and information retrieval applications to model dependent structures among multiple data sources. For computati onal convenience, existing approaches usually formulate the multiview representation learning as convex optimization problems, where global optima can be obtained by certain algorithms in polynomial time. However, many pieces of evidence have corroborated that heuristic nonconvex approaches also have good empirical computational performance and convergence to the global optima, although there is a lack of theoretical justification. Such a gap between theory and practice motivates us to study a nonconvex formulation for multiview representation learning, which can be efficiently solved by a simple stochastic gradient descent (SGD) algorithm. We first illustrate the geometry of the nonconvex formulation; Then, we establish asymptotic global rates of convergence to the global optima by diffusion approximations. Numerical experiments are provided to support our theory.
68 - Tor Lattimore 2021
We analyse adversarial bandit convex optimisation with an adversary that is restricted to playing functions of the form $f_t(x) = g_t(langle x, thetarangle)$ for convex $g_t : mathbb R to mathbb R$ and unknown $theta in mathbb R^d$ that is homogeneou s over time. We provide a short information-theoretic proof that the minimax regret is at most $O(d sqrt{n} log(n operatorname{diam}(mathcal K)))$ where $n$ is the number of interactions, $d$ the dimension and $operatorname{diam}(mathcal K)$ is the diameter of the constraint set.
We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$gamma$, which is based on the emph{optimism in the face of uncertainty principle} a nd the Bernstein-type bonus. We show that UCBVI-$gamma$ achieves an $tilde{O}big({sqrt{SAT}}/{(1-gamma)^{1.5}}big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $tilde{Omega}big({sqrt{SAT}}/{(1-gamma)^{1.5}}big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$gamma$ is nearly minimax optimal for discounted MDPs.
We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and th e learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named $text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs in the episodic undiscounted setting. We show that $text{UCRL-VTR}^{+}$ attains an $tilde O(dHsqrt{T})$ regret where $d$ is the dimension of feature mapping, $H$ is the length of the episode and $T$ is the number of interactions with the MDP. We also prove a matching lower bound $Omega(dHsqrt{T})$ for this setting, which shows that $text{UCRL-VTR}^{+}$ is minimax optimal up to logarithmic factors. In addition, we propose the $text{UCLK}^{+}$ algorithm for the same family of MDPs under discounting and show that it attains an $tilde O(dsqrt{T}/(1-gamma)^{1.5})$ regret, where $gammain [0,1)$ is the discount factor. Our upper bound matches the lower bound $Omega(dsqrt{T}/(1-gamma)^{1.5})$ proved by Zhou et al. (2020) up to logarithmic factors, suggesting that $text{UCLK}^{+}$ is nearly minimax optimal. To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
Online learning is a powerful tool for analyzing iterative algorithms. However, the classic adversarial setup sometimes fails to capture certain regularity in online problems in practice. Motivated by this, we establish a new setup, called Continuous Online Learning (COL), where the gradient of online loss function changes continuously across rounds with respect to the learners decisions. We show that COL covers and more appropriately describes many interesting applications, from general equilibrium problems (EPs) to optimization in episodic MDPs. Using this new setup, we revisit the difficulty of achieving sublinear dynamic regret. We prove that there is a fundamental equivalence between achieving sublinear dynamic regret in COL and solving certain EPs, and we present a reduction from dynamic regret to both static regret and convergence rate of the associated EP. At the end, we specialize these new insights into online imitation learning and show improved understanding of its learning stability.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا