Zeroth-order optimization (ZO) typically relies on two-point feedback to estimate the unknown gradient of the objective function. Nevertheless, two-point feedback cannot be used for online optimization of time-varying objective functions, where only a single query of the function value is possible at each time step. In this work, we propose a new one-point feedback method for online optimization that estimates the objective function gradient using the residual between two feedback points at consecutive time instants. Moreover, we develop regret bounds for ZO with residual feedback for both convex and nonconvex online optimization problems. Specifically, for both deterministic and stochastic problems and for both Lipschitz and smooth objective functions, we show that using residual feedback can produce gradient estimates with much smaller variance than conventional one-point feedback methods. As a result, our regret bounds are much tighter than existing regret bounds for ZO with conventional one-point feedback, which suggests that ZO with residual feedback can better track the optimizer of online optimization problems. Additionally, our regret bounds rely on weaker assumptions than those used in conventional one-point feedback methods. Numerical experiments show that ZO with residual feedback significantly outperforms existing one-point feedback methods in practice as well.
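To make the residual-feedback idea concrete, below is a minimal Python sketch of the one-point estimator described in the abstract: at each step a random direction u_t is drawn, a single function value y_t = f_t(x_t + δ u_t) is queried, and the gradient is estimated from the residual y_t − y_{t−1} between consecutive queries, i.e., g_t = (d/δ)(y_t − y_{t−1}) u_t. The function names, the step size eta, the smoothing radius delta, and the initialization of the previous feedback value are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sphere_sample(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    u = rng.standard_normal(d)
    return u / np.linalg.norm(u)

def zo_residual_feedback(f_seq, x0, delta=0.1, eta=0.05, T=1000, seed=0):
    """One-point ZO with residual feedback (sketch, not the paper's exact code).

    f_seq(t, x) returns the single available evaluation of the time-varying
    objective f_t at the query point x.
    """
    rng = np.random.default_rng(seed)
    d = x0.size
    x = x0.copy()
    y_prev = 0.0  # previous feedback value; how to initialize it is a modeling choice
    trajectory = []
    for t in range(T):
        u = sphere_sample(d, rng)
        y = f_seq(t, x + delta * u)          # the single query allowed at time t
        g = (d / delta) * (y - y_prev) * u   # residual-feedback gradient estimate
        x = x - eta * g                      # standard gradient-descent update
        y_prev = y
        trajectory.append(x.copy())
    return np.array(trajectory)

# Hypothetical usage: track the minimizer of a drifting quadratic f_t(x) = ||x - c_t||^2.
c = lambda t: np.array([np.sin(0.01 * t), np.cos(0.01 * t)])
f_seq = lambda t, x: float(np.sum((x - c(t)) ** 2))
traj = zo_residual_feedback(f_seq, x0=np.zeros(2), delta=0.05, eta=0.05, T=2000)
```

Note that, unlike conventional one-point feedback, the estimator reuses the previous query rather than subtracting a fixed baseline, which is the mechanism the abstract credits for the reduced variance.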
We consider the decision-making framework of online convex optimization with a very large number of experts. This setting is ubiquitous in contextual and reinforcement learning problems, where the size of the policy class renders enumeration and search…
We study the problem of learning the objective functions or constraints of a multiobjective decision-making model, based on a set of sequentially arriving decisions. In particular, these decisions might not be exact and possibly carry measurement noise…
We provide an online convex optimization algorithm with regret that interpolates between the regret of an algorithm using an optimal preconditioning matrix and one using a diagonal preconditioning matrix. Our regret bound is never worse than that obtained…
Policy Optimization (PO) is a widely used approach to address continuous control tasks. In this paper, we introduce the notion of mediator feedback that frames PO as an online learning problem over the policy space. The additional available information…
Task offloading is an emerging technology in fog-enabled networks. It allows users to transmit tasks to neighboring fog nodes so as to utilize the computing resources of the network. In this paper, we investigate a stochastic task offloading model and…