POND: Pessimistic-Optimistic oNline Dispatching

57 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Xin Liu

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Xin Liu - Bin Li - Pengyi Shi

التعلم الآلي التحسين والتحكم

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

This paper considers constrained online dispatching with unknown arrival, reward and constraint distributions. We propose a novel online dispatching algorithm, named POND, standing for Pessimistic-Optimistic oNline Dispatching, which achieves $O(sqrt{T})$ regret and $O(1)$ constraint violation. Both bounds are sharp. Our experiments on synthetic and real datasets show that POND achieves low regret with minimal constraint violations.

قيم البحث

58 - Xin Liu , Bin Li , Pengyi Shi 2021

This paper considers stochastic linear bandits with general nonlinear constraints. The objective is to maximize the expected cumulative reward over horizon $T$ subject to a set of constraints in each round $tauleq T$. We propose a pessimistic-optimis tic algorithm for this problem, which is efficient in two aspects. First, the algorithm yields $tilde{cal O}left(left(frac{K^{0.75}}{delta}+dright)sqrt{tau}right)$ (pseudo) regret in round $tauleq T,$ where $K$ is the number of constraints, $d$ is the dimension of the reward feature space, and $delta$ is a Slaters constant; and zero constraint violation in any round $tau>tau,$ where $tau$ is independent of horizon $T.$ Second, the algorithm is computationally efficient. Our algorithm is based on the primal-dual approach in optimization and includes two components. The primal component is similar to unconstrained stochastic linear bandits (our algorithm uses the linear upper confidence bound algorithm (LinUCB)). The computational complexity of the dual component depends on the number of constraints, but is independent of the sizes of the contextual space, the action space, and the feature space. Thus, the overall computational complexity of our algorithm is similar to that of the linear UCB for unconstrained stochastic linear bandits.

التعلم الآلي

Continuous Online Learning and New Insights to Online Imitation Learning

105 - Jonathan Lee , Ching-An Cheng , Ken Goldberg 2019

Online learning is a powerful tool for analyzing iterative algorithms. However, the classic adversarial setup sometimes fails to capture certain regularity in online problems in practice. Motivated by this, we establish a new setup, called Continuous Online Learning (COL), where the gradient of online loss function changes continuously across rounds with respect to the learners decisions. We show that COL covers and more appropriately describes many interesting applications, from general equilibrium problems (EPs) to optimization in episodic MDPs. Using this new setup, we revisit the difficulty of achieving sublinear dynamic regret. We prove that there is a fundamental equivalence between achieving sublinear dynamic regret in COL and solving certain EPs, and we present a reduction from dynamic regret to both static regret and convergence rate of the associated EP. At the end, we specialize these new insights into online imitation learning and show improved understanding of its learning stability.

التعلم الآلي التحسين والتحكم التعلم الالي

Matrix-Free Preconditioning in Online Learning

197 - Ashok Cutkosky , Tamas Sarlos 2019

We provide an online convex optimization algorithm with regret that interpolates between the regret of an algorithm using an optimal preconditioning matrix and one using a diagonal preconditioning matrix. Our regret bound is never worse than that obt ained by diagonal preconditioning, and in certain setting even surpasses that of algorithms with full-matrix preconditioning. Importantly, our algorithm runs in the same time and space complexity as online gradient descent. Along the way we incorporate new techniques that mildly streamline and improve logarithmic factors in prior regret analyses. We conclude by benchmarking our algorithm on synthetic data and deep learning tasks.

التعلم الآلي التحسين والتحكم التعلم الالي

Inverse Multiobjective Optimization Through Online Learning

154 - Chaosheng Dong , Bo Zeng 2020

We study the problem of learning the objective functions or constraints of a multiobjective decision making model, based on a set of sequentially arrived decisions. In particular, these decisions might not be exact and possibly carry measurement nois e or are generated with the bounded rationality of decision makers. In this paper, we propose a general online learning framework to deal with this learning problem using inverse multiobjective optimization. More precisely, we develop two online learning algorithms with implicit update rules which can handle noisy data. Numerical results show that both algorithms can learn the parameters with great accuracy and are robust to noise.

التعلم الآلي التحسين والتحكم

Practical Open-Loop Optimistic Planning

52 - Edouard Leurent , Odalric-Ambrym Maillard 2019

We consider the problem of online planning in a Markov Decision Process when given only access to a generative model, restricted to open-loop policies - i.e. sequences of actions - and under budget constraint. In this setting, the Open-Loop Optimisti c Planning (OLOP) algorithm enjoys good theoretical guarantees but is overly conservative in practice, as we show in numerical experiments. We propose a modified version of the algorithm with tighter upper-confidence bounds, KLOLOP, that leads to better practical performances while retaining the sample complexity bound. Finally, we propose an efficient implementation that significantly improves the time complexity of both algorithms.

التعلم الآلي التعلم الالي