ﻻ يوجد ملخص باللغة العربية
We propose a new variant of AMSGrad, a popular adaptive gradient based optimization algorithm widely used for training deep neural networks. Our algorithm adds prior knowledge about the sequence of consecutive mini-batch gradients and leverages its underlying structure making the gradients sequentially predictable. By exploiting the predictability and ideas from optimistic online learning, the proposed algorithm can accelerate the convergence and increase sample efficiency. After establishing a tighter upper bound under some convexity conditions on the regret, we offer a complimentary view of our algorithm which generalizes the offline and stochastic version of nonconvex optimization. In the nonconvex case, we establish a non-asymptotic convergence bound independently of the initialization. We illustrate the practical speedup on several deep learning models via numerical experiments.
Existing nonconvex statistical optimization theory and methods crucially rely on the correct specification of the underlying true statistical models. To address this issue, we take a first step towards taming model misspecification by studying the hi
Bayesian coresets have emerged as a promising approach for implementing scalable Bayesian inference. The Bayesian coreset problem involves selecting a (weighted) subset of the data samples, such that the posterior inference using the selected subset
Solving statistical learning problems often involves nonconvex optimization. Despite the empirical success of nonconvex statistical optimization methods, their global dynamics, especially convergence to the desirable local minima, remain less well un
The Expectation Maximization (EM) algorithm is a key reference for inference in latent variable models; unfortunately, its computational cost is prohibitive in the large scale learning setting. In this paper, we propose an extension of the Stochastic
We formulate gradient-based Markov chain Monte Carlo (MCMC) sampling as optimization on the space of probability measures, with Kullback-Leibler (KL) divergence as the objective functional. We show that an underdamped form of the Langevin algorithm p