Nesterovs Accelerated Gradient and Momentum as approximations to Regularised Update Descent

65 0 0.0 ( 0 )

Download Cite

Added by David Barber

Publication date 2016

fields Mathematical Statistics Informatics Engineering

and research's language is English

Authors Aleksandar Botev - Guy Lever - David Barber

Machine Learning Machine Learning

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We present a unifying framework for adapting the update direction in gradient-based iterative optimization methods. As natural special cases we re-derive classical momentum and Nesterovs accelerated gradient method, lending a new intuitive interpretation to the latter algorithm. We show that a new algorithm, which we term Regularised Gradient Descent, can converge more quickly than either Nesterovs algorithm or the classical momentum algorithm.

rate research

Accelerated Stochastic Gradient Descent for Minimizing Finite Sums

771 - Atsushi Nitanda 2015

We propose an optimization method for minimizing the finite sums of smooth convex functions. Our method incorporates an accelerated gradient descent (AGD) and a stochastic variance reduction gradient (SVRG) in a mini-batch setting. Unlike SVRG, our method can be directly applied to non-strongly and strongly convex problems. We show that our method achieves a lower overall complexity than the recently proposed methods that supports non-strongly convex problems. Moreover, this method has a fast rate of convergence for strongly convex problems. Our experiments show the effectiveness of our method.

Machine Learning Machine Learning

Learning to Learn without Gradient Descent by Gradient Descent

276 - Yutian Chen , Matthew W. Hoffman , Sergio Gomez Colmenarejo 2016

We learn recurrent neural network optimizers trained on simple synthetic functions by gradient descent. We show that these learned optimizers exhibit a remarkable degree of transfer in that they can be used to efficiently optimize a broad range of derivative-free black-box functions, including Gaussian process bandits, simple control objectives, global optimization benchmarks and hyper-parameter tuning tasks. Up to the training horizon, the learned optimizers learn to trade-off exploration and exploitation, and compare favourably with heavily engineered Bayesian optimization packages for hyper-parameter tuning.

Machine Learning Machine Learning

Stein Variational Gradient Descent Without Gradient

239 - Jun Han , Qiang Liu 2018

Stein variational gradient decent (SVGD) has been shown to be a powerful approximate inference algorithm for complex distributions. However, the standard SVGD requires calculating the gradient of the target density and cannot be applied when the gradient is unavailable. In this work, we develop a gradient-free variant of SVGD (GF-SVGD), which replaces the true gradient with a surrogate gradient, and corrects the induced bias by re-weighting the gradients in a proper form. We show that our GF-SVGD can be viewed as the standard SVGD with a special choice of kernel, and hence directly inherits the theoretical properties of SVGD. We shed insights on the empirical choice of the surrogate gradient and propose an annealed GF-SVGD that leverages the idea of simulated annealing to improve the performance on high dimensional complex distributions. Empirical studies show that our method consistently outperforms a number of recent advanced gradient-free MCMC methods.

Machine Learning Machine Learning

Descent-to-Delete: Gradient-Based Methods for Machine Unlearning

118 - Seth Neel , Aaron Roth , Saeed Sharifi-Malvajerdi 2020

We study the data deletion problem for convex models. By leveraging techniques from convex optimization and reservoir sampling, we give the first data deletion algorithms that are able to handle an arbitrarily long sequence of adversarial updates while promising both per-deletion run-time and steady-state error that do not grow with the length of the update sequence. We also introduce several new conceptual distinctions: for example, we can ask that after a deletion, the entire state maintained by the optimization algorithm is statistically indistinguishable from the state that would have resulted had we retrained, or we can ask for the weaker condition that only the observable output is statistically indistinguishable from the observable output that would have resulted from retraining. We are able to give more efficient deletion algorithms under this weaker deletion criterion.

Machine Learning Machine Learning

Decentralised Learning with Random Features and Distributed Gradient Descent

116 - Dominic Richards , Patrick Rebeschini , Lorenzo Rosasco 2020

We investigate the generalisation performance of Distributed Gradient Descent with Implicit Regularisation and Random Features in the homogenous setting where a network of agents are given data sampled independently from the same unknown distribution. Along with reducing the memory footprint, Random Features are particularly convenient in this setting as they provide a common parameterisation across agents that allows to overcome previous difficulties in implementing Decentralised Kernel Regression. Under standard source and capacity assumptions, we establish high probability bounds on the predictive performance for each agent as a function of the step size, number of iterations, inverse spectral gap of the communication matrix and number of Random Features. By tuning these parameters, we obtain statistical rates that are minimax optimal with respect to the total number of samples in the network. The algorithm provides a linear improvement over single machine Gradient Descent in memory cost and, when agents hold enough data with respect to the network size and inverse spectral gap, a linear speed-up in computational runtime for any network topology. We present simulations that show how the number of Random Features, iterations and samples impact predictive performance.

Machine Learning Machine Learning Statistics Theory

Nesterovs Accelerated Gradient and Momentum as approximations to Regularised Update Descent

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions