Convergence rates of sub-sampled Newton methods

60 0 0.0 ( 0 )

Download Cite

Added by Murat A. Erdogdu

Publication date 2015

fields Mathematical Statistics

and research's language is English

Authors Murat A. Erdogdu - Andrea Montanari

Machine Learning

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We consider the problem of minimizing a sum of $n$ functions over a convex parameter set $mathcal{C} subset mathbb{R}^p$ where $ngg pgg 1$. In this regime, algorithms which utilize sub-sampling techniques are known to be effective. In this paper, we use sub-sampling techniques together with low-rank approximation to design a new randomized batch algorithm which possesses comparable convergence rate to Newtons method, yet has much smaller per-iteration cost. The proposed algorithm is robust in terms of starting point and step size, and enjoys a composite convergence rate, namely, quadratic convergence at start and linear convergence when the iterate is close to the minimizer. We develop its theoretical analysis which also allows us to select near-optimal algorithm parameters. Our theoretical results can be used to obtain convergence rates of previously proposed sub-sampling based algorithms as well. We demonstrate how our results apply to well-known machine learning problems. Lastly, we evaluate the performance of our algorithm on several datasets under various scenarios.

rate research

Sub-sampled Newton Methods with Non-uniform Sampling

183 - Peng Xu , Jiyan Yang , Farbod Roosta-Khorasani 2016

We consider the problem of finding the minimizer of a convex function $F: mathbb R^d rightarrow mathbb R$ of the form $F(w) := sum_{i=1}^n f_i(w) + R(w)$ where a low-rank factorization of $ abla^2 f_i(w)$ is readily available. We consider the regime where $n gg d$. As second-order methods prove to be effective in finding the minimizer to a high-precision, in this work, we propose randomized Newton-type algorithms that exploit textit{non-uniform} sub-sampling of ${ abla^2 f_i(w)}_{i=1}^{n}$, as well as inexact updates, as means to reduce the computational complexity. Two non-uniform sampling distributions based on {it block norm squares} and {it block partial leverage scores} are considered in order to capture important terms among ${ abla^2 f_i(w)}_{i=1}^{n}$. We show that at each iteration non-uniformly sampling at most $mathcal O(d log d)$ terms from ${ abla^2 f_i(w)}_{i=1}^{n}$ is sufficient to achieve a linear-quadratic convergence rate in $w$ when a suitable initial point is provided. In addition, we show that our algorithms achieve a lower computational complexity and exhibit more robustness and better dependence on problem specific quantities, such as the condition number, compared to similar existing methods, especially the ones based on uniform sampling. Finally, we empirically demonstrate that our methods are at least twice as fast as Newtons methods with ridge logistic regression on several real datasets.

Optimization and Control Machine Learning

Revisit of Spectral Bundle Methods: Primal-dual (Sub)linear Convergence Rates

77 - Lijun Ding , Benjamin Grimmer 2020

The spectral bundle method proposed by Helmberg and Rendl is well established for solving large scale semidefinite programs (SDP) thanks to its low per iteration computational complexity and strong practical performance. In this paper, we revisit this classic method showing it achieves sublinear convergence rates in terms of both primal and dual SDPs under merely strong duality. Prior to this work, only limited dual guarantees were known. Moreover, we develop a novel variant, called the block spectral bundle method (Block-Spec), which not only enjoys the same convergence rate and low per iteration complexity, but also speeds up to linear convergence when the SDP admits strict complementarity. Numerically, we demonstrate the effectiveness of both methods, confirming our theoretical findings that the block spectral bundle method can substantially speed up convergence.

Optimization and Control

Convergence Guarantees for Adaptive Bayesian Quadrature Methods

206 - Motonobu Kanagawa , Philipp Hennig 2019

Adaptive Bayesian quadrature (ABQ) is a powerful approach to numerical integration that empirically compares favorably with Monte Carlo integration on problems of medium dimensionality (where non-adaptive quadrature is not competitive). Its key ingredient is an acquisition function that changes as a function of previously collected values of the integrand. While this adaptivity appears to be empirically powerful, it complicates analysis. Consequently, there are no theoretical guarantees so far for this class of methods. In this work, for a broad class of adaptive Bayesian quadrature methods, we prove consistency, deriving non-tight but informative convergence rates. To do so we introduce a new concept we call weak adaptivity. Our results identify a large and flexible class of adaptive Bayesian quadrature rules as consistent, within which practitioners can develop empirically efficient methods.

Machine Learning Machine Learning Numerical Analysis

Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors

221 - Atsushi Nitanda , Taiji Suzuki 2018

We consider stochastic gradient descent and its averaging variant for binary classification problems in a reproducing kernel Hilbert space. In the traditional analysis using a consistency property of loss functions, it is known that the expected classification error converges more slowly than the expected risk even when assuming a low-noise condition on the conditional label probabilities. Consequently, the resulting rate is sublinear. Therefore, it is important to consider whether much faster convergence of the expected classification error can be achieved. In recent research, an exponential convergence rate for stochastic gradient descent was shown under a strong low-noise condition but provided theoretical analysis was limited to the squared loss function, which is somewhat inadequate for binary classification tasks. In this paper, we show an exponential convergence of the expected classification error in the final phase of the stochastic gradient descent for a wide class of differentiable convex loss functions under similar assumptions. As for the averaged stochastic gradient descent, we show that the same convergence rate holds from the early phase of training. In experiments, we verify our analyses on the $L_2$-regularized logistic regression.

Machine Learning Machine Learning Optimization and Control

Accelerated Linear Convergence of Stochastic Momentum Methods in Wasserstein Distances

153 - Bugra Can , Mert Gurbuzbalaban , Lingjiong Zhu 2019

Momentum methods such as Polyaks heavy ball (HB) method, Nesterovs accelerated gradient (AG) as well as accelerated projected gradient (APG) method have been commonly used in machine learning practice, but their performance is quite sensitive to noise in the gradients. We study these methods under a first-order stochastic oracle model where noisy estimates of the gradients are available. For strongly convex problems, we show that the distribution of the iterates of AG converges with the accelerated $O(sqrt{kappa}log(1/varepsilon))$ linear rate to a ball of radius $varepsilon$ centered at a unique invariant distribution in the 1-Wasserstein metric where $kappa$ is the condition number as long as the noise variance is smaller than an explicit upper bound we can provide. Our analysis also certifies linear convergence rates as a function of the stepsize, momentum parameter and the noise variance; recovering the accelerated rates in the noiseless case and quantifying the level of noise that can be tolerated to achieve a given performance. In the special case of strongly convex quadratic objectives, we can show accelerated linear rates in the $p$-Wasserstein metric for any $pgeq 1$ with improved sensitivity to noise for both AG and HB through a non-asymptotic analysis under some additional assumptions on the noise structure. Our analysis for HB and AG also leads to improved non-asymptotic convergence bounds in suboptimality for both deterministic and stochastic settings which is of independent interest. To the best of our knowledge, these are the first linear convergence results for stochastic momentum methods under the stochastic oracle model. We also extend our results to the APG method and weakly convex functions showing accelerated rates when the noise magnitude is sufficiently small.

Machine Learning Machine Learning Optimization and Control