Just a Momentum: Analytical Study of Momentum-Based Acceleration Methods in Paradigmatic High-Dimensional Non-Convex Problems

Added by Stefano Sarao Mannelli

Publication date 2021

fields Physics Informatics Engineering

Research language English

Authors Stefano Sarao Mannelli - Pierfrancesco Urbani

When optimizing over loss functions, it is common practice to use momentum-based accelerated methods rather than vanilla gradient-based methods. Despite being widely applied to arbitrary loss functions, their behaviour in generically non-convex, high-dimensional landscapes is poorly understood. In this work we use dynamical mean field theory techniques to describe analytically the average behaviour of these methods in a prototypical non-convex model: the (spiked) matrix-tensor model. We derive a closed set of equations that describe the behaviour of several algorithms, including heavy-ball momentum and Nesterov acceleration. Additionally, we characterize the evolution of a mathematically equivalent physical system of massive particles relaxing toward the bottom of an energetic landscape. Under the correct mapping the two dynamics are equivalent, and it can be noticed that having a large mass increases the effective time step of the heavy-ball dynamics, leading to a speed-up.
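For readers unfamiliar with the algorithms named in the abstract, the following is a minimal sketch of vanilla gradient descent, heavy-ball momentum, and Nesterov acceleration in their common textbook forms. It is not the paper's analysis and does not use the matrix-tensor model: the ill-conditioned quadratic, the step size 0.01, the momentum 0.9, and all function names are assumptions chosen only for illustration.

```python
import math


def gradient_descent(grad, x0, lr, steps):
    """Vanilla gradient descent: x <- x - lr * grad(x)."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def heavy_ball(grad, x0, lr, beta, steps):
    """Heavy-ball momentum: the velocity accumulates past gradients."""
    x = list(x0)
    v = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        v = [beta * vi - lr * gi for vi, gi in zip(v, g)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return x


def nesterov(grad, x0, lr, beta, steps):
    """Nesterov acceleration: the gradient is taken at a look-ahead point."""
    x = list(x0)
    v = [0.0] * len(x)
    for _ in range(steps):
        lookahead = [xi + beta * vi for xi, vi in zip(x, v)]
        g = grad(lookahead)
        v = [beta * vi - lr * gi for vi, gi in zip(v, g)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return x


def norm(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(xi * xi for xi in x))


def toy_grad(x):
    """Gradient of the assumed toy loss f(x) = 0.5 * (x[0]**2 + 100 * x[1]**2)."""
    return [x[0], 100.0 * x[1]]


if __name__ == "__main__":
    start = [1.0, 1.0]
    print("GD        :", norm(gradient_descent(toy_grad, start, 0.01, 100)))
    print("Heavy ball:", norm(heavy_ball(toy_grad, start, 0.01, 0.9, 100)))
    print("Nesterov  :", norm(nesterov(toy_grad, start, 0.01, 0.9, 100)))
```

On this toy problem the minimum is at the origin, so the printed norms are distances to the optimum after 100 steps. Both momentum variants end up closer than plain gradient descent with the same step size. The toy loss is convex, so this illustrates the speed-up only; it says nothing about the non-convex landscapes the paper studies.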

High dimensional optimization under non-convex excluded volume constraints

Antonio Sclocchi, Pierfrancesco Urbani, 2021

We consider high-dimensional random optimization problems where the dynamical variables are subjected to non-convex excluded volume constraints. We focus on the case in which the cost function is a simple quadratic cost and the excluded volume constraints are modeled by a perceptron constraint satisfaction problem. We show that, depending on the density of constraints, one can have different situations. If the number of constraints is small, one typically has a phase where the ground state of the cost function is unique and sits on the boundary of the island of configurations allowed by the constraints. In this case there is a hypostatic number of constraints that are marginally satisfied. If the number of constraints is increased, one enters a glassy phase where the cost function has many local minima, sitting again on the boundary of the regions of allowed configurations. At the phase transition point the total number of constraints that are marginally satisfied becomes equal to the number of degrees of freedom in the problem, and therefore we say that these minima are isostatic. We conjecture that, upon further increasing the number of constraints, the system stays isostatic up to the point where the volume of available phase space shrinks to zero. We derive our results using the replica method. We also analyze a dynamical algorithm, the Karush-Kuhn-Tucker algorithm, through dynamical mean field theory, and we show how to recover the results of the replica approach in the replica symmetric phase.

Disordered Systems and Neural Networks Statistical Mechanics

Distributed Stochastic Non-Convex Optimization: Momentum-Based Variance Reduction

Prashant Khanduri, Pranay Sharma, Swatantra Kafle, 2020

In this work, we propose a distributed algorithm for stochastic non-convex optimization. We consider a worker-server architecture where a set of $K$ worker nodes (WNs) in collaboration with a server node (SN) jointly aim to minimize a global, potentially non-convex objective function. The objective function is assumed to be the sum of local objective functions available at each WN, with each node having access to only the stochastic samples of its local objective function. In contrast to the existing approaches, we employ a momentum based single loop distributed algorithm which eliminates the need of computing large batch size gradients to achieve variance reduction. We propose two algorithms one with adaptive and the other with non-adaptive learning rates. We show that the proposed algorithms achieve the optimal computational complexity while attaining linear speedup with the number of WNs. Specifically, the algorithms reach an $\epsilon$-stationary point $x_a$ with $\mathbb{E}\|\nabla f(x_a)\| \leq \tilde{O}(K^{-1/3}T^{-1/2} + K^{-1/3}T^{-1/3})$ in $T$ iterations, thereby requiring $\tilde{O}(K^{-1} \epsilon^{-3})$ gradient computations at each WN. Moreover, our approach does not assume identical data distributions across WNs making the approach general enough for federated learning applications.
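The abstract does not spell out the update rule, so the sketch below is only an illustration of the general idea of single-loop, momentum-based variance reduction in a worker-server setting. It assumes a STORM-style recursive gradient estimator, scalar quadratic local objectives $f_k(x) = \frac{1}{2}(x - c_k)^2$, bounded additive gradient noise, and a fixed (non-adaptive) learning rate. All names and parameter values are hypothetical, and the code should not be read as the authors' algorithm.

```python
import random


def make_worker_oracle(center, noise, rng):
    """Stochastic oracle for the assumed local loss f_k(x) = 0.5 * (x - center)**2.

    Each call draws ONE sample and returns a gradient function for that
    sample, so the same sample can be evaluated at two different points.
    """
    def sample():
        xi = rng.uniform(-noise, noise)
        return lambda x: (x - center) + xi
    return sample


def distributed_momentum_vr(centers, x0, lr, a, steps, noise=0.1, seed=0):
    """Single-loop worker-server sketch with a STORM-style estimator (assumed).

    Worker k keeps d_k and updates it with one fresh sample per iteration:
        d_k <- g(x_new) + (1 - a) * (d_k - g(x_old))
    The server averages the d_k and takes a plain gradient-type step.
    """
    rng = random.Random(seed)
    oracles = [make_worker_oracle(c, noise, rng) for c in centers]
    x = x0
    # Initial estimators: a single stochastic gradient per worker, no large batch.
    d = [oracle()(x) for oracle in oracles]
    for _ in range(steps):
        # Server: average the workers' estimators and update the iterate.
        d_bar = sum(d) / len(d)
        x_old, x = x, x - lr * d_bar
        # Workers: refresh the estimators using one new sample each.
        new_d = []
        for oracle, d_k in zip(oracles, d):
            g = oracle()
            new_d.append(g(x) + (1.0 - a) * (d_k - g(x_old)))
        d = new_d
    return x


if __name__ == "__main__":
    centers = [1.0, 2.0, 3.0, 6.0]  # different local optima: non-identical data
    x_final = distributed_momentum_vr(centers, x0=8.0, lr=0.1, a=0.1, steps=200)
    print("final iterate:", x_final, "global minimizer:", sum(centers) / len(centers))
```

In this toy setting the minimizer of the sum of local objectives is the mean of the centers. Because the same sample is evaluated at both the new and the old iterate, the correction term cancels part of the noise carried by the previous estimator, which is why no large-batch gradient is needed.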

Optimization and Control Distributed Parallel and Cluster Computing

Accelerated Linear Convergence of Stochastic Momentum Methods in Wasserstein Distances

Bugra Can, Mert Gurbuzbalaban, Lingjiong Zhu, 2019

Momentum methods such as Polyak's heavy ball (HB) method, Nesterov's accelerated gradient (AG) as well as accelerated projected gradient (APG) method have been commonly used in machine learning practice, but their performance is quite sensitive to noise in the gradients. We study these methods under a first-order stochastic oracle model where noisy estimates of the gradients are available. For strongly convex problems, we show that the distribution of the iterates of AG converges with the accelerated $O(\sqrt{\kappa}\log(1/\varepsilon))$ linear rate to a ball of radius $\varepsilon$ centered at a unique invariant distribution in the 1-Wasserstein metric where $\kappa$ is the condition number as long as the noise variance is smaller than an explicit upper bound we can provide. Our analysis also certifies linear convergence rates as a function of the stepsize, momentum parameter and the noise variance; recovering the accelerated rates in the noiseless case and quantifying the level of noise that can be tolerated to achieve a given performance. In the special case of strongly convex quadratic objectives, we can show accelerated linear rates in the $p$-Wasserstein metric for any $p \geq 1$ with improved sensitivity to noise for both AG and HB through a non-asymptotic analysis under some additional assumptions on the noise structure. Our analysis for HB and AG also leads to improved non-asymptotic convergence bounds in suboptimality for both deterministic and stochastic settings which is of independent interest. To the best of our knowledge, these are the first linear convergence results for stochastic momentum methods under the stochastic oracle model. We also extend our results to the APG method and weakly convex functions showing accelerated rates when the noise magnitude is sufficiently small.

Machine Learning Machine Learning Optimization and Control

Screening Rules for Convex Problems

Anant Raj, Jakob Olbrich, Bernd Gartner, 2016

We propose a new framework for deriving screening rules for convex optimization problems. Our approach covers a large class of constrained and penalized optimization formulations, and works in two steps. First, given any approximate point, the structure of the objective function and the duality gap are used to gather information on the optimal solution. In the second step, this information is used to produce screening rules, i.e. to safely identify unimportant weight variables of the optimal solution. Our general framework leads to a large variety of useful existing as well as new screening rules for many applications. For example, we provide new screening rules for general simplex and $L_1$-constrained problems, Elastic Net, squared-loss Support Vector Machines, minimum enclosing ball, as well as structured norm regularized problems, such as group lasso.

Optimization and Control Machine Learning Machine Learning

Lock-Free Optimization for Non-Convex Problems

Shen-Yi Zhao, Gong-Duo Zhang, Wu-Jun Li, 2016

Stochastic gradient descent (SGD) and its variants have attracted much attention in machine learning due to their efficiency and effectiveness for optimization. To handle large-scale problems, researchers have recently proposed several lock-free strategy based parallel SGD (LF-PSGD) methods for multi-core systems. However, existing works have only proved the convergence of these LF-PSGD methods for convex problems. To the best of our knowledge, no work has proved the convergence of the LF-PSGD methods for non-convex problems. In this paper, we provide a theoretical proof of the convergence of two representative LF-PSGD methods, Hogwild! and AsySVRG, for non-convex problems. Empirical results also show that both Hogwild! and AsySVRG are convergent on non-convex problems, which is consistent with our theoretical results.

Machine Learning Machine Learning
