Escaping Saddle Points in Distributed Newtons Method with Communication efficiency and Byzantine Resilience

347 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Raj Kumar Maity

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Avishek Ghosh - Raj Kumar Maity - Arya Mazumdar

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي التحسين والتحكم

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We study the problem of optimizing a non-convex loss function (with saddle points) in a distributed framework in the presence of Byzantine machines. We consider a standard distributed setting with one central machine (parameter server) communicating with many worker machines. Our proposed algorithm is a variant of the celebrated cubic-regularized Newton method of Nesterov and Polyak cite{nest}, which avoids saddle points efficiently and converges to local minima. Furthermore, our algorithm resists the presence of Byzantine machines, which may create emph{fake local minima} near the saddle points of the loss function, also known as saddle-point attack. We robustify the cubic-regularized Newton algorithm such that it avoids the saddle points and the fake local minimas efficiently. Furthermore, being a second order algorithm, the iteration complexity is much lower than its first order counterparts, and thus our algorithm communicates little with the parameter server. We obtain theoretical guarantees for our proposed scheme under several settings including approximate (sub-sampled) gradients and Hessians. Moreover, we validate our theoretical findings with experiments using standard datasets and several types of Byzantine attacks.

قيم البحث

64 - Tuyen Trung Truong , Tat Dat To , Tuan Hang Nguyen 2020

We propose in this paper New Q-Newtons method. The update rule is very simple conceptually, for example $x_{n+1}=x_n-w_n$ where $w_n=pr_{A_n,+}(v_n)-pr_{A_n,-}(v_n)$, with $A_n= abla ^2f(x_n)+delta _n|| abla f(x_n)||^2.Id$ and $v_n=A_n^{-1}. abla f(x _n)$. Here $delta _n$ is an appropriate real number so that $A_n$ is invertible, and $pr_{A_n,pm}$ are projections to the vector subspaces generated by eigenvectors of positive (correspondingly negative) eigenvalues of $A_n$. The main result of this paper roughly says that if $f$ is $C^3$ (can be unbounded from below) and a sequence ${x_n}$, constructed by the New Q-Newtons method from a random initial point $x_0$, {bf converges}, then the limit point is a critical point and is not a saddle point, and the convergence rate is the same as that of Newtons method. The first author has recently been successful incorporating Backtracking line search to New Q-Newtons method, thus resolving the convergence guarantee issue observed for some (non-smooth) cost functions. An application to quickly finding zeros of a univariate meromorphic function will be discussed. Various experiments are performed, against well known algorithms such as BFGS and Adaptive Cubic Regularization are presented.

التحسين والتحكم التعلم الآلي التحليل العددي

Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks

134 - Zhishuai Guo , Mingrui Liu , Zhuoning Yuan 2020

In this paper, we study distributed algorithms for large-scale AUC maximization with a deep neural network as a predictive model. Although distributed learning techniques have been investigated extensively in deep learning, they are not directly appl icable to stochastic AUC maximization with deep neural networks due to its striking differences from standard loss minimization problems (e.g., cross-entropy). Towards addressing this challenge, we propose and analyze a communication-efficient distributed optimization algorithm based on a {it non-convex concave} reformulation of the AUC maximization, in which the communication of both the primal variable and the dual variable between each worker and the parameter server only occurs after multiple steps of gradient-based updates in each worker. Compared with the naive parallel version of an existing algorithm that computes stochastic gradients at individual machines and averages them for updating the model parameters, our algorithm requires a much less number of communication rounds and still achieves a linear speedup in theory. To the best of our knowledge, this is the textbf{first} work that solves the {it non-convex concave min-max} problem for AUC maximization with deep neural networks in a communication-efficient distributed manner while still maintaining the linear speedup property in theory. Our experiments on several benchmark datasets show the effectiveness of our algorithm and also confirm our theory.

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي التحسين والتحكم

Escaping strict saddle points of the Moreau envelope in nonsmooth optimization

67 - Damek Davis , Mateo Diaz , Dmitriy Drusvyatskiy 2021

Recent work has shown that stochastically perturbed gradient methods can efficiently escape strict saddle points of smooth functions. We extend this body of work to nonsmooth optimization, by analyzing an inexact analogue of a stochastically perturbe d gradient method applied to the Moreau envelope. The main conclusion is that a variety of algorithms for nonsmooth optimization can escape strict saddle points of the Moreau envelope at a controlled rate. The main technical insight is that typical algorithms applied to the proximal subproblem yield directions that approximate the gradient of the Moreau envelope in relative terms.

التحسين والتحكم التعلم الآلي التعلم الالي

New Q-Newtons method meets Backtracking line search: good convergence guarantee, saddle points avoidance, quadratic rate of convergence, and easy implementation

112 - Tuyen Trung Truong 2021

In a recent joint work, the author has developed a modification of Newtons method, named New Q-Newtons method, which can avoid saddle points and has quadratic rate of convergence. While good theoretical convergence guarantee has not been established for this method, experiments on small scale problems show that the method works very competitively against other well known modifications of Newtons method such as Adaptive Cubic Regularization and BFGS, as well as first order methods such as Unbounded Two-way Backtracking Gradient Descent. In this paper, we resolve the convergence guarantee issue by proposing a modification of New Q-Newtons method, named New Q-Newtons method Backtracking, which incorporates a more sophisticated use of hyperparameters and a Backtracking line search. This new method has very good theoretical guarantees, which for a {bf Morse function} yields the following (which is unknown for New Q-Newtons method): {bf Theorem.} Let $f:mathbb{R}^mrightarrow mathbb{R}$ be a Morse function, that is all its critical points have invertible Hessian. Then for a sequence ${x_n}$ constructed by New Q-Newtons method Backtracking from a random initial point $x_0$, we have the following two alternatives: i) $lim_{nrightarrowinfty}||x_n||=infty$, or ii) ${x_n}$ converges to a point $x_{infty}$ which is a {bf local minimum} of $f$, and the rate of convergence is {bf quadratic}. Moreover, if $f$ has compact sublevels, then only case ii) happens. As far as we know, for Morse functions, this is the best theoretical guarantee for iterative optimization algorithms so far in the literature. We have tested in experiments on small scale, with some further simplifie

التحسين والتحكم التعلم الآلي التحليل العددي

Defending Against Saddle Point Attack in Byzantine-Robust Distributed Learning

200 - Dong Yin , Yudong Chen , Kannan Ramchandran 2018

We study robust distributed learning that involves minimizing a non-convex loss function with saddle points. We consider the Byzantine setting where some worker machines have abnormal or even arbitrary and adversarial behavior. In this setting, the B yzantine machines may create fake local minima near a saddle point that is far away from any true local minimum, even when robust gradient estimators are used. We develop ByzantinePGD, a robust first-order algorithm that can provably escape saddle points and fake local minima, and converge to an approximate true local minimizer with low iteration complexity. As a by-product, we give a simpler algorithm and analysis for escaping saddle points in the usual non-Byzantine setting. We further discuss three robust gradient estimators that can be used in ByzantinePGD, including median, trimmed mean, and iterative filtering. We characterize their performance in concrete statistical settings, and argue for their near-optimality in low and high dimensional regimes.

التعلم الآلي التشفير والأمن النظم الموزعة والتوازية والحوسبة العنقودية