Non-asymptotic estimates for TUSLA algorithm for non-convex learning with applications to neural networks with ReLU activation function

Published by Ariel Neufeld

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Authors: Dong-Young Lim, Ariel Neufeld, Sotirios Sabanis

Optimization and Control, Machine Learning, Numerical Analysis

We consider non-convex stochastic optimization problems where the objective functions have super-linearly growing and discontinuous stochastic gradients. In such a setting, we provide a non-asymptotic analysis for the tamed unadjusted stochastic Langevin algorithm (TUSLA) introduced in Lovas et al. (2021). In particular, we establish non-asymptotic error bounds for the TUSLA algorithm in Wasserstein-1 and Wasserstein-2 distances. The latter result enables us to further derive non-asymptotic estimates for the expected excess risk. To illustrate the applicability of the main results, we consider an example from transfer learning with ReLU neural networks, which represents a key paradigm in machine learning. Numerical experiments for this example support our theoretical findings. Hence, in this setting, we demonstrate both theoretically and numerically that the TUSLA algorithm can solve the optimization problem involving neural networks with ReLU activation function. In addition, we provide simulation results for synthetic examples where popular algorithms, e.g. ADAM, AMSGrad, RMSProp, and (vanilla) SGD, may fail to find the minimizer of the objective functions due to the super-linear growth and the discontinuity of the corresponding stochastic gradient, while the TUSLA algorithm converges rapidly to the optimal solution.
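To make the taming idea concrete, here is a minimal sketch of a tamed Langevin update in the form commonly stated for TUSLA: the stochastic gradient is divided by 1 + sqrt(lam) * |theta|^(2r) before the step, and Gaussian noise scaled by sqrt(2 * lam / beta) is added. The toy objective u(theta) = |theta|^4 / 4, the parameter values, and all function names are assumptions chosen for illustration; they are not taken from the paper, and the gradient here is deterministic and continuous, so the sketch only illustrates the effect of super-linear growth.

```python
import numpy as np


def tusla_step(theta, grad, lam, beta, r, rng, tamed=True):
    """One update: theta - lam * grad / taming + sqrt(2 * lam / beta) * noise."""
    if tamed:
        # Taming factor 1 + sqrt(lam) * |theta|^(2r) damps large gradients.
        taming = 1.0 + np.sqrt(lam) * np.linalg.norm(theta) ** (2 * r)
    else:
        # Untamed scheme for comparison (plain Langevin step).
        taming = 1.0
    noise = rng.standard_normal(theta.shape)
    return theta - lam * grad / taming + np.sqrt(2.0 * lam / beta) * noise


def toy_gradient(theta):
    """Gradient of the hypothetical objective u(theta) = |theta|^4 / 4 (cubic growth)."""
    return np.sum(theta ** 2) * theta


def run(theta0, lam=0.1, beta=1e8, r=1, n_steps=2000, tamed=True, seed=0):
    """Iterate the update from theta0 with a fixed step size lam."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    # Silence overflow warnings: the untamed iterates are expected to blow up.
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(n_steps):
            theta = tusla_step(theta, toy_gradient(theta), lam, beta, r, rng, tamed)
    return theta


theta_tamed = run([10.0])
theta_untamed = run([10.0], tamed=False)
print("tamed:", theta_tamed, "untamed:", theta_untamed)
```

With these assumed settings, the tamed iterates move steadily toward the minimizer at 0, while the untamed iterates overshoot on the first step and then overflow.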

Attila Lovas, Iosif Lytras, Miklos Rasonyi (2020)

Artificial neural networks (ANNs) are typically highly nonlinear systems which are finely tuned via the optimization of their associated, non-convex loss functions. Typically, the gradient of any such loss function fails to be dissipative, making the use of widely accepted (stochastic) gradient descent methods problematic. We offer a new learning algorithm based on an appropriately constructed variant of the popular stochastic gradient Langevin dynamics (SGLD), which is called the tamed unadjusted stochastic Langevin algorithm (TUSLA). We also provide a non-asymptotic analysis of the new algorithm's convergence properties in the context of non-convex learning problems with the use of ANNs. Thus, we provide finite-time guarantees for TUSLA to find approximate minimizers of both empirical and population risks. The roots of the TUSLA algorithm are based on the taming technology for diffusion processes with superlinear coefficients as developed in \citet{tamed-euler, SabanisAoAP} and for MCMC algorithms in \citet{tula}. Numerical experiments are presented which confirm the theoretical findings and illustrate the need for the use of the new algorithm in comparison to vanilla SGLD within the framework of ANNs.

Machine Learning, Optimization and Control, Probability

Relaxed Majorization-Minimization for Non-smooth and Non-convex Optimization

Chen Xu, Zhouchen Lin, Zhenyu Zhao (2015)

We propose a new majorization-minimization (MM) method for non-smooth and non-convex programs, which is general enough to include the existing MM methods. Besides the local majorization condition, we only require that the difference between the directional derivatives of the objective function and its surrogate function vanishes when the number of iterations approaches infinity, which is a very weak condition. Our method can therefore use a surrogate function that directly approximates the non-smooth objective function. In comparison, all the existing MM methods construct the surrogate function by approximating the smooth component of the objective function. We apply our relaxed MM methods to the robust matrix factorization (RMF) problem with different regularizations, where our locally majorant algorithm shows advantages over the state-of-the-art approaches for RMF. This is the first algorithm for RMF ensuring, without extra assumptions, that any limit point of the iterates is a stationary point.
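For readers unfamiliar with the basic MM template that this abstract generalizes, the following sketch shows a classical (not relaxed) MM iteration on a simple non-smooth problem: minimizing f(x) = sum_i |x - a_i|, whose minimizer is a median of the data. Each absolute value is majorized at the current iterate by a quadratic, so every surrogate is a weighted least-squares problem. The problem, the data, and the function name `mm_median` are assumptions chosen for illustration and do not come from the paper.

```python
import numpy as np


def mm_median(a, n_iter=100, eps=1e-12):
    """Minimize f(x) = sum_i |x - a_i| by classical majorization-minimization."""
    a = np.asarray(a, dtype=float)
    x = a.mean()  # starting point chosen for convenience
    for _ in range(n_iter):
        # Majorize each |x - a_i| at the current iterate x_k by the quadratic
        # (x - a_i)^2 / (2 |x_k - a_i|) + |x_k - a_i| / 2, which touches it at x_k.
        # eps guards against division by zero when x_k hits a data point.
        w = 1.0 / np.maximum(np.abs(x - a), eps)
        # The surrogate is a weighted least-squares problem with a closed-form minimizer.
        x = np.sum(w * a) / np.sum(w)
    return x


data = [1.0, 2.0, 3.0, 10.0, 50.0]
x_mm = mm_median(data)
print(x_mm, np.median(data))
```

Because each surrogate lies above the objective and agrees with it at the current iterate, minimizing the surrogate cannot increase the objective; this is the local majorization condition the abstract refers to.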

Optimization and Control, Machine Learning, Numerical Analysis

Breaking Reversibility Accelerates Langevin Dynamics for Global Non-Convex Optimization

Xuefeng Gao, Mert Gurbuzbalaban, Lingjiong Zhu (2018)

Langevin dynamics (LD) has been proven to be a powerful technique for optimizing a non-convex objective as an efficient algorithm to find local minima while eventually visiting a global minimum on longer time-scales. LD is based on the first-order Langevin diffusion, which is reversible in time. We study two variants that are based on non-reversible Langevin diffusions: the underdamped Langevin dynamics (ULD) and the Langevin dynamics with a non-symmetric drift (NLD). Adapting the techniques of Tzen, Liang and Raginsky (2018) for LD to non-reversible diffusions, we show that for a given local minimum that is within an arbitrary distance from the initialization, with high probability, either the ULD trajectory ends up somewhere outside a small neighborhood of this local minimum within a recurrence time which depends on the smallest eigenvalue of the Hessian at the local minimum, or it enters this neighborhood by the recurrence time and stays there for a potentially exponentially long escape time. The ULD algorithms improve upon the recurrence time obtained for LD in Tzen, Liang and Raginsky (2018) with respect to the dependency on the smallest eigenvalue of the Hessian at the local minimum. A similar result and improvement are obtained for the NLD algorithm. We also show that non-reversible variants can exit the basin of attraction of a local minimum faster in discrete time when the objective has two local minima separated by a saddle point, and we quantify the amount of improvement. Our analysis suggests that non-reversible Langevin algorithms are more efficient at locating a local minimum as well as exploring the state space. Our analysis is based on the quadratic approximation of the objective around a local minimum. As a by-product of our analysis, we obtain optimal mixing rates for quadratic objectives in the 2-Wasserstein distance for the two non-reversible Langevin algorithms we consider.

Optimization and Control, Machine Learning, Numerical Analysis

GAMA: A Novel Algorithm for Non-Convex Integer Programs

Hedayat Alghassi, Raouf Dridi, Sridhar Tayur (2019)

Inspired by the decomposition in the hybrid quantum-classical optimization algorithm we introduced in arXiv:1902.04215, we propose here a new (fully classical) approach to solving certain non-convex integer programs using Graver bases. This method is well suited when (a) the constraint matrix $A$ has a special structure so that its Graver basis can be computed systematically, (b) several feasible solutions can also be constructed easily, and (c) the objective function can be viewed as many convex functions quilted together. Classes of problems that satisfy these conditions include Cardinality Boolean Quadratic Problems (CBQP), Quadratic Semi-Assignment Problems (QSAP) and Quadratic Assignment Problems (QAP). Our Graver Augmented Multi-seed Algorithm (GAMA) utilizes augmentation along Graver basis elements (the improvement direction is obtained by comparing objective function values) from these multiple initial feasible solutions. We compare our approach with a best-in-class commercially available solver (Gurobi). Sensitivity analysis indicates that the rate at which GAMA slows down as the problem size increases is much lower than that of Gurobi. We find that for several instances of practical relevance, GAMA not only vastly outperforms Gurobi in terms of time to find the optimal solution (by two or three orders of magnitude), but also, in several cases, finds optimal solutions within minutes when the commercial solver is not able to do so in 4 or 10 hours (depending on the problem class).

Optimization and Control, Discrete Mathematics, Data Structures and Algorithms

Accelerated Primal-Dual Algorithm for Distributed Non-convex Optimization

Shengjun Zhang, Colleen P. Bailey (2021)

This paper investigates accelerating the convergence of distributed optimization algorithms on non-convex problems. We propose a distributed primal-dual stochastic gradient descent (SGD) algorithm equipped with the Powerball method to accelerate convergence. We show that the proposed algorithm achieves the linear speedup convergence rate $\mathcal{O}(1/\sqrt{nT})$ for general smooth (possibly non-convex) cost functions. We demonstrate the efficiency of the algorithm through numerical experiments by training two-layer fully connected neural networks and convolutional neural networks on the MNIST dataset to compare with state-of-the-art distributed SGD algorithms and centralized SGD algorithms.
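As a rough illustration of the Powerball idea mentioned above, the sketch below applies the elementwise transform sign(g) * |g|^gamma, with gamma in (0, 1), to the gradient before each descent step. It is a centralized, deterministic toy and does not implement the distributed primal-dual algorithm from the paper; the quadratic test problem, the step size, and the function names are assumptions chosen for illustration.

```python
import numpy as np


def powerball(grad, gamma):
    """Elementwise Powerball transform sign(g) * |g|**gamma, with gamma in (0, 1)."""
    return np.sign(grad) * np.abs(grad) ** gamma


def powerball_gd(grad_fn, x0, step=0.1, gamma=0.5, n_steps=500):
    """Plain (centralized, deterministic) gradient descent on the transformed gradient."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - step * powerball(grad_fn(x), gamma)
    return x


# Hypothetical test problem: f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final = powerball_gd(lambda x: x, [4.0, -9.0])
print(x_final)
```

On this toy problem with a constant step size, the iterates typically end up oscillating within a small neighborhood of the minimizer rather than converging to it exactly, because the transform enlarges gradient entries whose magnitude is below 1.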

Optimization and Control