Loss landscapes and optimization in over-parameterized non-linear systems and neural networks

173 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Chaoyue Liu

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Chaoyue Liu - Libin Zhu - Mikhail Belkin

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. The purpose of this work is to propose a modern view and a general mathematical framework for loss landscapes and efficient optimization in over-parameterized machine learning models and systems of non-linear equations, a setting that includes over-parameterized deep neural networks. Our starting observation is that optimization problems corresponding to such systems are generally not convex, even locally. We argue that instead they satisfy PL$^$, a variant of the Polyak-Lojasiewicz condition on most (but not all) of the parameter space, which guarantees both the existence of solutions and efficient optimization by (stochastic) gradient descent (SGD/GD). The PL$^$ condition of these systems is closely related to the condition number of the tangent kernel associated to a non-linear system showing how a PL$^$-based non-linear theory parallels classical analyses of over-parameterized linear equations. We show that wide neural networks satisfy the PL$^$ condition, which explains the (S)GD convergence to a global minimum. Finally we propose a relaxation of the PL$^*$ condition applicable to almost over-parameterized systems.

قيم البحث

123 - Simon S. Du , Xiyu Zhai , Barnabas Poczos 2018

One of the mysteries in the success of neural networks is randomly initialized first order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surpr ising phenomenon for two-layer fully connected ReLU activated neural networks. For an $m$ hidden node shallow neural network with ReLU activation and $n$ training data, we show as long as $m$ is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. Our analysis relies on the following observation: over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows us to exploit a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum. We believe these insights are also useful in analyzing deep models and other first order methods.

التعلم الآلي التحسين والتحكم التعلم الالي

Analyzing Monotonic Linear Interpolation in Neural Network Loss Landscapes

85 - James Lucas , Juhan Bae , Michael R. Zhang 2021

Linear interpolation between initial neural network parameters and converged parameters after training with stochastic gradient descent (SGD) typically leads to a monotonic decrease in the training objective. This Monotonic Linear Interpolation (MLI) property, first observed by Goodfellow et al. (2014) persists in spite of the non-convex objectives and highly non-linear training dynamics of neural networks. Extending this work, we evaluate several hypotheses for this property that, to our knowledge, have not yet been explored. Using tools from differential geometry, we draw connections between the interpolated paths in function space and the monotonicity of the network - providing sufficient conditions for the MLI property under mean squared error. While the MLI property holds under various settings (e.g. network architectures and learning problems), we show in practice that networks violating the MLI property can be produced systematically, by encouraging the weights to move far from initialization. The MLI property raises important questions about the loss landscape geometry of neural networks and highlights the need to further study their global properties.

التعلم الآلي الذكاء الاصطناعي التعلم الالي

Exact Combinatorial Optimization with Graph Convolutional Neural Networks

113 - Maxime Gasse , Didier Chetelat , Nicola Ferroni 2019

Combinatorial optimization problems are typically tackled by the branch-and-bound paradigm. We propose a new graph convolutional neural network model for learning branch-and-bound variable selection policies, which leverages the natural variable-cons traint bipartite graph representation of mixed-integer linear programs. We train our model via imitation learning from the strong branching expert rule, and demonstrate on a series of hard problems that our approach produces policies that improve upon state-of-the-art machine-learning methods for branching and generalize to instances significantly larger than seen during training. Moreover, we improve for the first time over expert-designed branching rules implemented in a state-of-the-art solver on large problems. Code for reproducing all the experiments can be found at https://github.com/ds4dm/learn2branch.

التعلم الآلي التحسين والتحكم التعلم الالي

Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK

137 - Yuanzhi Li , Tengyu Ma , Hongyang R. Zhang 2020

We consider the dynamic of gradient descent for learning a two-layer neural network. We assume the input $xinmathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{star}(x) = a^{top}|W^{star}x|$, where $ainmathbb{R}^d$ is a nonnegative vector and $W^{star} inmathbb{R}^{dtimes d}$ is an orthonormal matrix. We show that an over-parametrized two-layer neural network with ReLU activation, trained by gradient descent from random initialization, can provably learn the ground truth network with population loss at most $o(1/d)$ in polynomial time with polynomial samples. On the other hand, we prove that any kernel method, including Neural Tangent Kernel, with a polynomial number of samples in $d$, has population loss at least $Omega(1 / d)$.

التعلم الآلي التحسين والتحكم التعلم الالي

Asymptotic convergence rate of Dropout on shallow linear neural networks

101 - Albert Senen-Cerda , Jaron Sanders 2020

We analyze the convergence rate of gradient flows on objective functions induced by Dropout and Dropconnect, when applying them to shallow linear Neural Networks (NNs) - which can also be viewed as doing matrix factorization using a particular regula rizer. Dropout algorithms such as these are thus regularization techniques that use 0,1-valued random variables to filter weights during training in order to avoid coadaptation of features. By leveraging a recent result on nonconvex optimization and conducting a careful analysis of the set of minimizers as well as the Hessian of the loss function, we are able to obtain (i) a local convergence proof of the gradient flow and (ii) a bound on the convergence rate that depends on the data, the dropout probability, and the width of the NN. Finally, we compare this theoretical bound to numerical simulations, which are in qualitative agreement with the convergence bound and match it when starting sufficiently close to a minimizer.

التعلم الآلي التحسين والتحكم التعلم الالي