Vanishing Curvature and the Power of Adaptive Methods in Randomly Initialized Deep Networks

52 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Jonas Kohler

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Antonio Orvieto - Jonas Kohler - Dario Pavllo

التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

This paper revisits the so-called vanishing gradient phenomenon, which commonly occurs in deep randomly initialized neural networks. Leveraging an in-depth analysis of neural chains, we first show that vanishing gradients cannot be circumvented when the network width scales with less than O(depth), even when initialized with the popular Xavier and He initializations. Second, we extend the analysis to second-order derivatives and show that random i.i.d. initialization also gives rise to Hessian matrices with eigenspectra that vanish as networks grow in depth. Whenever this happens, optimizers are initialized in a very flat, saddle point-like plateau, which is particularly hard to escape with stochastic gradient descent (SGD) as its escaping time is inversely related to curvature. We believe that this observation is crucial for fully understanding (a) historical difficulties of training deep nets with vanilla SGD, (b) the success of adaptive gradient methods (which naturally adapt to curvature and thus quickly escape flat plateaus) and (c) the effectiveness of modern architectural components like residual connections and normalization layers.

قيم البحث

77 - Kyle Luther , H. Sebastian Seung 2019

Before training a neural net, a classic rule of thumb is to randomly initialize the weights so the variance of activations is preserved across layers. This is traditionally interpreted using the total variance due to randomness in both weights emph{a nd} samples. Alternatively, one can interpret the rule of thumb as preservation of the variance over samples for a fixed network. The two interpretations differ little for a shallow net, but the difference is shown to grow with depth for a deep ReLU net by decomposing the total variance into the network-averaged sum of the sample variance and square of the sample mean. We demonstrate that even when the total variance is preserved, the sample variance decays in the later layers through an analytical calculation in the limit of infinite network width, and numerical simulations for finite width. We show that Batch Normalization eliminates this decay and provide empirical evidence that preserving the sample variance instead of only the total variance at initialization time can have an impact on the training dynamics of a deep network.

التعلم الآلي التعلم الالي

Training Learned Optimizers with Randomly Initialized Learned Optimizers

240 - Luke Metz , C. Daniel Freeman , Niru Maheswaranathan 2021

Learned optimizers are increasingly effective, with performance exceeding that of hand designed optimizers such as Adam~citep{kingma2014adam} on specific tasks citep{metz2019understanding}. Despite the potential gains available, in current work the m eta-training (or `outer-training) of the learned optimizer is performed by a hand-designed optimizer, or by an optimizer trained by a hand-designed optimizer citep{metz2020tasks}. We show that a population of randomly initialized learned optimizers can be used to train themselves from scratch in an online fashion, without resorting to a hand designed optimizer in any part of the process. A form of population based training is used to orchestrate this self-training. Although the randomly initialized optimizers initially make slow progress, as they improve they experience a positive feedback loop, and become rapidly more effective at training themselves. We believe feedback loops of this type, where an optimizer improves itself, will be important and powerful in the future of machine learning. These methods not only provide a path towards increased performance, but more importantly relieve research and engineering effort.

التعلم الآلي الحوسبة العصبية والتطورية

Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks

177 - Boris Ginsburg , Patrice Castonguay , Oleksii Hrinchuk 2019

We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and langua ge modeling, it performs on par or better than well tuned SGD with momentum and Adam or AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large batch setting, and (3) has two times smaller memory footprint than Adam.

التعلم الآلي التعلم الالي

Deep Networks with Adaptive Nystrom Approximation

73 - Luc Giffon , Stephane Ayache (QARMA 2019

Recent work has focused on combining kernel methods and deep learning to exploit the best of the two approaches. Here, we introduce a new architecture of neural networks in which we replace the top dense layers of standard convolutional architectures with an approximation of a kernel function by relying on the Nystr{o}m approximation. Our approach is easy and highly flexible. It is compatible with any kernel function and it allows exploiting multiple kernels. We show that our architecture has the same performance than standard architecture on datasets like SVHN and CIFAR100. One benefit of the method lies in its limited number of learnable parameters which makes it particularly suited for small training set sizes, e.g. from 5 to 20 samples per class.

التعلم الآلي التعلم الالي

On the Expressive Power of Deep Polynomial Neural Networks

348 - Joe Kileel , Matthew Trager , Joan Bruna 2019

We study deep neural networks with polynomial activations, particularly their expressive power. For a fixed architecture and activation degree, a polynomial neural network defines an algebraic map from weights to polynomials. The image of this map is the functional space associated to the network, and it is an irreducible algebraic variety upon taking closure. This paper proposes the dimension of this variety as a precise measure of the expressive power of polynomial neural networks. We obtain several theoretical results regarding this dimension as a function of architecture, including an exact formula for high activation degrees, as well as upper and lower bounds on layer widths in order for deep polynomials networks to fill the ambient functional space. We also present computational evidence that it is profitable in terms of expressiveness for layer widths to increase monotonically and then decrease monotonically. Finally, we link our study to favorable optimization properties when training weights, and we draw intriguing connections with tensor and polynomial decompositions.

التعلم الآلي الحوسبة العصبية والتطورية الهندسة الجبرية