Learning to Initialize Gradient Descent Using Gradient Descent

357 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Kartik Ahuja

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية الاحصاء الرياضي

والبحث باللغة English

تأليف Kartik Ahuja - Amit Dhurandhar - Kush R. Varshney

التعلم الآلي التعلم الالي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Non-convex optimization problems are challenging to solve; the success and computational expense of a gradient descent algorithm or variant depend heavily on the initialization strategy. Often, either random initialization is used or initialization rules are carefully designed by exploiting the nature of the problem class. As a simple alternative to hand-crafted initialization rules, we propose an approach for learning good initialization rules from previous solutions. We provide theoretical guarantees that establish conditions that are sufficient in all cases and also necessary in some under which our approach performs better than random initialization. We apply our methodology to various non-convex problems such as generating adversarial examples, generating post hoc explanations for black-box machine learning models, and allocating communication spectrum, and show consistent gains over other initialization techniques.

قيم البحث

276 - Yutian Chen , Matthew W. Hoffman , Sergio Gomez Colmenarejo 2016

We learn recurrent neural network optimizers trained on simple synthetic functions by gradient descent. We show that these learned optimizers exhibit a remarkable degree of transfer in that they can be used to efficiently optimize a broad range of de rivative-free black-box functions, including Gaussian process bandits, simple control objectives, global optimization benchmarks and hyper-parameter tuning tasks. Up to the training horizon, the learned optimizers learn to trade-off exploration and exploitation, and compare favourably with heavily engineered Bayesian optimization packages for hyper-parameter tuning.

التعلم الالي التعلم الآلي

Gradient Descent: The Ultimate Optimizer

75 - Kartik Chandra , Erik Meijer , Samantha Andow 2019

Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizers hyperparameters, such as the learning rate. There exist many techniques for automated hyperparameter optimization, but they typically introd uce even more hyperparameters to control the hyperparameter optimization process. We propose to instead learn the hyperparameters themselves by gradient descent, and furthermore to learn the hyper-hyperparameters by gradient descent as well, and so on ad infinitum. As these towers of gradient-based optimizers grow, they become significantly less sensitive to the choice of top-level hyperparameters, hence decreasing the burden on the user to search for optimal values.

التعلم الآلي التعلم الالي

Power Gradient Descent

163 - Marco Baiesi 2019

The development of machine learning is promoting the search for fast and stable minimization algorithms. To this end, we suggest a change in the current gradient descent methods that should speed up the motion in flat regions and slow it down in stee p directions of the function to minimize. It is based on a power gradient, in which each component of the gradient is replaced by its versus-preserving $H$-th power, with $0<H<1$. We test three modern gradient descent methods fed by such variant and by standard gradients, finding the new version to achieve significantly better performances for the Nesterov accelerated gradient and AMSGrad. We also propose an effective new take on the ADAM algorithm, which includes power gradients with varying $H$.

التعلم الآلي الميكانيكا الإحصائية الأداء

Meta-Learning with Warped Gradient Descent

89 - Sebastian Flennerhag , Andrei A. Rusu , Razvan Pascanu andn Francesco Visin 2019

Learning an efficient update rule from data that promotes rapid learning of new tasks from the same distribution remains an open problem in meta-learning. Typically, previous works have approached this issue either by attempting to train a neural net work that directly produces updates or by attempting to learn better initialisations or scaling factors for a gradient-based update rule. Both of these approaches pose challenges. On one hand, directly producing an update forgoes a useful inductive bias and can easily lead to non-converging behaviour. On the other hand, approaches that try to control a gradient-based update rule typically resort to computing gradients through the learning process to obtain their meta-gradients, leading to methods that can not scale beyond few-shot task adaptation. In this work, we propose Warped Gradient Descent (WarpGrad), a method that intersects these approaches to mitigate their limitations. WarpGrad meta-learns an efficiently parameterised preconditioning matrix that facilitates gradient descent across the task distribution. Preconditioning arises by interleaving non-linear layers, referred to as warp-layers, between the layers of a task-learner. Warp-layers are meta-learned without backpropagating through the task training process in a manner similar to methods that learn to directly produce updates. WarpGrad is computationally efficient, easy to implement, and can scale to arbitrarily large meta-learning problems. We provide a geometrical interpretation of the approach and evaluate its effectiveness in a variety of settings, including few-shot, standard supervised, continual and reinforcement learning.

التعلم الآلي الحوسبة العصبية والتطورية التعلم الالي

Beyond Gradient Descent for Regularized Segmentation Losses

133 - Dmitrii Marin , Meng Tang , Ismail Ben Ayed 2018

The simplicity of gradient descent (GD) made it the default method for training ever-deeper and complex neural networks. Both loss functions and architectures are often explicitly tuned to be amenable to this basic local optimization. In the context of weakly-supervised CNN segmentation, we demonstrate a well-motivated loss function where an alternative optimizer (ADM) achieves the state-of-the-art while GD performs poorly. Interestingly, GD obtains its best result for a smoother tuning of the loss function. The results are consistent across different network architectures. Our loss is motivated by well-understood MRF/CRF regularization models in shallow segmentation and their known global solvers. Our work suggests that network design/training should pay more attention to optimization methods.

التعلم الآلي التعلم الالي