Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

The Implicit Bias of Gradient Descent on Separable Data

248 0 0.0 ( 0 )

Download Cite

Added by Daniel Soudry

Publication date 2017

fields Mathematical Statistics Informatics Engineering

and research's language is English

Authors Daniel Soudry - Elad Hoffer - Mor Shpigel Nacson

Machine Learning Machine Learning

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution. The result also generalizes to other monotone decreasing loss functions with an infimum at infinity, to multi-class problems, and to training a weight layer in a deep network in a certain restricted setting. Furthermore, we show this convergence is very slow, and only logarithmic in the convergence of the loss itself. This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and, as we show, even if the validation loss increases. Our methodology can also aid in understanding implicit regularization n more complex models and with other optimization methods.

rate research

Inductive Bias of Gradient Descent based Adversarial Training on Separable Data

278 - Yan Li , Ethan X.Fang , Huan Xu 2019

Adversarial training is a principled approach for training robust neural networks. Despite of tremendous successes in practice, its theoretical properties still remain largely unexplored. In this paper, we provide new theoretical insights of gradient descent based adversarial training by studying its computational properties, specifically on its inductive bias. We take the binary classification task on linearly separable data as an illustrative example, where the loss asymptotically attains its infimum as the parameter diverges to infinity along certain directions. Specifically, we show that when the adversarial perturbation during training has bounded $ell_2$-norm, the classifier learned by gradient descent based adversarial training converges in direction to the maximum $ell_2$-norm margin classifier at the rate of $tilde{mathcal{O}}(1/sqrt{T})$, significantly faster than the rate $mathcal{O}(1/log T)$ of training with clean data. In addition, when the adversarial perturbation during training has bounded $ell_q$-norm for some $qge 1$, the resulting classifier converges in direction to a maximum mixed-norm margin classifier, which has a natural interpretation of robustness, as being the maximum $ell_2$-norm margin classifier under worst-case $ell_q$-norm perturbation to the data. Our findings provide theoretical backups for adversarial training that it indeed promotes robustness against adversarial perturbation.

Machine Learning Machine Learning

On the geometry of Stein variational gradient descent

311 - A. Duncan , N. Nuesken , L. Szpruch 2019

Bayesian inference problems require sampling or approximating high-dimensional probability distributions. The focus of this paper is on the recently introduced Stein variational gradient descent methodology, a class of algorithms that rely on iterated steepest descent steps with respect to a reproducing kernel Hilbert space norm. This construction leads to interacting particle systems, the mean-field limit of which is a gradient flow on the space of probability distributions equipped with a certain geometrical structure. We leverage this viewpoint to shed some light on the convergence properties of the algorithm, in particular addressing the problem of choosing a suitable positive definite kernel function. Our analysis leads us to considering certain nondifferentiable kernels with adjusted tails. We demonstrate significant performs gains of these in various numerical experiments.

Machine Learning Machine Learning Analysis of PDEs

Learning to Learn without Gradient Descent by Gradient Descent

276 - Yutian Chen , Matthew W. Hoffman , Sergio Gomez Colmenarejo 2016

We learn recurrent neural network optimizers trained on simple synthetic functions by gradient descent. We show that these learned optimizers exhibit a remarkable degree of transfer in that they can be used to efficiently optimize a broad range of derivative-free black-box functions, including Gaussian process bandits, simple control objectives, global optimization benchmarks and hyper-parameter tuning tasks. Up to the training horizon, the learned optimizers learn to trade-off exploration and exploitation, and compare favourably with heavily engineered Bayesian optimization packages for hyper-parameter tuning.

Machine Learning Machine Learning

Stein Variational Gradient Descent Without Gradient

239 - Jun Han , Qiang Liu 2018

Stein variational gradient decent (SVGD) has been shown to be a powerful approximate inference algorithm for complex distributions. However, the standard SVGD requires calculating the gradient of the target density and cannot be applied when the gradient is unavailable. In this work, we develop a gradient-free variant of SVGD (GF-SVGD), which replaces the true gradient with a surrogate gradient, and corrects the induced bias by re-weighting the gradients in a proper form. We show that our GF-SVGD can be viewed as the standard SVGD with a special choice of kernel, and hence directly inherits the theoretical properties of SVGD. We shed insights on the empirical choice of the surrogate gradient and propose an annealed GF-SVGD that leverages the idea of simulated annealing to improve the performance on high dimensional complex distributions. Empirical studies show that our method consistently outperforms a number of recent advanced gradient-free MCMC methods.

Machine Learning Machine Learning

Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss

361 - Lenaic Chizat 2020

Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In presence of hidden low-dimensional structures, the resulting margin is independent of the ambiant dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such an adaptivity. Our analysis of training is non-quantitative in terms of running time but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activation and confirm the statistical benefits of this implicit bias.

Optimization and Control Machine Learning Machine Learning

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

The Implicit Bias of Gradient Descent on Separable Data

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions