In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with performance on unseen data are (i) the ``flatness'' of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the stepsize $\eta$ to the batch size $b$, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the ``tail-index'', which measures the heaviness of the tails of the network weights at convergence. In this paper, we argue that these three seemingly unrelated perspectives on generalization are deeply linked. We claim that, depending on the structure of the Hessian of the loss at the minimum and the choices of the algorithm parameters $\eta$ and $b$, the SGD iterates will converge to a \emph{heavy-tailed} stationary distribution. We rigorously prove this claim in the setting of quadratic optimization: we show that even in a simple linear regression problem with independent and identically distributed data whose distribution has finite moments of all orders, the iterates can be heavy-tailed with infinite variance. We further characterize the behavior of the tails with respect to the algorithm parameters, the dimension, and the curvature. We then translate our results into insights about the behavior of SGD in deep learning. We support our theory with experiments conducted on synthetic data and on fully connected and convolutional neural networks.
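To make the quadratic setting concrete, the following is a minimal simulation sketch (not the authors' code): constant-stepsize SGD on a one-dimensional least-squares problem with Gaussian data, followed by a standard Hill-type estimate of the tail index of the post-burn-in iterates. The stepsize/batch-size pairs, the noise level, and the estimator parameters are illustrative assumptions; the qualitative point is that a larger ratio $\eta/b$ tends to produce a smaller (heavier-tailed) index.

```python
# Illustrative sketch (not the authors' code): constant-stepsize SGD on a 1-d
# linear regression with i.i.d. Gaussian inputs.  The error follows the
# multiplicative recursion
#   a_{t+1} - a* = (a_t - a*) * (1 - eta * mean(x^2)) + eta * mean(x * noise),
# so with a large eta/b ratio the stationary iterates can become heavy-tailed.
import numpy as np

rng = np.random.default_rng(0)

def run_sgd(eta, b, n_iter=100_000, burn_in=20_000):
    """Return |a_t - a*| after burn-in for SGD on y = a* x + 0.1 * N(0, 1), a* = 1."""
    a_true, a = 1.0, 0.0
    samples = []
    for t in range(n_iter):
        x = rng.standard_normal(b)               # fresh mini-batch of inputs
        y = a_true * x + 0.1 * rng.standard_normal(b)
        grad = np.mean((a * x - y) * x)          # gradient of 0.5 * mean((a x - y)^2)
        a -= eta * grad
        if t >= burn_in:
            samples.append(abs(a - a_true))
    return np.asarray(samples)

def hill_estimator(samples, k=1_000):
    """Hill estimate of the tail index from the k largest order statistics."""
    top = np.sort(samples)[::-1][: k + 1]
    return k / np.sum(np.log(top[:k] / top[k]))

# Small eta/b (light tails, large estimated index) vs. large eta/b (heavy tails,
# index typically below 2, i.e. infinite variance).  Parameter values are
# illustrative only.
for eta, b in [(0.1, 10), (1.5, 1)]:
    alpha = hill_estimator(run_sgd(eta, b))
    print(f"eta={eta}, b={b}: estimated tail index ~ {alpha:.2f}")
```

With these illustrative settings, the second configuration typically yields an estimated index below 2, consistent with the infinite-variance regime described above, while the first stays much larger.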