This work is superseded by the paper at arXiv:2011.14066. Stochastic gradient descent (SGD) is the de facto algorithm for training deep neural networks (DNNs). Despite its popularity, it still requires fine-tuning to achieve its best performance. This has led to the development of adaptive methods, which claim to automate hyper-parameter tuning. Recently, researchers have studied both algorithmic classes via toy examples: e.g., for over-parameterized linear regression, Wilson et al. (2017) show that, while SGD always converges to the minimum-norm solution, adaptive methods show no such inclination, leading to worse generalization. Our aim is to study this conjecture further. We empirically show that the minimum weight norm is not necessarily the proper gauge of good generalization in simplified scenarios, and that models found by adaptive methods can outperform those found by plain gradient methods. In practical DNN settings, we observe that adaptive methods can outperform SGD, producing models with larger weight norms, but without necessarily reducing the amount of tuning required.
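The over-parameterized linear regression comparison referenced above is easy to reproduce. Below is a minimal sketch, not taken from the paper, with assumed problem sizes, step sizes, and iteration counts, and a hand-rolled Adam update rather than any particular library's optimizer; it contrasts the norm of the interpolating solution found by plain gradient descent started from zero (which stays in the row space of the data and hence reaches the minimum-norm interpolant) with the one found by Adam.

```python
# Sketch (illustrative assumptions): over-parameterized least squares,
# gradient descent vs. an Adam-style adaptive update, both started at zero.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # fewer samples than parameters
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_min_norm = np.linalg.pinv(A) @ y   # minimum-norm interpolating solution

def grad(w):
    """Gradient of the squared loss 0.5 * ||A w - y||^2."""
    return A.T @ (A @ w - y)

# Plain gradient descent from zero stays in the row space of A,
# so it converges to the minimum-norm interpolant.
w_gd = np.zeros(d)
for _ in range(20000):
    w_gd -= 1e-3 * grad(w_gd)

# Adam (Kingma & Ba, 2015) from the same initialization.
w_adam = np.zeros(d)
m, v = np.zeros(d), np.zeros(d)
beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8
for t in range(1, 20001):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w_adam -= lr * m_hat / (np.sqrt(v_hat) + eps)

for name, w in [("min-norm", w_min_norm), ("GD", w_gd), ("Adam", w_adam)]:
    print(f"{name:9s} train residual={np.linalg.norm(A @ w - y):.2e} "
          f"||w||={np.linalg.norm(w):.3f}")
```

Under these assumptions, both methods drive the training residual to (near) zero, but only gradient descent recovers the minimum-norm solution; whether the larger-norm solution generalizes worse is exactly the question the abstract above examines.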
Recently, it was shown in several papers that backpropagation is able to find the global minimum of the empirical risk on the training data using over-parametrized deep neural networks. In this paper, a similar result is shown for deep neural networks
Recently, several studies have proven the global convergence and generalization abilities of the gradient descent method for two-layer ReLU networks. Most of these studies focused on regression problems with the squared loss function, except fo
Over-parametrization is an important technique in training neural networks. In both theory and practice, training a larger network allows the optimization algorithm to avoid bad local optima. In this paper, we study a closely related tensor
Over-parameterization and adaptive methods have played a crucial role in the success of deep learning in the last decade. The widespread use of over-parameterization has forced us to rethink generalization by bringing forth new phenomena, such as imp
Recent work has argued that neural networks can be understood theoretically by taking the number of channels to infinity, at which point the outputs become Gaussian process (GP) distributed. However, we note that infinite Bayesian neural networks lac