ﻻ يوجد ملخص باللغة العربية
Frequency perspective recently makes progress in understanding deep learning. It has been widely verified in both empirical and theoretical studies that deep neural networks (DNNs) often fit the target function from low to high frequency, namely Frequency Principle (F-Principle). F-Principle sheds light on the strength and the weakness of DNNs and inspires a series of subsequent works, including theoretical studies, empirical studies and the design of efficient DNN structures etc. Previous works examine the F-Principle in gradient-descent-based training. It remains unclear whether gradient-descent-based training is a necessary condition for the F-Principle. In this paper, we show that the F-Principle exists stably in the training process of DNNs with non-gradient-descent-based training, including optimization algorithms with gradient information, such as conjugate gradient and BFGS, and algorithms without gradient information, such as Powells method and Particle Swarm Optimization. These empirical studies show the universality of the F-Principle and provide hints for further study of F-Principle.
We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the curren
Non-convex optimization problems are challenging to solve; the success and computational expense of a gradient descent algorithm or variant depend heavily on the initialization strategy. Often, either random initialization is used or initialization r
Adversarial training is a principled approach for training robust neural networks. Despite of tremendous successes in practice, its theoretical properties still remain largely unexplored. In this paper, we provide new theoretical insights of gradient
The simplicity of gradient descent (GD) made it the default method for training ever-deeper and complex neural networks. Both loss functions and architectures are often explicitly tuned to be amenable to this basic local optimization. In the context
We introduce a method called TracIn that computes the influence of a training example on a prediction made by the model. The idea is to trace how the loss on the test point changes during the training process whenever the training example of interest