ﻻ يوجد ملخص باللغة العربية
The choice of step-size used in Stochastic Gradient Descent (SGD) optimization is empirically selected in most training procedures. Moreover, the use of scheduled learning techniques such as Step-Decaying, Cyclical-Learning, and Warmup to tune the step-size requires extensive practical experience--offering limited insight into how the parameters update--and is not consistent across applications. This work attempts to answer a question of interest to both researchers and practitioners, namely textit{how much knowledge is gained in iterative training of deep neural networks?} Answering this question introduces two useful metrics derived from the singular values of the low-rank factorization of convolution layers in deep neural networks. We introduce the notions of textit{knowledge gain} and textit{mapping condition} and propose a new algorithm called Adaptive Scheduling (AdaS) that utilizes these derived metrics to adapt the SGD learning rate proportionally to the rate of change in knowledge gain over successive iterations. Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training. Code is available at url{https://github.com/mahdihosseini/AdaS}.
We present an adaptive stochastic variance reduced method with an implicit approach for adaptivity. As a variant of SARAH, our method employs the stochastic recursive gradient yet adjusts step-size based on local geometry. We provide convergence guar
We present the remote stochastic gradient (RSG) method, which computes the gradients at configurable remote observation points, in order to improve the convergence rate and suppress gradient noise at the same time for different curvatures. RSG is fur
Artificial neural networks (ANNs) are typically highly nonlinear systems which are finely tuned via the optimization of their associated, non-convex loss functions. Typically, the gradient of any such loss function fails to be dissipative making the
We investigate 1) the rate at which refined properties of the empirical risk---in particular, gradients---converge to their population counterparts in standard non-convex learning tasks, and 2) the consequences of this convergence for optimization. O
In this paper, we consider the problem of real-time transmission scheduling over time-varying channels. We first formulate the transmission scheduling problem as a Markov decision process (MDP) and systematically unravel the structural properties (e.