The slow convergence rate and pathological curvature issues of first-order gradient methods for training deep neural networks initiated an ongoing effort for developing faster $\mathit{second}$-$\mathit{order}$ optimization algorithms beyond SGD, without compromising the generalization error. Despite their remarkable convergence rate ($\mathit{independent}$ of the training batch size $n$), second-order algorithms incur a daunting slowdown in the $\mathit{cost}$ $\mathit{per}$ $\mathit{iteration}$ (inverting the Hessian matrix of the loss function), which renders them impractical. Very recently, this computational overhead was mitigated by the works of [ZMG19, CGH+19], yielding an $O(mn^2)$-time second-order algorithm for training two-layer overparametrized neural networks of polynomial width $m$. We show how to speed up the algorithm of [CGH+19], achieving an $\tilde{O}(mn)$-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension ($mn$) of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an $\ell_2$-regression problem, and then use a Fast-JL type dimension reduction to $\mathit{precondition}$ the underlying Gram matrix in time independent of $M$, allowing us to find a sufficiently good approximate solution via $\mathit{first}$-$\mathit{order}$ conjugate gradient. Our result provides a proof-of-concept that advanced machinery from randomized linear algebra -- which led to recent breakthroughs in $\mathit{convex}$ $\mathit{optimization}$ (ERM, LPs, Regression) -- can be carried over to the realm of deep learning as well.
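To make the "sketch-to-precondition, then solve by first-order conjugate gradient" idea concrete, here is a minimal NumPy sketch of an $\ell_2$-regression solver built around a randomized preconditioner. It is only an illustration under simplifying assumptions: a dense Gaussian sketch stands in for the Fast-JL transform, the matrix is assumed tall and full column rank, and the function name and parameters are invented for this example; it is not the algorithm of [CGH+19].

```python
import numpy as np

def sketch_preconditioned_lsq(J, g, sketch_rows=None, tol=1e-10, max_iter=200):
    """Solve min_x ||J x - g||_2 by conjugate gradient on the normal equations,
    preconditioned via a randomized sketch of J (assumes J is tall, full rank)."""
    m, n = J.shape
    s = sketch_rows or min(m, 4 * n)

    # 1. Sketch: compress J's rows so a QR factorization becomes cheap.
    #    (A Gaussian sketch is used here in place of a Fast-JL / SRHT transform.)
    S = np.random.randn(s, m) / np.sqrt(s)
    _, R = np.linalg.qr(S @ J)                 # R is n x n, upper triangular

    # 2. Preconditioned operator: with high probability J @ inv(R) is
    #    well-conditioned, so CG needs only a few iterations.
    def apply_A(y):
        x = np.linalg.solve(R, y)              # R^{-1} y
        return np.linalg.solve(R.T, J.T @ (J @ x))

    b = np.linalg.solve(R.T, J.T @ g)

    # 3. Plain conjugate gradient on the preconditioned normal equations.
    y = np.zeros(n)
    r = b - apply_A(y)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rs / (p @ Ap)
        y += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return np.linalg.solve(R, y)               # map back: x = R^{-1} y
```

The design point this sketch captures is that the expensive step (a direct factorization) is performed only on the small sketched matrix, while the original matrix is touched solely through matrix-vector products inside the first-order CG loop.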
We introduce a pruning algorithm that provably sparsifies the parameters of a trained model in a way that approximately preserves the model's predictive accuracy. Our algorithm uses a small batch of input points to construct a data-informed importance
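As a rough illustration of data-informed, importance-based pruning in general, the sketch below scores each weight by a batch-averaged saliency and keeps only the top fraction. The saliency score, the top-k selection, and the function name are assumptions made for illustration; this is not the sampling-based procedure of the cited paper.

```python
import numpy as np

def importance_prune(weights, per_example_grads, keep_ratio=0.1):
    """Generic data-informed pruning sketch: score weights by the
    batch-averaged |weight * gradient|, keep the top fraction, zero the rest.
    Assumes flat weight vectors and per-example gradients of matching shape."""
    saliency = np.mean(np.abs(weights * per_example_grads), axis=0)  # batch-averaged score
    k = max(1, int(keep_ratio * saliency.size))
    keep = np.argsort(saliency.ravel())[-k:]     # indices of the k highest scores
    mask = np.zeros(saliency.size, dtype=bool)
    mask[keep] = True
    return weights * mask.reshape(weights.shape)
```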
It is well-known that overparametrized neural networks trained using gradient-based methods quickly achieve small training error with appropriate hyperparameter settings. Recent papers have proved this statement theoretically for highly overparametrized
Convolutional neural networks have achieved astonishing results in different application areas. Various methods that allow us to use these models on mobile and embedded devices have been proposed. Binary neural networks in particular are a promising approach
Differentially private stochastic gradient descent (DPSGD) is a variation of stochastic gradient descent based on the Differential Privacy (DP) paradigm, which can mitigate privacy threats arising from the presence of sensitive information in training data
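For context, the standard DP-SGD mechanism (Abadi et al., 2016) clips each example's gradient to a fixed L2 norm and adds calibrated Gaussian noise before the averaged update. The sketch below illustrates that generic mechanism only; the function name and hyperparameters are chosen for illustration and are not taken from this paper.

```python
import numpy as np

def dpsgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_mult=1.0):
    """One generic DP-SGD update: per-example gradient clipping, Gaussian noise
    with std = noise_mult * clip_norm, then an averaged SGD step."""
    clipped = []
    for g in per_example_grads:                      # one flat gradient per example
        scale = min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
        clipped.append(g * scale)
    batch = len(clipped)
    noise = np.random.normal(0.0, noise_mult * clip_norm, size=params.shape)
    noisy_mean = (np.sum(clipped, axis=0) + noise) / batch
    return params - lr * noisy_mean
```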
Today's deep learning models are primarily trained on CPUs and GPUs. Although these models tend to have low error, they consume high power and use large amounts of memory owing to double-precision floating-point learning parameters. Beyond the Moore