Weight Update Skipping: Reducing Training Time for Artificial Neural Networks

95 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Pooneh Safayenikoo

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Pooneh Safayenikoo - Ismail Akturk

التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Artificial Neural Networks (ANNs) are known as state-of-the-art techniques in Machine Learning (ML) and have achieved outstanding results in data-intensive applications, such as recognition, classification, and segmentation. These networks mostly use deep layers of convolution or fully connected layers with many filters in each layer, demanding a large amount of data and tunable hyperparameters to achieve competitive accuracy. As a result, storage, communication, and computational costs of training (in particular training time) become limiting factors to scale them up. In this paper, we propose a new training methodology for ANNs that exploits the observation of improvement of accuracy shows temporal variations which allow us to skip updating weights when the variation is minuscule. During such time windows, we keep updating bias which ensures the network still trains and avoids overfitting; however, we selectively skip updating weights (and their time-consuming computations). Such a training approach virtually achieves the same accuracy with considerably less computational cost, thus lower training time. We propose two methods for updating weights and evaluate them by analyzing four state-of-the-art models, AlexNet, VGG-11, VGG-16, ResNet-18 on CIFAR datasets. On average, our two proposed methods called WUS and WUS+LR reduced the training time (compared to the baseline) by 54%, and 50%, respectively on CIFAR-10; and 43% and 35% on CIFAR-100, respectively.

قيم البحث

63 - Li Xiao , Zeliang Zhang , Yijie Peng 2021

Adding noises to artificial neural network(ANN) has been shown to be able to improve robustness in previous work. In this work, we propose a new technique to compute the pathwise stochastic gradient estimate with respect to the standard deviation of the Gaussian noise added to each neuron of the ANN. By our proposed technique, the gradient estimate with respect to noise levels is a byproduct of the backpropagation algorithm for estimating gradient with respect to synaptic weights in ANN. Thus, the noise level for each neuron can be optimized simultaneously in the processing of training the synaptic weights at nearly no extra computational cost. In numerical experiments, our proposed method can achieve significant performance improvement on robustness of several popular ANN structures under both black box and white box attacks tested in various computer vision datasets.

التعلم الآلي

Training (Overparametrized) Neural Networks in Near-Linear Time

96 - Jan van den Brand , Binghui Peng , Zhao Song 2020

The slow convergence rate and pathological curvature issues of first-order gradient methods for training deep neural networks, initiated an ongoing effort for developing faster $mathit{second}$-$mathit{order}$ optimization algorithms beyond SGD, with out compromising the generalization error. Despite their remarkable convergence rate ($mathit{independent}$ of the training batch size $n$), second-order algorithms incur a daunting slowdown in the $mathit{cost}$ $mathit{per}$ $mathit{iteration}$ (inverting the Hessian matrix of the loss function), which renders them impractical. Very recently, this computational overhead was mitigated by the works of [ZMG19,CGH+19}, yielding an $O(mn^2)$-time second-order algorithm for training two-layer overparametrized neural networks of polynomial width $m$. We show how to speed up the algorithm of [CGH+19], achieving an $tilde{O}(mn)$-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension ($mn$) of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an $ell_2$-regression problem, and then use a Fast-JL type dimension reduction to $mathit{precondition}$ the underlying Gram matrix in time independent of $M$, allowing to find a sufficiently good approximate solution via $mathit{first}$-$mathit{order}$ conjugate gradient. Our result provides a proof-of-concept that advanced machinery from randomized linear algebra -- which led to recent breakthroughs in $mathit{convex}$ $mathit{optimization}$ (ERM, LPs, Regression) -- can be carried over to the realm of deep learning as well.

التعلم الآلي بنى وهياكل البيانات والخوارزميات التعلم الالي

Sobolev Training for Neural Networks

188 - Wojciech Marian Czarnecki , Simon Osindero , Max Jaderberg 2017

At the heart of deep learning we aim to use neural networks as function approximators - training them to produce outputs from inputs in emulation of a ground truth function or data creation process. In many cases we only have access to input-output p airs from the ground truth, however it is becoming more common to have access to derivatives of the target output with respect to the input - for example when the ground truth function is itself a neural network such as in network compression or distillation. Generally these target derivatives are not computed, or are ignored. This paper introduces Sobolev Training for neural networks, which is a method for incorporating these target derivatives in addition the to target values while training. By optimising neural networks to not only approximate the functions outputs but also the functions derivatives we encode additional information about the target function within the parameters of the neural network. Thereby we can improve the quality of our predictors, as well as the data-efficiency and generalization capabilities of our learned function approximation. We provide theoretical justifications for such an approach as well as examples of empirical evidence on three distinct domains: regression on classical optimisation datasets, distilling policies of an agent playing Atari, and on large-scale applications of synthetic gradients. In all three domains the use of Sobolev Training, employing target derivatives in addition to target values, results in models with higher accuracy and stronger generalisation.

التعلم الآلي

Tailoring Artificial Neural Networks for Optimal Learning

452 - Pau Vilimelis Aceituno , Yan Gang , Yang-Yu Liu 2017

As one of the most important paradigms of recurrent neural networks, the echo state network (ESN) has been applied to a wide range of fields, from robotics to medicine, finance, and language processing. A key feature of the ESN paradigm is its reserv oir --- a directed and weighted network of neurons that projects the input time series into a high dimensional space where linear regression or classification can be applied. Despite extensive studies, the impact of the reservoir network on the ESN performance remains unclear. Combining tools from physics, dynamical systems and network science, we attempt to open the black box of ESN and offer insights to understand the behavior of general artificial neural networks. Through spectral analysis of the reservoir network we reveal a key factor that largely determines the ESN memory capacity and hence affects its performance. Moreover, we find that adding short loops to the reservoir network can tailor ESN for specific tasks and optimize learning. We validate our findings by applying ESN to forecast both synthetic and real benchmark time series. Our results provide a new way to design task-specific ESN. More importantly, it demonstrates the power of combining tools from physics, dynamical systems and network science to offer new insights in understanding the mechanisms of general artificial neural networks.

التعلم الآلي الحوسبة العصبية والتطورية

Variability of Artificial Neural Networks

249 - Yin Zhang , Yueyao Yu 2021

What makes an artificial neural network easier to train and more likely to produce desirable solutions than other comparable networks? In this paper, we provide a new angle to study such issues under the setting of a fixed number of model parameters which in general is the most dominant cost factor. We introduce a notion of variability and show that it correlates positively to the activation ratio and negatively to a phenomenon called {Collapse to Constants} (or C2C), which is closely related but not identical to the phenomenon commonly known as vanishing gradient. Experiments on a styled model problem empirically verify that variability is indeed a key performance indicator for fully connected neural networks. The insights gained from this variability study will help the design of new and effective neural network architectures.

التعلم الآلي