Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Path-SGD: Path-Normalized Optimization in Deep Neural Networks

583 0 0.0 ( 0 )

Download Cite

Added by Behnam Neyshabur

Publication date 2015

fields Informatics Engineering

and research's language is English

Authors Behnam Neyshabur - Ruslan Salakhutdinov - Nathan Srebro

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.

rate research

Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations

109 - Behnam Neyshabur , Yuhuai Wu , Ruslan Salakhutdinov 2016

We investigate the parameter-space geometry of recurrent neural networks (RNNs), and develop an adaptation of path-SGD optimization method, attuned to this geometry, that can learn plain RNNs with ReLU activations. On several datasets that require capturing long-term dependency structure, we show that path-SGD can significantly improve trainability of ReLU RNNs compared to RNNs trained with SGD, even with various recently suggested initialization schemes.

Machine Learning Neural and Evolutionary Computing

Global Sparse Momentum SGD for Pruning Very Deep Neural Networks

184 - Xiaohan Ding , Guiguang Ding , Xiangxin Zhou 2019

Deep Neural Network (DNN) is powerful but computationally expensive and memory intensive, thus impeding its practical usage on resource-constrained front-end devices. DNN pruning is an approach for deep model compression, which aims at eliminating some parameters with tolerable performance degradation. In this paper, we propose a novel momentum-SGD-based optimization method to reduce the network complexity by on-the-fly pruning. Concretely, given a global compression ratio, we categorize all the parameters into two parts at each training iteration which are updated using different rules. In this way, we gradually zero out the redundant parameters, as we update them using only the ordinary weight decay but no gradients derived from the objective function. As a departure from prior methods that require heavy human works to tune the layer-wise sparsity ratios, prune by solving complicated non-differentiable problems or finetune the model after pruning, our method is characterized by 1) global compression that automatically finds the appropriate per-layer sparsity ratios; 2) end-to-end training; 3) no need for a time-consuming re-training process after pruning; and 4) superior capability to find better winning tickets which have won the initialization lottery.

Machine Learning Computer Vision and Pattern Recognition Machine Learning

Data-Dependent Path Normalization in Neural Networks

82 - Behnam Neyshabur , Ryota Tomioka , Ruslan Salakhutdinov 2015

We propose a unified framework for neural net normalization, regularization and optimization, which includes Path-SGD and Batch-Normalization and interpolates between them across two different dimensions. Through this framework we investigate issue of invariance of the optimization, data dependence and the connection with natural gradients.

Machine Learning

Neural Path Features and Neural Path Kernel : Understanding the role of gates in deep learning

145 - Chandrashekar Lakshminarayanan , Amit Vikram Singh 2020

Rectified linear unit (ReLU) activations can also be thought of as gates, which, either pass or stop their pre-activation input when they are on (when the pre-activation input is positive) or off (when the pre-activation input is negative) respectively. A deep neural network (DNN) with ReLU activations has many gates, and the on/off status of each gate changes across input examples as well as network weights. For a given input example, only a subset of gates are active, i.e., on, and the sub-network of weights connected to these active gates is responsible for producing the output. At randomised initialisation, the active sub-network corresponding to a given input example is random. During training, as the weights are learnt, the active sub-networks are also learnt, and potentially hold very valuable information. In this paper, we analytically characterise the role of active sub-networks in deep learning. To this end, we encode the on/off state of the gates of a given input in a novel neural path feature (NPF), and the weights of the DNN are encoded in a novel neural path value (NPV). Further, we show that the output of network is indeed the inner product of NPF and NPV. The main result of the paper shows that the neural path kernel associated with the NPF is a fundamental quantity that characterises the information stored in the gates of a DNN. We show via experiments (on MNIST and CIFAR-10) that in standard DNNs with ReLU activations NPFs are learnt during training and such learning is key for generalisation. Furthermore, NPFs and NPVs can be learnt in two separate networks and such learning also generalises well in experiments.

Machine Learning Machine Learning

Compressing deep neural networks by matrix product operators

116 - Ze-Feng Gao , Song Cheng , Rong-Qiang He 2019

A deep neural network is a parametrization of a multilayer mapping of signals in terms of many alternatively arranged linear and nonlinear transformations. The linear transformations, which are generally used in the fully connected as well as convolutional layers, contain most of the variational parameters that are trained and stored. Compressing a deep neural network to reduce its number of variational parameters but not its prediction power is an important but challenging problem toward the establishment of an optimized scheme in training efficiently these parameters and in lowering the risk of overfitting. Here we show that this problem can be effectively solved by representing linear transformations with matrix product operators (MPOs), which is a tensor network originally proposed in physics to characterize the short-range entanglement in one-dimensional quantum states. We have tested this approach in five typical neural networks, including FC2, LeNet-5, VGG, ResNet, and DenseNet on two widely used data sets, namely, MNIST and CIFAR-10, and found that this MPO representation indeed sets up a faithful and efficient mapping between input and output signals, which can keep or even improve the prediction accuracy with a dramatically reduced number of parameters. Our method greatly simplifies the representations in deep learning, and opens a possible route toward establishing a framework of modern neural networks which might be simpler and cheaper, but more efficient.

Machine Learning Computer Vision and Pattern Recognition Neural and Evolutionary Computing

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Path-SGD: Path-Normalized Optimization in Deep Neural Networks

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions