Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training


Published by Shiwei Liu

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Authors: Shiwei Liu, Lu Yin, Decebal Constantin Mocanu

Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition


Abstract

In this paper, we introduce a new perspective on training deep neural networks capable of state-of-the-art performance without the need for the expensive over-parameterization by proposing the concept of In-Time Over-Parameterization (ITOP) in sparse training. By starting from a random sparse network and continuously exploring sparse connectivities during training, we can perform an Over-Parameterization in the space-time manifold, closing the gap in the expressibility between sparse training and dense training. We further use ITOP to understand the underlying mechanism of Dynamic Sparse Training (DST) and indicate that the benefits of DST come from its ability to consider across time all possible parameters when searching for the optimal sparse connectivity. As long as there are sufficient parameters that have been reliably explored during training, DST can outperform the dense neural network by a large margin. We present a series of experiments to support our conjecture and achieve the state-of-the-art sparse training performance with ResNet-50 on ImageNet. More impressively, our method achieves dominant performance over the overparameterization-based sparse methods at extreme sparsity levels. When trained on CIFAR-100, our method can match the performance of the dense model even at an extreme sparsity (98%). Code can be found at https://github.com/Shiweiliuiiiiiii/In-Time-Over-Parameterization.
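The idea of exploring parameters across time can be illustrated with a toy prune-and-grow loop. The sketch below is not the authors' implementation: the function name `dynamic_sparse_exploration`, the random pruning rule, and the "explored fraction" bookkeeping are assumptions made for illustration only (a real DST method would typically prune by a criterion such as weight magnitude and train the weights between updates). It only shows how a network that is 98% sparse at every step can still touch a much larger share of the dense parameter space over the course of training.

```python
import random


def dynamic_sparse_exploration(n_params, sparsity, prune_frac, n_updates, seed=0):
    """Toy prune-and-grow loop; returns the fraction of parameters ever activated.

    Hypothetical helper for illustration, not the ITOP code base.
    """
    rng = random.Random(seed)
    n_active = max(1, round(n_params * (1.0 - sparsity)))
    # Start from a random sparse connectivity pattern.
    active = set(rng.sample(range(n_params), n_active))
    explored = set(active)
    history = [len(explored) / n_params]

    for _ in range(n_updates):
        # Number of connections swapped per update, capped by what is inactive.
        n_swap = min(max(1, int(prune_frac * n_active)), n_params - n_active)
        # Prune: drop a random subset (assumption: stand-in for a real criterion).
        pruned = set(rng.sample(sorted(active), n_swap))
        active -= pruned
        # Grow: activate the same number of connections that were inactive
        # before this update, so the sparsity level stays constant.
        candidates = [i for i in range(n_params)
                      if i not in active and i not in pruned]
        grown = set(rng.sample(candidates, n_swap))
        active |= grown
        explored |= grown
        history.append(len(explored) / n_params)

    return history


history = dynamic_sparse_exploration(
    n_params=1000, sparsity=0.98, prune_frac=0.3, n_updates=100
)
print(f"active at any time: {1 - 0.98:.0%}, explored over time: {history[-1]:.0%}")
```

The number of active connections never changes, yet the explored fraction only grows, which is the sense in which sparse training can be over-parameterized in time rather than in space.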


Sharan Vaswani, Issam Laradji, Frederik Kunstner (2020)

Adaptive gradient methods are typically used for training over-parameterized models. To better understand their behaviour, we study a simplistic setting -- smooth, convex losses with models over-parameterized enough to interpolate the data. In this setting, we prove that AMSGrad with constant step-size and momentum converges to the minimizer at a faster $O(1/T)$ rate. When interpolation is only approximately satisfied, constant step-size AMSGrad converges to a neighbourhood of the solution at the same rate, while AdaGrad is robust to the violation of interpolation. However, even for simple convex problems satisfying interpolation, the empirical performance of both methods heavily depends on the step-size and requires tuning, questioning their adaptivity. We alleviate this problem by automatically determining the step-size using stochastic line-search or Polyak step-sizes. With these techniques, we prove that both AdaGrad and AMSGrad retain their convergence guarantees, without needing to know problem-dependent constants. Empirically, we demonstrate that these techniques improve the convergence and generalization of adaptive gradient methods across tasks, from binary classification with kernel mappings to multi-class classification with deep networks.

Machine Learning, Optimization and Control
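The Polyak step-size mentioned in the abstract above can be sketched on a one-dimensional least-squares problem that satisfies interpolation. This is a simplified illustration, not the paper's algorithm: it uses plain SGD rather than AdaGrad or AMSGrad, and the function name `sgd_polyak` and the toy data are assumptions. Under interpolation every per-sample loss has minimum value 0, so the step $(f_i(w) - f_i^*) / \|\nabla f_i(w)\|^2$ needs no tuned constant.

```python
import random


def sgd_polyak(data, w0=0.0, n_steps=200, seed=0):
    """SGD on f_i(w) = 0.5 * (w * x_i - y_i)^2 with a stochastic Polyak step-size.

    Hypothetical helper for illustration; assumes each f_i has minimum value 0.
    """
    rng = random.Random(seed)
    w = w0
    for _ in range(n_steps):
        x, y = rng.choice(data)
        residual = w * x - y
        loss = 0.5 * residual ** 2
        grad = residual * x
        if grad == 0.0:
            continue  # this sample is already fit exactly
        # Polyak step-size with f_i^* = 0 (interpolation assumption).
        step = loss / (grad ** 2)
        w -= step * grad
    return w


# Toy data generated by y = 3 * x, so a single w interpolates every sample.
data = [(1.0, 3.0), (2.0, 6.0), (-1.5, -4.5)]
w_hat = sgd_polyak(data)
print(w_hat)
```

On this problem each update halves the distance to the interpolating solution regardless of which sample is drawn, which is why no step-size tuning is needed here.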

A Convergence Theory for Deep Learning via Over-Parameterization

Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song (2018)

Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural networks with one hidden layer. The theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find $\textit{global minima}$ on the training objective of DNNs in $\textit{polynomial time}$. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: $\textit{polynomial}$ in $L$, the number of layers and in $n$, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100% training accuracy in classification tasks, or minimize regression loss in linear convergence speed, with running time polynomial in $n,L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss functions. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).

Machine Learning, Data Structures and Algorithms, Neural and Evolutionary Computing

On Connectivity of Solutions in Deep Learning: The Role of Over-parameterization and Feature Quality

Quynh Nguyen, Pierre Brechet, Marco Mondelli (2021)

It has been empirically observed that, in deep neural networks, the solutions found by stochastic gradient descent from different random initializations can be often connected by a path with low loss. Recent works have shed light on this intriguing phenomenon by assuming either the over-parameterization of the network or the dropout stability of the solutions. In this paper, we reconcile these two views and present a novel condition for ensuring the connectivity of two arbitrary points in parameter space. This condition is provably milder than dropout stability, and it provides a connection between the problem of finding low-loss paths and the memorization capacity of neural nets. This last point brings about a trade-off between the quality of features at each layer and the over-parameterization of the network. As an extreme example of this trade-off, we show that (i) if subsets of features at each layer are linearly separable, then almost no over-parameterization is needed, and (ii) under generic assumptions on the features at each layer, it suffices that the last two hidden layers have $\Omega(\sqrt{N})$ neurons, $N$ being the number of samples. Finally, we provide experimental evidence demonstrating that the presented condition is satisfied in practical settings even when dropout stability does not hold.

Machine Learning

ExpandNets: Linear Over-parameterization to Train Compact Convolutional Networks

Shuxuan Guo, Jose M. Alvarez, Mathieu Salzmann (2018)

We introduce an approach to training a given compact network. To this end, we leverage over-parameterization, which typically improves both neural network optimization and generalization. Specifically, we propose to expand each linear layer of the compact network into multiple consecutive linear layers, without adding any nonlinearity. As such, the resulting expanded network, or ExpandNet, can be contracted back to the compact one algebraically at inference. In particular, we introduce two convolutional expansion strategies and demonstrate their benefits on several tasks, including image classification, object detection, and semantic segmentation. As evidenced by our experiments, our approach outperforms both training the compact network from scratch and performing knowledge distillation from a teacher. Furthermore, our linear over-parameterization empirically reduces gradient confusion during training and improves the network generalization.

Computer Vision and Pattern Recognition, Machine Learning
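The algebraic contraction described in the ExpandNets abstract above can be checked on a fully-connected layer. The sketch below is an assumption-laden illustration, not the paper's convolutional expansion strategies: the helper names, layer sizes, and random weights are all hypothetical. It expands a 3-to-2 linear map into two consecutive linear maps through a hidden width of 8 with no nonlinearity, then contracts them back into a single matrix by multiplication.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(a, x):
    """Matrix-vector product."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
w1 = random_matrix(8, 3, rng)  # expanded layer 1: 3 inputs -> 8 hidden units
w2 = random_matrix(2, 8, rng)  # expanded layer 2: 8 hidden units -> 2 outputs

# Contraction: with no nonlinearity in between, the two layers collapse
# into one 2x3 matrix that can be used at inference.
w_compact = matmul(w2, w1)

x = [0.5, -1.0, 2.0]
y_expanded = matvec(w2, matvec(w1, x))
y_compact = matvec(w_compact, x)
max_gap = max(abs(a - b) for a, b in zip(y_expanded, y_compact))
print(max_gap)
```

The expanded form has 8·3 + 2·8 = 40 trainable weights against 6 in the compact layer, while both compute the same function up to floating-point rounding.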

Higher-order Processes with Parameterization over Names and Processes

Xian Xu (2016)

Parameterization extends higher-order processes with the capability of abstraction and application (like those in lambda-calculus). This extension is strict, i.e., higher-order processes equipped with parameterization are computationally more powerful. This paper studies higher-order processes with two kinds of parameterization: one on names and the other on processes themselves. We present two results. The first is that, in the presence of parameterization, higher-order processes can encode first-order (name-passing) processes quite neatly, in contrast to the fact that higher-order processes without parameterization cannot encode first-order processes at all. The second is a simpler characterization of the (standard) context bisimulation for higher-order processes with parameterization, in terms of the normal bisimulation that stems from the well-known normal characterization for higher-order calculus. These two results further demonstrate what the parameterization method contributes to expressiveness and behavioural equivalence in the higher-order paradigm.

Logic in Computer Science