ﻻ يوجد ملخص باللغة العربية
Lottery ticket hypothesis suggests that a dense neural network contains a sparse sub-network that can match the test accuracy of the original dense net when trained in isolation from (the same) random initialization. However, the hypothesis failed to generalize to larger dense networks such as ResNet-50. As a remedy, recent studies demonstrate that a sparse sub-network can still be obtained by using a rewinding technique, which is to re-train it from early-phase training weights or learning rates of the dense model, rather than from random initialization. Is rewinding the only or the best way to scale up lottery tickets? This paper proposes a new, simpler and yet powerful technique for re-training the sub-network, called Knowledge Distillation ticket (KD ticket). Rewinding exploits the value of inheriting knowledge from the early training phase to improve lottery tickets in large networks. In comparison, KD ticket addresses a complementary possibility - inheriting useful knowledge from the late training phase of the dense model. It is achieved by leveraging the soft labels generated by the trained dense model to re-train the sub-network, instead of the hard labels. Extensive experiments are conducted using several large deep networks (e.g ResNet-50 and ResNet-110) on CIFAR-10 and ImageNet datasets. Without bells and whistles, when applied by itself, KD ticket performs on par or better than rewinding, while being nearly free of hyperparameters or ad-hoc selection. KD ticket can be further applied together with rewinding, yielding state-of-the-art results for large-scale lottery tickets.
We introduce a generalization to the lottery ticket hypothesis in which the notion of sparsity is relaxed by choosing an arbitrary basis in the space of parameters. We present evidence that the original results reported for the canonical basis contin
Many applications require sparse neural networks due to space or inference time restrictions. There is a large body of work on training dense networks to yield sparse networks for inference, but this limits the size of the largest trainable sparse mo
$textit{RigL}$, a sparse training algorithm, claims to directly train sparse networks that match or exceed the performance of existing dense-to-sparse training techniques (such as pruning) for a fixed parameter count and compute budget. We implement
We propose a new framework for reasoning about generalization in deep learning. The core idea is to couple the Real World, where optimizers take stochastic gradient steps on the empirical loss, to an Ideal World, where optimizers take steps on the po
Sparse Neural Networks (NNs) can match the generalization of dense NNs using a fraction of the compute/storage for inference, and also have the potential to enable efficient training. However, naively training unstructured sparse NNs from random init