Training DNNs in O(1) memory with MEM-DFA using Random Matrices

 Added by Zbigniew Wojna
 Publication date 2020
The research is written in English.




This work presents a method for reducing memory consumption to constant complexity when training deep neural networks. The algorithm is based on two more biologically plausible alternatives to backpropagation (BP): direct feedback alignment (DFA) and feedback alignment (FA), which use random matrices to propagate the error. The proposed method, memory-efficient direct feedback alignment (MEM-DFA), exploits the greater independence of layers in DFA to avoid storing all activation vectors at once, unlike standard BP, FA, and DFA. Thus, our algorithm's memory usage is constant regardless of the number of layers in a neural network. The method increases the computational cost only by a constant factor of one extra forward pass. MEM-DFA, BP, FA, and DFA were evaluated along with their memory profiles on the MNIST and CIFAR-10 datasets on various neural network models. Our experiments agree with our theoretical results and show a significant decrease in the memory cost of MEM-DFA compared to the other algorithms.
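The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tanh hidden layers, linear output layer, squared-error loss, learning rate, and all function names (`mem_dfa_step`, `forward_error`, and so on) are assumptions made for illustration. The first forward pass keeps only the current activation and yields the output error; the second forward pass recomputes each activation and updates that layer immediately, since a DFA update needs only the layer's input, its own output, and the output error.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.5):
    """Uniform random matrix; used for both weights and fixed feedback matrices."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def forward_error(weights, x, target):
    """First pass: keep only the current activation and return the output error."""
    h = x
    for W in weights[:-1]:
        h = [math.tanh(a) for a in matvec(W, h)]
    y = matvec(weights[-1], h)  # linear output layer (assumption)
    return [yi - ti for yi, ti in zip(y, target)]

def mem_dfa_step(weights, feedback, x, target, lr=0.1):
    """One training step that mutates `weights` in place.

    Only the current and next activation vectors are alive at any time,
    so activation memory does not grow with the number of layers.
    """
    e = forward_error(weights, x, target)
    h = x
    for W, B in zip(weights[:-1], feedback):
        # Recompute this layer's output with the not-yet-updated weights.
        h_next = [math.tanh(a) for a in matvec(W, h)]
        # DFA signal: random projection of the output error, gated by tanh'.
        delta = [be * (1.0 - hn * hn) for be, hn in zip(matvec(B, e), h_next)]
        for i, d in enumerate(delta):
            for j, hj in enumerate(h):
                W[i][j] -= lr * d * hj
        h = h_next  # the previous activation can now be discarded
    # The output layer is updated with the true error.
    for i, ei in enumerate(e):
        for j, hj in enumerate(h):
            weights[-1][i][j] -= lr * ei * hj
    return e

# Hypothetical layer widths for a small demonstration.
rng = random.Random(0)
sizes = [3, 5, 5, 2]
weights = [rand_matrix(sizes[i + 1], sizes[i], rng) for i in range(len(sizes) - 1)]
feedback = [rand_matrix(sizes[i + 1], sizes[-1], rng) for i in range(len(sizes) - 2)]
error = mem_dfa_step(weights, feedback, [0.5, -0.2, 0.1], [1.0, 0.0])
```

Because each layer's output is recomputed before that layer is updated, this sketch produces the same weight update as a DFA step that stores every activation, at the cost of the one extra forward pass the abstract mentions.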



Read More

Lei Huang, Li Liu, Fan Zhu (2020)
Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1 and reduce redundancy in representation. This paper proposes a computationally efficient and numerically stable orthogonalization method using Newton's iteration (ONI) to learn a layer-wise orthogonal weight matrix in DNNs. ONI works by iteratively stretching the singular values of a weight matrix towards 1. This property enables it to control the orthogonality of a weight matrix by its number of iterations. We show that our method improves the performance of image classification networks by effectively controlling the orthogonality to provide an optimal tradeoff between optimization benefits and representational capacity reduction. We also show that ONI stabilizes the training of generative adversarial networks (GANs) by maintaining the Lipschitz continuity of a network, similar to spectral normalization (SN), and further outperforms SN by providing controllable orthogonality.
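The "stretching singular values towards 1" behaviour can be illustrated with the standard Newton-Schulz iteration, sketched below. This is a generic form of the iteration and may differ from the exact ONI formulation in the paper; the Frobenius-norm scaling and the function names (`newton_schulz`, `orthogonality_gap`) are assumptions for illustration. Each step maps every singular value s to 1.5s - 0.5s^3, which moves values in (0, 1] towards 1.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz(V, iterations):
    """Approximately orthogonalize the rows of V.

    V is first divided by its Frobenius norm so that all singular values
    are at most 1, then W <- 1.5 W - 0.5 (W W^T) W is applied repeatedly.
    """
    norm = math.sqrt(sum(v * v for row in V for v in row))
    W = [[v / norm for v in row] for row in V]
    for _ in range(iterations):
        cubic = matmul(matmul(W, transpose(W)), W)
        W = [[1.5 * w - 0.5 * c for w, c in zip(rw, rc)] for rw, rc in zip(W, cubic)]
    return W

def orthogonality_gap(W):
    """Largest absolute entry of W W^T - I; zero for exactly orthonormal rows."""
    G = matmul(W, transpose(W))
    n = len(G)
    return max(abs(G[i][j] - (1.0 if i == j else 0.0)) for i in range(n) for j in range(n))

# Hypothetical 2x3 weight matrix: more iterations give a smaller gap.
V = [[1.0, 2.0, 3.0], [4.0, 5.0, 7.0]]
gap_few = orthogonality_gap(newton_schulz(V, 3))
gap_many = orthogonality_gap(newton_schulz(V, 30))
```

Stopping after a small number of iterations leaves the rows only partly orthogonal, which is one way to read the abstract's remark that the iteration count controls the degree of orthogonality.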
The wide adoption of DNNs has given birth to unrelenting computing requirements, forcing datacenter operators to adopt domain-specific accelerators to train them. These accelerators typically employ densely packed full-precision floating-point arithmetic to maximize performance per area. Ongoing research efforts seek to further increase that performance density by replacing floating-point with fixed-point arithmetic. However, a significant roadblock for these attempts has been fixed point's narrow dynamic range, which is insufficient for DNN training convergence. We identify block floating point (BFP) as a promising alternative representation since it exhibits wide dynamic range and enables the majority of DNN operations to be performed with fixed-point logic. Unfortunately, BFP alone introduces several limitations that preclude its direct applicability. In this work, we introduce HBFP, a hybrid BFP-FP approach, which performs all dot products in BFP and other operations in floating point. HBFP delivers the best of both worlds: the high accuracy of floating point at the superior hardware density of fixed point. For a wide variety of models, we show that HBFP matches floating point's accuracy while enabling hardware implementations that deliver up to 8.5x higher throughput.
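To make the block floating point idea concrete, here is a minimal sketch of a generic BFP dot product. It is not the HBFP format from the paper: the 8-bit signed mantissa, the rounding and clamping policy, and the function names (`to_bfp`, `bfp_dot`) are assumptions for illustration. Each block shares one exponent, so the multiply-accumulate runs on plain integers and the exponents are applied once at the end.

```python
import math

def to_bfp(block, mantissa_bits=8):
    """Quantize a block of floats to integer mantissas with one shared exponent."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return [0] * len(block), 0
    # frexp gives max_abs = m * 2**exp with 0.5 <= m < 1, so |v| / 2**exp < 1.
    exp = math.frexp(max_abs)[1]
    scale = 2 ** (mantissa_bits - 1)
    limit = scale - 1
    mantissas = []
    for v in block:
        q = round(v / math.ldexp(1.0, exp) * scale)
        mantissas.append(max(-limit, min(limit, q)))  # clamp to the signed range
    return mantissas, exp

def bfp_dot(a, b, mantissa_bits=8):
    """Dot product using integer multiply-accumulate on BFP mantissas."""
    ma, ea = to_bfp(a, mantissa_bits)
    mb, eb = to_bfp(b, mantissa_bits)
    acc = sum(x * y for x, y in zip(ma, mb))  # fixed-point (integer) arithmetic
    # Undo both shared exponents and both mantissa scalings in a single step.
    return math.ldexp(float(acc), ea + eb - 2 * (mantissa_bits - 1))

# Hypothetical input blocks.
a = [0.5, -1.25, 3.0, 0.125]
b = [2.0, 0.75, -0.5, 4.0]
approx = bfp_dot(a, b)
exact = sum(x * y for x, y in zip(a, b))
```

The shared exponent gives the block a wide dynamic range, while small values inside a block with one large value lose precision, which is one reason a plain BFP scheme needs care before being used for training.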
This paper proposes Quantizable DNNs, a special type of DNN that can flexibly quantize its bit-width (denoted as "bit modes" hereafter) during execution without further re-training. To simultaneously optimize for all bit modes, a combinational loss of all bit modes is proposed, which enforces consistent predictions ranging from low-bit mode to 32-bit mode. This Consistency-based Loss may also be viewed as a certain form of regularization during training. Because outputs of matrix multiplication in different bit modes have different distributions, we introduce Bit-Specific Batch Normalization so as to reduce conflicts among different bit modes. Experiments on CIFAR100 and ImageNet have shown that, compared to quantized DNNs, Quantizable DNNs not only have much better flexibility, but also achieve even higher classification accuracy. Ablation studies further verify that the regularization through the consistency-based loss indeed improves the model's generalization performance.
It is well-known that spatial averaging can be realized (in space or frequency domain) using algorithms whose complexity does not depend on the size or shape of the filter. These fast algorithms are generally referred to as constant-time or O(1) algorithms in the image processing literature. Along with the spatial filter, the edge-preserving bilateral filter [Tomasi1998] involves an additional range kernel. This is used to restrict the averaging to those neighborhood pixels whose intensities are similar or close to that of the pixel of interest. The range kernel operates by acting on the pixel intensities. This makes the averaging process non-linear and computationally intensive, especially when the spatial filter is large. In this paper, we show how the O(1) averaging algorithms can be leveraged for realizing the bilateral filter in constant time, by using trigonometric range kernels. This is done by generalizing the idea in [Porikli2008] of using polynomial range kernels. The class of trigonometric kernels turns out to be sufficiently rich, allowing for the approximation of the standard Gaussian bilateral filter. The attractive feature of our approach is that, for a fixed number of terms, the quality of approximation achieved using trigonometric kernels is much superior to that obtained in [Porikli2008] using polynomials.
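The reduction described above can be sketched in one dimension with a single cosine range kernel. This is an illustration under assumptions, not the paper's algorithm: the box spatial filter, the single-term kernel cos(gamma * t), the choice of `gamma`, and the function names are all hypothetical. The identity cos(g(f_j - f_i)) = cos(g f_j)cos(g f_i) + sin(g f_j)sin(g f_i) turns the non-linear filter into four ordinary box averages, each computed with prefix sums at a cost independent of the window radius.

```python
import math

def box_sums(values, radius):
    """Sliding-window sums via prefix sums; cost per sample is independent of radius."""
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    n = len(values)
    return [prefix[min(n, i + radius + 1)] - prefix[max(0, i - radius)] for i in range(n)]

def cosine_bilateral(signal, radius, gamma):
    """Bilateral filter with box spatial kernel and range kernel cos(gamma * t).

    gamma should satisfy gamma * (max - min of signal) <= pi / 2 so that
    the range weights stay non-negative.
    """
    c = [math.cos(gamma * f) for f in signal]
    s = [math.sin(gamma * f) for f in signal]
    sum_c = box_sums(c, radius)
    sum_s = box_sums(s, radius)
    sum_fc = box_sums([f * ci for f, ci in zip(signal, c)], radius)
    sum_fs = box_sums([f * si for f, si in zip(signal, s)], radius)
    out = []
    for i in range(len(signal)):
        numerator = c[i] * sum_fc[i] + s[i] * sum_fs[i]
        denominator = c[i] * sum_c[i] + s[i] * sum_s[i]
        out.append(numerator / denominator)
    return out

# Hypothetical signal with intensities in [0, 1].
signal = [0.1, 0.15, 0.9, 0.85, 0.2, 0.5, 0.55, 0.0, 1.0, 0.3]
filtered = cosine_bilateral(signal, radius=2, gamma=math.pi / 2)
```

A single cosine is a crude range kernel; approximating a Gaussian, as the abstract describes, would combine several trigonometric terms, each handled by the same kind of box averaging.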
We consider ensembles of real symmetric band matrices with entries drawn from an infinite sequence of exchangeable random variables, as far as the symmetry of the matrices permits. In general the entries of the upper triangular parts of these matrices are correlated and no smallness or sparseness of these correlations is assumed. It is shown that the eigenvalue distribution measures still converge to a semicircle but with random scaling. We also investigate the asymptotic behavior of the corresponding $\ell_2$-operator norms. The key to our analysis is a generalisation of a classic result by de Finetti that allows us to represent the underlying probability spaces as averages of Wigner band ensembles with entries that are not necessarily centred. Some of our results appear to be new even for such Wigner band matrices.