ترغب بنشر مسار تعليمي؟ اضغط هنا

Ternary Neural Networks with Fine-Grained Quantization

420   0   0.0 ( 0 )
 نشر من قبل Naveen Mellempudi
 تاريخ النشر 2017
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

We propose a novel fine-grained quantization (FGQ) method to ternarize pre-trained full precision models, while also constraining activations to 8 and 4-bits. Using this method, we demonstrate a minimal loss in classification accuracy on state-of-the-art topologies without additional training. We provide an improved theoretical formulation that forms the basis for a higher quality solution using FGQ. Our method involves ternarizing the original weight tensor in groups of $N$ weights. Using $N=4$, we achieve Top-1 accuracy within $3.7%$ and $4.2%$ of the baseline full precision result for Resnet-101 and Resnet-50 respectively, while eliminating $75%$ of all multiplications. These results enable a full 8/4-bit inference pipeline, with best-reported accuracy using ternary weights on ImageNet dataset, with a potential of $9times$ improvement in performance. Also, for smaller networks like AlexNet, FGQ achieves state-of-the-art results. We further study the impact of group size on both performance and accuracy. With a group size of $N=64$, we eliminate $approx99%$ of the multiplications; however, this introduces a noticeable drop in accuracy, which necessitates fine tuning the parameters at lower precision. We address this by fine-tuning Resnet-50 with 8-bit activations and ternary weights at $N=64$, improving the Top-1 accuracy to within $4%$ of the full precision result with $<30%$ additional training overhead. Our final quantized model can run on a full 8-bit compute pipeline using 2-bit weights and has the potential of up to $15times$ improvement in performance compared to baseline full-precision models.

قيم البحث

اقرأ أيضاً

Existing uncertainty modeling approaches try to detect an out-of-distribution point from the in-distribution dataset. We extend this argument to detect finer-grained uncertainty that distinguishes between (a). certain points, (b). uncertain points bu t within the data distribution, and (c). out-of-distribution points. Our method corrects overconfident NN decisions, detects outlier points and learns to say ``I dont know when uncertain about a critical point between the top two predictions. In addition, we provide a mechanism to quantify class distributions overlap in the decision manifold and investigate its implications in model interpretability. Our method is two-step: in the first step, the proposed method builds a class distribution using Kernel Activation Vectors (kav) extracted from the Network. In the second step, the algorithm determines the confidence of a test point by a hierarchical decision rule based on the chi-squared distribution of squared Mahalanobis distances. Our method sits on top of a given Neural Network, requires a single scan of training data to estimate class distribution statistics, and is highly scalable to deep networks and wider pre-softmax layer. As a positive side effect, our method helps to prevent adversarial attacks without requiring any additional training. It is directly achieved when the Softmax layer is substituted by our robust uncertainty layer at the evaluation phase.
193 - Tianyu Zhang , Lei Zhu , Qian Zhao 2019
Quantization of weights of deep neural networks (DNN) has proven to be an effective solution for the purpose of implementing DNNs on edge devices such as mobiles, ASICs and FPGAs, because they have no sufficient resources to support computation invol ving millions of high precision weights and multiply-accumulate operations. This paper proposes a novel method to compress vectors of high precision weights of DNNs to ternary vectors, namely a cosine similarity based target non-retraining ternary (TNT) compression method. Our method leverages cosine similarity instead of Euclidean distances as commonly used in the literature and succeeds in reducing the size of the search space to find optimal ternary vectors from 3N to N, where N is the dimension of target vectors. As a result, the computational complexity for TNT to find theoretically optimal ternary vectors is only O(N log(N)). Moreover, our experiments show that, when we ternarize models of DNN with high precision parameters, the obtained quantized models can exhibit sufficiently high accuracy so that re-training models is not necessary.
We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time and when computing the parameters gradient at train-time. We conduct two sets of experiments, each based on a different framework, namely Torch7 and Theano, where we train BNNs on MNIST, CIFAR-10 and SVHN, and achieve nearly state-of-the-art results. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which might lead to a great increase in power-efficiency. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available.
Deep neural networks currently demonstrate state-of-the-art performance in several domains. At the same time, models of this class are very demanding in terms of computational resources. In particular, a large amount of memory is required by commonly used fully-connected layers, making it hard to use the models on low-end devices and stopping the further increase of the model size. In this paper we convert the dense weight matrices of the fully-connected layers to the Tensor Train format such that the number of parameters is reduced by a huge factor and at the same time the expressive power of the layer is preserved. In particular, for the Very Deep VGG networks we report the compression factor of the dense weight matrix of a fully-connected layer up to 200000 times leading to the compression factor of the whole network up to 7 times.
The computation and storage requirements for Deep Neural Networks (DNNs) are usually high. This issue limits their deployability on ubiquitous computing devices such as smart phones, wearables and autonomous drones. In this paper, we propose ternary neural networks (TNNs) in order to make deep learning more resource-efficient. We train these TNNs using a teacher-student approach based on a novel, layer-wise greedy methodology. Thanks to our two-stage training procedure, the teacher network is still able to use state-of-the-art methods such as dropout and batch normalization to increase accuracy and reduce training time. Using only ternary weights and activations, the student ternary network learns to mimic the behavior of its teacher network without using any multiplication. Unlike its -1,1 binary counterparts, a ternary neural network inherently prunes the smaller weights by setting them to zero during training. This makes them sparser and thus more energy-efficient. We design a purpose-built hardware architecture for TNNs and implement it on FPGA and ASIC. We evaluate TNNs on several benchmark datasets and demonstrate up to 3.1x better energy efficiency with respect to the state of the art while also improving accuracy.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا