ﻻ يوجد ملخص باللغة العربية
To bridge the ever increasing gap between deep neural networks complexity and hardware capability, network quantization has attracted more and more research attention. The latest trend of mixed precision quantization takes advantage of hardwares multiple bit-width arithmetic operations to unleash the full potential of network quantization. However, this also results in a difficult integer programming formulation, and forces most existing approaches to use an extremely time-consuming search process even with various relaxations. Instead of solving a problem of the original integer programming, we propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer programming but also easy to optimize with linear programming. This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy. Specifically, on post-training quantization, we achieve 71.27% Top-1 accuracy on MobileNetV2, which only takes 9 seconds for searching and 1.4 GPU hours for finetuning on ImageNet. Our codes are avaliable at https://github.com/MAC-AutoML/OMPQ.
We consider the post-training quantization problem, which discretizes the weights of pre-trained deep neural networks without re-training the model. We propose multipoint quantization, a quantization method that approximates a full-precision weight v
Model quantization is challenging due to many tedious hyper-parameters such as precision (bitwidth), dynamic range (minimum and maximum discrete values) and stepsize (interval between discrete values). Unlike prior arts that carefully tune these valu
Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks, and thus, have been widely investigated. However, it lacks a systematic method to determine the exact quantiza
Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emergent DNN hardware accelerators begin to support mixed precision (1-8 bits) to further improve the computation efficiency, which raises a
Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emergent DNN hardware accelerators begin to support mixed precision (1-8 bits) to further improve the computation efficiency, which raises a