This paper presents a novel network compression framework, Kernel Quantization (KQ), which aims to efficiently convert any pre-trained full-precision convolutional neural network (CNN) into a low-precision version without significant performance loss. Unlike existing methods that struggle to reduce the bit-length of individual weights, KQ improves the compression ratio by treating the convolution kernel as the quantization unit. Inspired by the evolution from weight pruning to filter pruning, we propose to quantize at both the kernel and weight levels. Instead of representing each weight with a low-bit index, we learn a kernel codebook and replace every kernel in a convolution layer with its low-bit index. KQ thus represents the weight tensor of a convolution layer with low-bit indexes and a kernel codebook of limited size, enabling a substantial compression ratio. We then apply 6-bit parameter quantization to the kernel codebook to further reduce redundancy. Extensive experiments on the ImageNet classification task show that KQ needs only 1.05 and 1.62 bits on average to represent each convolution-layer parameter in VGG and ResNet18, respectively, and achieves a state-of-the-art compression ratio with little accuracy loss.
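As a rough illustration of quantizing at the kernel level, the sketch below clusters all 3x3 kernels of one convolution layer into a 256-entry codebook with k-means and stores an 8-bit index per kernel; the clustering algorithm, the codebook size, and the absence of fine-tuning are simplifying assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch of kernel-level quantization for a single conv layer.
# Assumptions (ours, not the paper's): k-means learns the codebook and
# the codebook has 256 entries, so every kernel is replaced by one
# 8-bit index.
import numpy as np
from sklearn.cluster import KMeans

def kernel_quantize(weight, codebook_size=256):
    """weight: conv weight tensor of shape (out_ch, in_ch, k, k)."""
    out_ch, in_ch, k, _ = weight.shape
    kernels = weight.reshape(out_ch * in_ch, k * k)          # one row per kernel
    km = KMeans(n_clusters=codebook_size, n_init=4).fit(kernels)
    codebook = km.cluster_centers_.reshape(codebook_size, k, k)
    indexes = km.labels_.astype(np.uint8).reshape(out_ch, in_ch)
    return codebook, indexes

def dequantize(codebook, indexes):
    """Rebuild an approximate weight tensor from codebook + indexes."""
    return codebook[indexes]                                  # (out_ch, in_ch, k, k)

# Example layer: 128 output and 128 input channels of 3x3 kernels.
w = np.random.randn(128, 128, 3, 3).astype(np.float32)
codebook, idx = kernel_quantize(w)
w_hat = dequantize(codebook, idx)
```

With 3x3 kernels, an 8-bit index amortizes to roughly 0.9 bits per weight before the codebook itself is counted, which is the kind of accounting behind per-parameter bit counts such as those reported above.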
We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients are approximated with the Straight-Through Estimator.
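For context, quantization-aware training with the straight-through estimator can be sketched in a few lines; the symmetric 8-bit scheme and per-tensor scaling below are illustrative assumptions rather than the setup used in any particular paper.

```python
# Minimal sketch of quantization-aware training (QAT) with the
# straight-through estimator (STE): weights are quantized in the forward
# pass, while the backward pass treats the rounding as the identity.
import torch

class STEQuantize(torch.autograd.Function):
    """Symmetric 8-bit weight quantizer; the bit-width is an illustrative choice."""

    @staticmethod
    def forward(ctx, w):
        qmax = 127                                   # 2**(8 - 1) - 1
        scale = w.abs().max() / qmax                 # per-tensor scale
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pass the gradient straight through the rounding step.
        return grad_output

class QuantLinear(torch.nn.Linear):
    """Linear layer whose forward pass sees quantized weights."""

    def forward(self, x):
        w_q = STEQuantize.apply(self.weight)
        return torch.nn.functional.linear(x, w_q, self.bias)
```

Swapping QuantLinear in for torch.nn.Linear lets training see the quantized weights while gradient updates still flow to the full-precision copies.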
Deep Neural Networks (DNNs) are applied in a wide range of use cases. There is an increasing demand for deploying DNNs on devices that do not have abundant resources such as memory and computation units. Recently, network compression through a variety
The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be carried out using integer-only arithmetic.
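A common concrete instance of such a scheme is affine quantization with a scale and zero-point, so that a real value r is represented by a uint8 integer q via r ≈ scale * (q - zero_point); the sketch below shows this mapping generically and is not claimed to match the paper's exact scheme.

```python
# Minimal sketch of affine uint8 quantization: real values are mapped to
# unsigned 8-bit integers plus a (scale, zero_point) pair, so inference
# can run mostly in integer arithmetic with one final rescaling.
import numpy as np

def quantize_affine(x, num_bits=8):
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    return scale * (q.astype(np.int32) - zero_point)

x = np.random.randn(4, 4).astype(np.float32)
q, scale, zero_point = quantize_affine(x)
x_hat = dequantize_affine(q, scale, zero_point)   # x up to quantization error
```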
Wavelets are well known for data compression, yet have rarely been applied to the compression of neural networks. This paper shows how the fast wavelet transform can be used to compress linear layers in neural networks. Linear layers still occupy a significant portion of the parameters in recurrent neural networks (RNNs).
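To make the wavelet idea concrete, the sketch below compresses a weight matrix with a fixed Haar wavelet from PyWavelets and simple coefficient thresholding; the paper's contribution is a learnable wavelet transform, so the fixed wavelet and the 10% keep ratio here are purely illustrative assumptions.

```python
# Minimal sketch of compressing a linear layer's weight matrix in the
# wavelet domain: transform with a fixed Haar wavelet, zero out the small
# coefficients, and transform back. A learnable transform (as in the
# paper) would replace the fixed wavelet.
import numpy as np
import pywt

def wavelet_compress(weight, wavelet="haar", keep_ratio=0.1):
    coeffs = pywt.wavedec2(weight, wavelet)
    arr, slices = pywt.coeffs_to_array(coeffs)
    threshold = np.quantile(np.abs(arr), 1.0 - keep_ratio)
    arr[np.abs(arr) < threshold] = 0.0            # keep only the largest coefficients
    kept = pywt.array_to_coeffs(arr, slices, output_format="wavedec2")
    return pywt.waverec2(kept, wavelet)

w = np.random.randn(512, 512).astype(np.float32)
w_hat = wavelet_compress(w)                        # approximate reconstruction
```

In practice only the surviving coefficients and their positions would be stored, which is where the compression comes from.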
Quantized Neural Networks (QNNs) use low bit-width fixed-point numbers to represent weight parameters and activations, and are often used in real-world applications because they save computation resources and yield reproducible results. Ba