A Mixed-Precision Quantization Method without Accuracy Degradation Using Semilayers
Proceedings of the 15th Asian Conference on Machine Learning, PMLR 222:882-894, 2024.
Abstract
Reducing the memory usage and computational complexity of high-performance deep neural networks while minimizing the degradation of accuracy is a key issue in deploying these models on edge devices. To address this issue, partial quantization methods have been proposed that quantize only part of the weight parameters of neural network models. However, the accuracy of existing methods degrades rapidly as the compression ratio increases. Although retraining can compensate for this degradation to some extent, it is computationally very expensive. In this study, we propose a mixed-precision quantization algorithm that requires no retraining and causes no degradation in accuracy. In the proposed method, first, the difference between the loss values after and before quantization is calculated for every channel in each layer of the pretrained model. Next, each layer is divided into two groups, called semilayers, according to whether the loss difference of each channel is positive or negative. The priorities for quantization within the semilayers are determined based on the Kullback-Leibler divergence between the probability distributions of the softmax output after and before quantization. The same process is repeated as a mixed-precision quantization while gradually decreasing the bit width, for example, with 8-, 6-, and 4-bit quantization. The results of an experimental evaluation show that the proposed method successfully compressed a ResNet-18 model by 81.44%, a ResNet-34 model by 84.25%, and a ResNet-50 model by 80.39% on image classification tasks using the ImageNet dataset, and a ResNet-18 model by 80.56% on image classification tasks using the CIFAR-10 dataset, with no degradation of the inference accuracy of the pretrained models.
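The sketch below illustrates the two ideas named in the abstract: splitting a layer's channels into two semilayers by the sign of the per-channel loss change, and ranking channels by the Kullback-Leibler divergence between the softmax outputs after and before quantization. It is a minimal PyTorch sketch, not the authors' implementation: the uniform symmetric fake quantizer, the single calibration batch, the direction of the KL divergence, and the helper names (`fake_quantize`, `semilayer_split`) are all assumptions made for illustration.

```python
# Minimal sketch of semilayer grouping and KL-based priority (assumed details).
import torch
import torch.nn.functional as F


def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric fake quantization of a weight tensor (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = w.abs().max()
    scale = max_abs / qmax if max_abs > 0 else torch.tensor(1.0)
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale


@torch.no_grad()
def semilayer_split(model, layer, inputs, labels, bits=8):
    """Split one layer's output channels into two semilayers by the sign of the
    loss change caused by quantizing that channel alone, and score each channel
    by the KL divergence of the softmax output after vs. before quantization."""
    base_logits = model(inputs)
    base_loss = F.cross_entropy(base_logits, labels)
    positive, negative = [], []          # channel indices per semilayer
    kl_scores = {}
    for c in range(layer.weight.shape[0]):
        original = layer.weight[c].clone()
        layer.weight[c] = fake_quantize(original, bits)
        logits = model(inputs)
        loss_diff = F.cross_entropy(logits, labels) - base_loss
        # KL divergence between softmax outputs after and before quantization,
        # used here as the within-semilayer quantization priority (assumed order).
        kl = F.kl_div(F.log_softmax(logits, dim=1),
                      F.softmax(base_logits, dim=1),
                      reduction="batchmean")
        (positive if loss_diff.item() > 0 else negative).append(c)
        kl_scores[c] = kl.item()
        layer.weight[c] = original       # restore before testing the next channel
    # Smaller KL -> smaller output distortion -> quantize earlier (assumption).
    positive.sort(key=kl_scores.get)
    negative.sort(key=kl_scores.get)
    return positive, negative, kl_scores
```

In such a scheme, the same routine would be invoked again at each lower bit width (e.g., 8, then 6, then 4 bits) to realize the mixed-precision schedule described in the abstract; how channels already quantized at a higher precision are treated in later passes is a detail of the paper not reproduced here.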