GWQ: Group-Wise Quantization Framework for Neural Networks
Proceedings of the 15th Asian Conference on Machine Learning, PMLR 222:1526-1541, 2024.
Abstract
As the most commonly used quantization techniques for deep neural networks, int-only quantization methods use a scale factor to linearly approximate the weights or activations of each layer. However, when activation data are passed between layers, such int-only methods require extra Scale Factor Conversion (SFC) operations, which introduce computational overhead. In this paper, we propose a Group-Wise Quantization framework, called GWQ, that reduces the computational cost of passing activation data between layers by allowing multiple layers to share one scale factor in SFC operations. Specifically, the GWQ framework comprises two algorithms, one for grouping network layers and one for training the model. For the grouping of network layers, we propose an algorithm based on the similarity of the layers' numerical data distributions; layers assigned to the same group are then quantized with a common shared scale factor, reducing computational cost. Because sharing scale factors across multiple layers incurs additional performance loss, we further propose a training algorithm that optimizes these shared scale factors together with the model parameters by introducing a learnable power-of-two scaling parameter for each layer. Extensive experiments demonstrate that the proposed GWQ framework effectively reduces the computational burden during inference while having a negligible impact on model performance.
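To make the core idea concrete, the following is a minimal sketch, not the authors' implementation: it calibrates a per-layer scale factor for symmetric int-only quantization, greedily groups layers whose scales are close (a simplified stand-in for the paper's distribution-similarity criterion), and quantizes every layer in a group with one shared scale, so no per-layer scale conversion is needed within the group. All function names, the grouping threshold, and the heuristic itself are illustrative assumptions; the learnable power-of-two scaling parameter used during training is not shown.

```python
import numpy as np

def calibrate_scale(activations, num_bits=8):
    # Per-layer scale for symmetric int-only quantization: map the
    # observed dynamic range onto the signed integer grid.
    max_abs = np.abs(activations).max()
    return max_abs / (2 ** (num_bits - 1) - 1)

def group_layers_by_scale(layer_scales, threshold=0.15):
    # Greedy grouping (assumed heuristic): a layer joins an existing
    # group if its scale is within `threshold` (relative) of the
    # group's representative scale; otherwise it starts a new group.
    groups = []  # list of [representative_scale, member_names]
    for name, scale in layer_scales.items():
        for group in groups:
            rep_scale, members = group
            if abs(scale - rep_scale) / rep_scale < threshold:
                members.append(name)
                break
        else:
            groups.append([scale, [name]])
    return groups

def quantize(x, scale, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic calibration activations for four layers; layers with
    # similar dynamic ranges end up sharing one scale factor.
    acts = {
        "conv1": rng.normal(0, 1.0, 10_000),
        "conv2": rng.normal(0, 1.1, 10_000),
        "conv3": rng.normal(0, 4.0, 10_000),
        "conv4": rng.normal(0, 4.2, 10_000),
    }
    scales = {name: calibrate_scale(a) for name, a in acts.items()}
    for shared_scale, members in group_layers_by_scale(scales):
        print(f"shared scale {shared_scale:.4f} -> {members}")
        for name in members:
            # Layers in the same group reuse the shared scale, avoiding
            # extra scale-factor conversions between them.
            q = quantize(acts[name], shared_scale)
```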