GWQ: Group-Wise Quantization Framework for Neural Networks

Jiaming Yang, Chenwei Tang, Caiyang Yu, Jiancheng Lv
Proceedings of the 15th Asian Conference on Machine Learning, PMLR 222:1526-1541, 2024.

Abstract

As the most commonly used quantization technique for deep neural networks, int-only quantization uses a scale factor to linearly approximate the weights or activations of each layer. However, when passing activation data between layers, such int-only quantization methods require extra Scale Factor Conversion (SFC) operations, resulting in computational overhead. In this paper, we propose a Group-Wise Quantization framework, called GWQ, that reduces the computational cost of passing activation data by allowing multiple layers to share one scale factor in SFC operations. Specifically, the GWQ framework comprises two algorithms: one for grouping network layers and one for model training. For grouping, we propose an algorithm based on the similarity of the layers' data distributions. The layers assigned to the same group are then quantized with a common scale factor, reducing computational consumption. To counter the additional performance loss caused by sharing scale factors among multiple layers, the training algorithm jointly optimizes the shared scale factors and the model parameters by introducing a learnable power-of-two scaling parameter for each layer. Extensive experiments demonstrate that the proposed GWQ framework effectively reduces the computational burden during inference with negligible impact on model performance.
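To make the role of scale factors and SFC concrete, the following minimal NumPy sketch (not the authors' implementation; the function names quantize, dequantize, and sfc, the example scale values, and the choice of shared scale are illustrative assumptions) shows why per-layer scale factors force a rescaling step whenever activations move between layers, and how a scale factor shared within a group removes that step.

import numpy as np

def quantize(x, scale, bits=8):
    # Uniform integer quantization: x is approximated as scale * q.
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), qmin, qmax).astype(np.int32)

def dequantize(q, scale):
    return q * scale

def sfc(q, scale_from, scale_to):
    # Scale Factor Conversion: re-express integers quantized with scale_from
    # on the grid defined by scale_to (an extra multiply/round per transfer).
    return np.round(q * (scale_from / scale_to)).astype(np.int32)

rng = np.random.default_rng(0)
act = rng.normal(size=1000).astype(np.float32)

# Per-layer scales: passing activations from layer 1 to layer 2 needs an SFC step.
s1, s2 = 0.021, 0.018                  # hypothetical per-layer scale factors
q2_in = sfc(quantize(act, s1), s1, s2)

# Group-wise idea: if both layers share one scale, the SFC step becomes the
# identity and can be skipped. (Taking the max as the shared scale is only an
# assumption here; GWQ groups layers by distribution similarity and trains the
# shared scales.)
s_group = max(s1, s2)
q_shared = quantize(act, s_group)      # consumed directly by the next layer

The per-layer power-of-two scaling parameter mentioned in the abstract fits this picture: rescaling by a power of two reduces to a bit shift in integer arithmetic, so each layer in a group can presumably still adapt its range cheaply while the group keeps a single shared scale factor.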

Cite this Paper


BibTeX
@InProceedings{pmlr-v222-yang24a,
  title     = {{GWQ}: {G}roup-Wise Quantization Framework for Neural Networks},
  author    = {Yang, Jiaming and Tang, Chenwei and Yu, Caiyang and Lv, Jiancheng},
  booktitle = {Proceedings of the 15th Asian Conference on Machine Learning},
  pages     = {1526--1541},
  year      = {2024},
  editor    = {Yanıkoğlu, Berrin and Buntine, Wray},
  volume    = {222},
  series    = {Proceedings of Machine Learning Research},
  month     = {11--14 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v222/yang24a/yang24a.pdf},
  url       = {https://proceedings.mlr.press/v222/yang24a.html}
}
Endnote
%0 Conference Paper
%T GWQ: Group-Wise Quantization Framework for Neural Networks
%A Jiaming Yang
%A Chenwei Tang
%A Caiyang Yu
%A Jiancheng Lv
%B Proceedings of the 15th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Berrin Yanıkoğlu
%E Wray Buntine
%F pmlr-v222-yang24a
%I PMLR
%P 1526--1541
%U https://proceedings.mlr.press/v222/yang24a.html
%V 222
APA
Yang, J., Tang, C., Yu, C., & Lv, J. (2024). GWQ: Group-Wise Quantization Framework for Neural Networks. Proceedings of the 15th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 222:1526-1541. Available from https://proceedings.mlr.press/v222/yang24a.html.
