A2Q+: Improving Accumulator-Aware Weight Quantization

Ian Colbert, Alessandro Pappalardo, Jakoba Petri-Koenig, Yaman Umuroglu
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:9275-9291, 2024.

Abstract

Quantization techniques commonly reduce the inference costs of neural networks by restricting the precision of weights and activations. Recent studies show that also reducing the precision of the accumulator can further improve hardware efficiency at the risk of numerical overflow, which introduces arithmetic errors that can degrade model accuracy. To avoid numerical overflow while maintaining accuracy, recent work proposed accumulator-aware quantization (A2Q), a quantization-aware training method that constrains model weights during training to safely use a target accumulator bit width during inference. Although this shows promise, we demonstrate that A2Q relies on an overly restrictive constraint and a sub-optimal weight initialization strategy that each introduce superfluous quantization error. To address these shortcomings, we introduce: (1) an improved bound that alleviates accumulator constraints without compromising overflow avoidance; and (2) a new strategy for initializing quantized weights from pre-trained floating-point checkpoints. We combine these contributions with weight normalization to introduce A2Q+. We identify and characterize the various trade-offs that arise as a consequence of accumulator constraints and support our analysis with experiments that show A2Q+ significantly improves these trade-offs when compared to prior methods.
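
The abstract's central idea, that a dot product of quantized activations and weights can be kept within a fixed-width accumulator by constraining the weights, can be illustrated with a simple worst-case check. The sketch below is a minimal NumPy illustration assuming unsigned N-bit activations, signed integer weights, and a signed two's-complement P-bit accumulator; the L1-norm test it applies is a generic sufficient condition for overflow avoidance, not the exact A2Q constraint or the improved A2Q+ bound derived in the paper, and the function name accumulator_safe is hypothetical.

    import numpy as np

    def accumulator_safe(q_weights, input_bits, acc_bits):
        """Worst-case (data-independent) overflow check for a dot product.

        Assumes unsigned input_bits-bit activations in [0, 2**input_bits - 1]
        and a signed two's-complement acc_bits-bit accumulator. If the L1 norm
        of the integer weights times the largest possible activation fits in
        the accumulator range, no input can cause overflow. This is a generic
        sufficient condition for illustration, not the exact A2Q/A2Q+ bound.
        """
        max_act = (1 << input_bits) - 1       # largest unsigned activation value
        acc_max = (1 << (acc_bits - 1)) - 1   # largest positive accumulator value
        worst_case = int(np.abs(q_weights).sum()) * max_act
        return worst_case <= acc_max

    # Example: 128 signed 4-bit weights, 8-bit activations, 24-bit accumulator.
    rng = np.random.default_rng(0)
    q = rng.integers(-8, 8, size=128)
    print(accumulator_safe(q, input_bits=8, acc_bits=24))  # expected: True

A2Q enforces a constraint of this flavor on each output channel during training via weight normalization; the paper's contribution is a less restrictive bound, together with an initialization strategy from pre-trained floating-point checkpoints, that leaves more of the weight space usable for a given accumulator width. The exact bounds are given in the paper.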

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-colbert24a,
  title     = {{A}2{Q}+: Improving Accumulator-Aware Weight Quantization},
  author    = {Colbert, Ian and Pappalardo, Alessandro and Petri-Koenig, Jakoba and Umuroglu, Yaman},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {9275--9291},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/colbert24a/colbert24a.pdf},
  url       = {https://proceedings.mlr.press/v235/colbert24a.html},
  abstract  = {Quantization techniques commonly reduce the inference costs of neural networks by restricting the precision of weights and activations. Recent studies show that also reducing the precision of the accumulator can further improve hardware efficiency at the risk of numerical overflow, which introduces arithmetic errors that can degrade model accuracy. To avoid numerical overflow while maintaining accuracy, recent work proposed accumulator-aware quantization (A2Q), a quantization-aware training method that constrains model weights during training to safely use a target accumulator bit width during inference. Although this shows promise, we demonstrate that A2Q relies on an overly restrictive constraint and a sub-optimal weight initialization strategy that each introduce superfluous quantization error. To address these shortcomings, we introduce: (1) an improved bound that alleviates accumulator constraints without compromising overflow avoidance; and (2) a new strategy for initializing quantized weights from pre-trained floating-point checkpoints. We combine these contributions with weight normalization to introduce A2Q+. We identify and characterize the various trade-offs that arise as a consequence of accumulator constraints and support our analysis with experiments that show A2Q+ significantly improves these trade-offs when compared to prior methods.}
}
Endnote
%0 Conference Paper
%T A2Q+: Improving Accumulator-Aware Weight Quantization
%A Ian Colbert
%A Alessandro Pappalardo
%A Jakoba Petri-Koenig
%A Yaman Umuroglu
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-colbert24a
%I PMLR
%P 9275--9291
%U https://proceedings.mlr.press/v235/colbert24a.html
%V 235
%X Quantization techniques commonly reduce the inference costs of neural networks by restricting the precision of weights and activations. Recent studies show that also reducing the precision of the accumulator can further improve hardware efficiency at the risk of numerical overflow, which introduces arithmetic errors that can degrade model accuracy. To avoid numerical overflow while maintaining accuracy, recent work proposed accumulator-aware quantization (A2Q), a quantization-aware training method that constrains model weights during training to safely use a target accumulator bit width during inference. Although this shows promise, we demonstrate that A2Q relies on an overly restrictive constraint and a sub-optimal weight initialization strategy that each introduce superfluous quantization error. To address these shortcomings, we introduce: (1) an improved bound that alleviates accumulator constraints without compromising overflow avoidance; and (2) a new strategy for initializing quantized weights from pre-trained floating-point checkpoints. We combine these contributions with weight normalization to introduce A2Q+. We identify and characterize the various trade-offs that arise as a consequence of accumulator constraints and support our analysis with experiments that show A2Q+ significantly improves these trade-offs when compared to prior methods.
APA
Colbert, I., Pappalardo, A., Petri-Koenig, J. & Umuroglu, Y. (2024). A2Q+: Improving Accumulator-Aware Weight Quantization. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:9275-9291. Available from https://proceedings.mlr.press/v235/colbert24a.html.
