Direct Quantized Training of Language Models with Stochastic Rounding

Kaiyan Zhao, Tsuguchika Tabaru, Kenichi Kobayashi, Takumi Honda, Masafumi Yamazaki, Yoshimasa Tsuruoka
Proceedings of the 17th Asian Conference on Machine Learning, PMLR 304:1150-1165, 2025.

Abstract

Although recent quantized Large Language Models, such as BitNet, have paved the way for significant reduction in memory usage during deployment with binary or ternary weights, training these models still demands substantial memory footprints. This is partly because high-precision (i.e., unquantized) weights required for straight-through estimation must be maintained throughout the whole training process. To address this, we explore directly updating the quantized low-precision weights without relying on straight-through estimation during backpropagation, aiming to save memory usage during training. Specifically, we employ a stochastic rounding technique to minimize the information loss caused by the use of low-bit weights throughout training. Experimental results on our LLaMA-structured models of various sizes indicate that (1) training with only low-precision weights is feasible even when they are constrained to ternary values; (2) extending the bit width to 8 bits achieves performance on par with BitNet b1.58; (3) our models remain robust to precision scaling and memory reduction, showing minimal performance degradation when moving from FP32 to lower-memory environments (BF16/FP8); and (4) our models also support inference using ternary weights, showcasing their flexibility in deployment.
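For readers unfamiliar with stochastic rounding, the sketch below (assuming PyTorch) illustrates the general idea of keeping weights on a low-precision grid and rounding updated values back onto that grid probabilistically, rather than maintaining a high-precision master copy. This is a minimal, assumption-laden illustration of the technique named in the abstract, not the authors' implementation; the function names (stochastic_round, ternary_update), the ternary grid, and the scaling scheme are hypothetical choices for exposition.

import torch

def stochastic_round(x: torch.Tensor) -> torch.Tensor:
    # Round each element down or up with probability equal to its
    # fractional part, so the result is unbiased in expectation:
    # E[stochastic_round(x)] = x.
    floor = torch.floor(x)
    prob_up = x - floor                      # fractional part in [0, 1)
    return floor + (torch.rand_like(x) < prob_up).float()

def ternary_update(w_q: torch.Tensor, grad: torch.Tensor, lr: float, scale: float) -> torch.Tensor:
    # Hypothetical update step: weights live on a ternary grid {-1, 0, +1}
    # times `scale`; the gradient step is taken on dequantized values and
    # the result is stochastically rounded back onto the grid, so no
    # separate high-precision weight copy is stored between steps.
    w_real = w_q * scale - lr * grad
    return stochastic_round(w_real / scale).clamp_(-1, 1)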

Cite this Paper


BibTeX
@InProceedings{pmlr-v304-zhao25b,
  title = {Direct Quantized Training of Language Models with Stochastic Rounding},
  author = {Zhao, Kaiyan and Tabaru, Tsuguchika and Kobayashi, Kenichi and Honda, Takumi and Yamazaki, Masafumi and Tsuruoka, Yoshimasa},
  booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
  pages = {1150--1165},
  year = {2025},
  editor = {Lee, Hung-yi and Liu, Tongliang},
  volume = {304},
  series = {Proceedings of Machine Learning Research},
  month = {09--12 Dec},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v304/main/assets/zhao25b/zhao25b.pdf},
  url = {https://proceedings.mlr.press/v304/zhao25b.html},
  abstract = {Although recent quantized Large Language Models, such as BitNet, have paved the way for significant reduction in memory usage during deployment with binary or ternary weights, training these models still demands substantial memory footprints. This is partly because high-precision (i.e., unquantized) weights required for straight-through estimation must be maintained throughout the whole training process. To address this, we explore directly updating the quantized low-precision weights without relying on straight-through estimation during backpropagation, aiming to save memory usage during training. Specifically, we employ a stochastic rounding technique to minimize the information loss caused by the use of low-bit weights throughout training. Experimental results on our LLaMA-structured models of various sizes indicate that (1) training with only low-precision weights is feasible even when they are constrained to ternary values; (2) extending the bit width to 8 bits achieves performance on par with BitNet b1.58; (3) our models remain robust to precision scaling and memory reduction, showing minimal performance degradation when moving from FP32 to lower-memory environments (BF16/FP8); and (4) our models also support inference using ternary weights, showcasing their flexibility in deployment.}
}
Endnote
%0 Conference Paper
%T Direct Quantized Training of Language Models with Stochastic Rounding
%A Kaiyan Zhao
%A Tsuguchika Tabaru
%A Kenichi Kobayashi
%A Takumi Honda
%A Masafumi Yamazaki
%A Yoshimasa Tsuruoka
%B Proceedings of the 17th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Hung-yi Lee
%E Tongliang Liu
%F pmlr-v304-zhao25b
%I PMLR
%P 1150--1165
%U https://proceedings.mlr.press/v304/zhao25b.html
%V 304
%X Although recent quantized Large Language Models, such as BitNet, have paved the way for significant reduction in memory usage during deployment with binary or ternary weights, training these models still demands substantial memory footprints. This is partly because high-precision (i.e., unquantized) weights required for straight-through estimation must be maintained throughout the whole training process. To address this, we explore directly updating the quantized low-precision weights without relying on straight-through estimation during backpropagation, aiming to save memory usage during training. Specifically, we employ a stochastic rounding technique to minimize the information loss caused by the use of low-bit weights throughout training. Experimental results on our LLaMA-structured models of various sizes indicate that (1) training with only low-precision weights is feasible even when they are constrained to ternary values; (2) extending the bit width to 8 bits achieves performance on par with BitNet b1.58; (3) our models remain robust to precision scaling and memory reduction, showing minimal performance degradation when moving from FP32 to lower-memory environments (BF16/FP8); and (4) our models also support inference using ternary weights, showcasing their flexibility in deployment.
APA
Zhao, K., Tabaru, T., Kobayashi, K., Honda, T., Yamazaki, M. & Tsuruoka, Y. (2025). Direct Quantized Training of Language Models with Stochastic Rounding. Proceedings of the 17th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 304:1150-1165. Available from https://proceedings.mlr.press/v304/zhao25b.html.