Unlock the Theory behind Scaling 1-bit Neural Networks

Majid Daliri; Zhao Song; Chiwun Yang

Conference on Parsimony and Learning, PMLR 280:545-598, 2025.

Abstract

Recently, 1-bit Large Language Models (LLMs) have emerged, showcasing an impressive combination of efficiency and performance that rivals traditional LLMs. Research by Wang et al. (2023) and Ma et al. (2024) indicates that the performance of these 1-bit LLMs progressively improves as the number of parameters increases, hinting at the potential existence of a *Scaling Law in 1-bit Neural Networks*. This paper presents the **first theoretical result** that rigorously establishes this scaling law for 1-bit models. We prove that, despite the constraint of weights restricted to $\{-1, +1\}$, the dynamics of model training inevitably align with kernel behavior as the network width grows. This theoretical breakthrough guarantees convergence of the 1-bit model to an arbitrarily small loss as width increases. Furthermore, we introduce the concept of the generalization difference, defined as the gap between the outputs of 1-bit networks and their full-precision counterparts, and demonstrate that this difference remains negligible as network width scales. Building on the work of Kaplan et al. (2020), we conclude by examining how the training loss scales as a power-law function of the model size, dataset size, and computational resources utilized for training. Our findings underscore the promising potential of scaling 1-bit neural networks, suggesting that int1 could become the standard in future neural network precision.

Cite this Paper

BibTeX

@InProceedings{pmlr-v280-daliri25a,
  title = 	 {Unlock the Theory behind Scaling 1-bit Neural Networks},
  author =       {Daliri, Majid and Song, Zhao and Yang, Chiwun},
  booktitle = 	 {Conference on Parsimony and Learning},
  pages = 	 {545--598},
  year = 	 {2025},
  editor = 	 {Chen, Beidi and Liu, Shijia and Pilanci, Mert and Su, Weijie and Sulam, Jeremias and Wang, Yuxiang and Zhu, Zhihui},
  volume = 	 {280},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {24--27 Mar},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v280/main/assets/daliri25a/daliri25a.pdf},
  url = 	 {https://proceedings.mlr.press/v280/daliri25a.html},
  abstract = 	 {Recently, 1-bit Large Language Models (LLMs) have emerged, showcasing an impressive combination of efficiency and performance that rivals traditional LLMs. Research by Wang et al. (2023); Ma et al. (2024) indicates that the performance of these 1-bit LLMs progressively improves as the number of parameters increases, hinting at the potential existence of a *Scaling Law in 1-bit Neural Networks*. This paper presents the **first theoretical result** that rigorously establishes this scaling law for 1-bit models. We prove that, despite the constraint of weights restricted to $\{-1, +1\}$, the dynamics of model training inevitably align with kernel behavior as the network width grows. This theoretical breakthrough guarantees convergence of the 1-bit model to an arbitrarily small loss as width increases. Furthermore, we introduce the concept of the generalization difference, defined as the gap between the outputs of 1-bit networks and their full-precision counterparts, and demonstrate that this difference maintains a negligible level as network width scales. Building on the work of Kaplan et al. (2020), we conclude by examining how the training loss scales as a power-law function of the model size, dataset size, and computational resources utilized for training. Our findings underscore the promising potential of scaling 1-bit neural networks, suggesting that int1 could become the standard in future neural network precision.}
}

Endnote

%0 Conference Paper
%T Unlock the Theory behind Scaling 1-bit Neural Networks
%A Majid Daliri
%A Zhao Song
%A Chiwun Yang
%B Conference on Parsimony and Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Beidi Chen
%E Shijia Liu
%E Mert Pilanci
%E Weijie Su
%E Jeremias Sulam
%E Yuxiang Wang
%E Zhihui Zhu
%F pmlr-v280-daliri25a
%I PMLR
%P 545--598
%U https://proceedings.mlr.press/v280/daliri25a.html
%V 280
%X Recently, 1-bit Large Language Models (LLMs) have emerged, showcasing an impressive combination of efficiency and performance that rivals traditional LLMs. Research by Wang et al. (2023); Ma et al. (2024) indicates that the performance of these 1-bit LLMs progressively improves as the number of parameters increases, hinting at the potential existence of a *Scaling Law in 1-bit Neural Networks*. This paper presents the **first theoretical result** that rigorously establishes this scaling law for 1-bit models. We prove that, despite the constraint of weights restricted to $\{-1, +1\}$, the dynamics of model training inevitably align with kernel behavior as the network width grows. This theoretical breakthrough guarantees convergence of the 1-bit model to an arbitrarily small loss as width increases. Furthermore, we introduce the concept of the generalization difference, defined as the gap between the outputs of 1-bit networks and their full-precision counterparts, and demonstrate that this difference maintains a negligible level as network width scales. Building on the work of Kaplan et al. (2020), we conclude by examining how the training loss scales as a power-law function of the model size, dataset size, and computational resources utilized for training. Our findings underscore the promising potential of scaling 1-bit neural networks, suggesting that int1 could become the standard in future neural network precision.

APA

Daliri, M., Song, Z. & Yang, C. (2025). Unlock the Theory behind Scaling 1-bit Neural Networks. Conference on Parsimony and Learning, in Proceedings of Machine Learning Research 280:545-598. Available from https://proceedings.mlr.press/v280/daliri25a.html.
