1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed

Hanlin Tang; Shaoduo Gan; Ammar Ahmad Awan; Samyam Rajbhandari; Conglong Li; Xiangru Lian; Ji Liu; Ce Zhang; Yuxiong He

1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed

Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:10118-10129, 2021.

Abstract

Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on commodity systems with standard TCP interconnects that offer limited network bandwidth. Communication compression is an important technique to reduce training time on such systems. One of the most effective ways to compress communication is via error compensation compression, which offers robust convergence speed, even under 1-bit compression. However, state-of-the-art error compensation techniques only work with basic optimizers like SGD and momentum SGD, which are linearly dependent on the gradients. They do not work with non-linear gradient-based optimizers like Adam, which offer state-of-the-art convergence efficiency and accuracy for models like BERT. In this paper, we propose 1-bit Adam that reduces the communication volume by up to 5x, offers much better scalability, and provides the same convergence speed as uncompressed Adam. Our key finding is that Adam’s variance becomes stable (after a warmup phase) and can be used as a fixed precondition for the rest of the training (compression phase). We performed experiments on up to 256 GPUs and show that 1-bit Adam enables up to 3.3x higher throughput for BERT-Large pre-training and up to 2.9x higher throughput for SQuAD fine-tuning. In addition, we provide theoretical analysis for 1-bit Adam.

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-tang21a,
  title = 	 {1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed},
  author =       {Tang, Hanlin and Gan, Shaoduo and Awan, Ammar Ahmad and Rajbhandari, Samyam and Li, Conglong and Lian, Xiangru and Liu, Ji and Zhang, Ce and He, Yuxiong},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {10118--10129},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/tang21a/tang21a.pdf},
  url = 	 {https://proceedings.mlr.press/v139/tang21a.html},
  abstract = 	 {Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on commodity systems with standard TCP interconnects that offer limited network bandwidth. Communication compression is an important technique to reduce training time on such systems. One of the most effective ways to compress communication is via error compensation compression, which offers robust convergence speed, even under 1-bit compression. However, state-of-the-art error compensation techniques only work with basic optimizers like SGD and momentum SGD, which are linearly dependent on the gradients. They do not work with non-linear gradient-based optimizers like Adam, which offer state-of-the-art convergence efficiency and accuracy for models like BERT. In this paper, we propose 1-bit Adam that reduces the communication volume by up to 5x, offers much better scalability, and provides the same convergence speed as uncompressed Adam. Our key finding is that Adam’s variance becomes stable (after a warmup phase) and can be used as a fixed precondition for the rest of the training (compression phase). We performed experiments on up to 256 GPUs and show that 1-bit Adam enables up to 3.3x higher throughput for BERT-Large pre-training and up to 2.9x higher throughput for SQuAD fine-tuning. In addition, we provide theoretical analysis for 1-bit Adam.}
}

Endnote

%0 Conference Paper
%T 1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed
%A Hanlin Tang
%A Shaoduo Gan
%A Ammar Ahmad Awan
%A Samyam Rajbhandari
%A Conglong Li
%A Xiangru Lian
%A Ji Liu
%A Ce Zhang
%A Yuxiong He
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang	
%F pmlr-v139-tang21a
%I PMLR
%P 10118--10129
%U https://proceedings.mlr.press/v139/tang21a.html
%V 139
%X Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on commodity systems with standard TCP interconnects that offer limited network bandwidth. Communication compression is an important technique to reduce training time on such systems. One of the most effective ways to compress communication is via error compensation compression, which offers robust convergence speed, even under 1-bit compression. However, state-of-the-art error compensation techniques only work with basic optimizers like SGD and momentum SGD, which are linearly dependent on the gradients. They do not work with non-linear gradient-based optimizers like Adam, which offer state-of-the-art convergence efficiency and accuracy for models like BERT. In this paper, we propose 1-bit Adam that reduces the communication volume by up to 5x, offers much better scalability, and provides the same convergence speed as uncompressed Adam. Our key finding is that Adam’s variance becomes stable (after a warmup phase) and can be used as a fixed precondition for the rest of the training (compression phase). We performed experiments on up to 256 GPUs and show that 1-bit Adam enables up to 3.3x higher throughput for BERT-Large pre-training and up to 2.9x higher throughput for SQuAD fine-tuning. In addition, we provide theoretical analysis for 1-bit Adam.

APA

Tang, H., Gan, S., Awan, A.A., Rajbhandari, S., Li, C., Lian, X., Liu, J., Zhang, C. & He, Y.. (2021). 1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:10118-10129 Available from https://proceedings.mlr.press/v139/tang21a.html.

1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed

Abstract

Cite this Paper

Related Material