CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks

Jue Wang, Yucheng Lu, Binhang Yuan, Beidi Chen, Percy Liang, Christopher De Sa, Christopher Re, Ce Zhang
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:36058-36076, 2023.

Abstract

Distributed training of foundation models, especially large language models (LLMs), is communication-intensive and so has heavily relied on centralized data centers with fast interconnects. Can we train on slow networks and unlock the potential of decentralized infrastructure for foundation models? In this paper, we propose CocktailSGD, a novel communication-efficient training framework that combines three distinct compression techniques – random sparsification, top-K sparsification, and quantization – to achieve much greater compression than each individual technique alone. We justify the benefit of such a hybrid approach through a theoretical analysis of convergence. Empirically, we show that CocktailSGD achieves up to 117$\times$ compression in fine-tuning LLMs up to 20 billion parameters without hurting convergence. On a 500Mbps network, CocktailSGD only incurs $\sim$1.2$\times$ slowdown compared with data center networks.
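The abstract describes a hybrid compressor that chains random sparsification, top-K sparsification, and quantization. Below is a minimal, hypothetical PyTorch sketch of how such a chain could be composed for a parameter-update tensor; the fractions, bit-width, and function names are illustrative assumptions, and the sketch omits details of CocktailSGD's actual training loop (e.g., how compressed deltas are accumulated and exchanged among workers).

    import torch

    def cocktail_compress(delta, random_frac=0.1, topk_frac=0.1, num_bits=4):
        # Hypothetical hybrid compressor: random sparsification -> top-K -> quantization.
        # Fractions and bit-width are illustrative, not the paper's settings.
        flat = delta.flatten()
        n = flat.numel()

        # 1) Random sparsification: keep a random subset of coordinates.
        rand_idx = torch.randperm(n, device=flat.device)[: max(1, int(random_frac * n))]
        subset = flat[rand_idx]

        # 2) Top-K sparsification: within that subset, keep the largest-magnitude entries.
        k = max(1, int(topk_frac * subset.numel()))
        _, topk_pos = torch.topk(subset.abs(), k)
        idx = rand_idx[topk_pos]   # positions in the original flattened tensor
        vals = flat[idx]

        # 3) Uniform quantization of the surviving values to num_bits.
        scale = vals.abs().max().clamp(min=1e-12)
        levels = 2 ** (num_bits - 1) - 1
        q = torch.round(vals / scale * levels).to(torch.int8)
        return idx, q, scale

    def cocktail_decompress(idx, q, scale, shape, num_bits=4):
        # Rebuild a dense (mostly zero) tensor from the compressed message.
        levels = 2 ** (num_bits - 1) - 1
        out = torch.zeros(shape, device=q.device).flatten()
        out[idx] = q.float() / levels * scale
        return out.view(shape)

In a sketch like this, the savings of the three stages multiply (random fraction x top-K fraction x reduced bit-width), which illustrates why chaining them can yield far higher compression than any single technique alone, as the abstract claims.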

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-wang23t,
  title     = {{C}ocktail{SGD}: Fine-tuning Foundation Models over 500{M}bps Networks},
  author    = {Wang, Jue and Lu, Yucheng and Yuan, Binhang and Chen, Beidi and Liang, Percy and De Sa, Christopher and Re, Christopher and Zhang, Ce},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {36058--36076},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/wang23t/wang23t.pdf},
  url       = {https://proceedings.mlr.press/v202/wang23t.html},
  abstract  = {Distributed training of foundation models, especially large language models (LLMs), is communication-intensive and so has heavily relied on centralized data centers with fast interconnects. Can we train on slow networks and unlock the potential of decentralized infrastructure for foundation models? In this paper, we propose CocktailSGD, a novel communication-efficient training framework that combines three distinct compression techniques – random sparsification, top-K sparsification, and quantization – to achieve much greater compression than each individual technique alone. We justify the benefit of such a hybrid approach through a theoretical analysis of convergence. Empirically, we show that CocktailSGD achieves up to 117$\times$ compression in fine-tuning LLMs up to 20 billion parameters without hurting convergence. On a 500Mbps network, CocktailSGD only incurs $\sim$1.2$\times$ slowdown compared with data center networks.}
}
Endnote
%0 Conference Paper
%T CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks
%A Jue Wang
%A Yucheng Lu
%A Binhang Yuan
%A Beidi Chen
%A Percy Liang
%A Christopher De Sa
%A Christopher Re
%A Ce Zhang
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-wang23t
%I PMLR
%P 36058--36076
%U https://proceedings.mlr.press/v202/wang23t.html
%V 202
%X Distributed training of foundation models, especially large language models (LLMs), is communication-intensive and so has heavily relied on centralized data centers with fast interconnects. Can we train on slow networks and unlock the potential of decentralized infrastructure for foundation models? In this paper, we propose CocktailSGD, a novel communication-efficient training framework that combines three distinct compression techniques – random sparsification, top-K sparsification, and quantization – to achieve much greater compression than each individual technique alone. We justify the benefit of such a hybrid approach through a theoretical analysis of convergence. Empirically, we show that CocktailSGD achieves up to 117$\times$ compression in fine-tuning LLMs up to 20 billion parameters without hurting convergence. On a 500Mbps network, CocktailSGD only incurs $\sim$1.2$\times$ slowdown compared with data center networks.
APA
Wang, J., Lu, Y., Yuan, B., Chen, B., Liang, P., De Sa, C., Re, C. & Zhang, C. (2023). CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:36058-36076. Available from https://proceedings.mlr.press/v202/wang23t.html.