Deep Fusion: Efficient Network Training via Pre-trained Initializations

Hanna Mazzawi, Javier Gonzalvo, Michael Wunder, Sammy Jerome, Benoit Dherin
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:35225-35239, 2024.

Abstract

Training deep neural networks for large language models (LLMs) remains computationally very expensive. To mitigate this, network growing algorithms offer potential cost savings, but their underlying mechanisms are poorly understood. In this paper, we propose a theoretical framework using backward error analysis to illuminate the dynamics of mid-training network growth. Furthermore, we introduce Deep Fusion, an efficient network training approach that leverages pre-trained initializations of smaller networks, facilitating network growth from diverse sources. Our experiments validate the power of our theoretical framework in guiding the optimal use of Deep Fusion. With carefully optimized training dynamics, Deep Fusion demonstrates significant reductions in both training time and resource consumption. Importantly, these gains are achieved without sacrificing performance. We demonstrate reduced computational requirements, and improved generalization performance on a variety of NLP tasks and T5 model sizes.
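The abstract does not spell out how the fused initialization is constructed, so the snippet below is only a rough, hypothetical Python sketch of the general idea of growing a network from smaller pre-trained ones: two pre-trained weight matrices are placed on the diagonal blocks of a larger layer, with the cross blocks left at (or near) zero. The function name fuse_linear, the cross_scale parameter, and the block-diagonal scheme are illustrative assumptions, not the paper's published Deep Fusion operator.

    import numpy as np

    def fuse_linear(W_a, W_b, cross_scale=0.0, seed=0):
        """Initialize a larger weight matrix from two smaller pre-trained ones.

        The diagonal blocks copy the small models' weights; the off-diagonal
        blocks are zero (or small noise), so the grown layer initially behaves
        like the two source layers running side by side.
        """
        out_a, in_a = W_a.shape
        out_b, in_b = W_b.shape
        W = np.zeros((out_a + out_b, in_a + in_b), dtype=W_a.dtype)
        W[:out_a, :in_a] = W_a   # block copied from source model A
        W[out_a:, in_a:] = W_b   # block copied from source model B
        if cross_scale > 0.0:    # optionally seed weak cross-connections
            rng = np.random.default_rng(seed)
            W[:out_a, in_a:] = cross_scale * rng.standard_normal((out_a, in_b))
            W[out_a:, :in_a] = cross_scale * rng.standard_normal((out_b, in_a))
        return W

    # Example: grow an 8x8 layer from two pre-trained 4x4 layers.
    rng = np.random.default_rng(1)
    W_large = fuse_linear(rng.standard_normal((4, 4)), rng.standard_normal((4, 4)))
    print(W_large.shape)  # (8, 8)

In this toy setup the fused layer initially computes the two source layers in parallel; how the paper actually couples the sources and schedules subsequent training is governed by its backward-error-analysis framework, which the sketch does not attempt to reproduce.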

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-mazzawi24a,
  title     = {Deep Fusion: Efficient Network Training via Pre-trained Initializations},
  author    = {Mazzawi, Hanna and Gonzalvo, Javier and Wunder, Michael and Jerome, Sammy and Dherin, Benoit},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {35225--35239},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/mazzawi24a/mazzawi24a.pdf},
  url       = {https://proceedings.mlr.press/v235/mazzawi24a.html},
  abstract  = {Training deep neural networks for large language models (LLMs) remains computationally very expensive. To mitigate this, network growing algorithms offer potential cost savings, but their underlying mechanisms are poorly understood. In this paper, we propose a theoretical framework using backward error analysis to illuminate the dynamics of mid-training network growth. Furthermore, we introduce Deep Fusion, an efficient network training approach that leverages pre-trained initializations of smaller networks, facilitating network growth from diverse sources. Our experiments validate the power of our theoretical framework in guiding the optimal use of Deep Fusion. With carefully optimized training dynamics, Deep Fusion demonstrates significant reductions in both training time and resource consumption. Importantly, these gains are achieved without sacrificing performance. We demonstrate reduced computational requirements, and improved generalization performance on a variety of NLP tasks and T5 model sizes.}
}
Endnote
%0 Conference Paper
%T Deep Fusion: Efficient Network Training via Pre-trained Initializations
%A Hanna Mazzawi
%A Javier Gonzalvo
%A Michael Wunder
%A Sammy Jerome
%A Benoit Dherin
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-mazzawi24a
%I PMLR
%P 35225--35239
%U https://proceedings.mlr.press/v235/mazzawi24a.html
%V 235
%X Training deep neural networks for large language models (LLMs) remains computationally very expensive. To mitigate this, network growing algorithms offer potential cost savings, but their underlying mechanisms are poorly understood. In this paper, we propose a theoretical framework using backward error analysis to illuminate the dynamics of mid-training network growth. Furthermore, we introduce Deep Fusion, an efficient network training approach that leverages pre-trained initializations of smaller networks, facilitating network growth from diverse sources. Our experiments validate the power of our theoretical framework in guiding the optimal use of Deep Fusion. With carefully optimized training dynamics, Deep Fusion demonstrates significant reductions in both training time and resource consumption. Importantly, these gains are achieved without sacrificing performance. We demonstrate reduced computational requirements, and improved generalization performance on a variety of NLP tasks and T5 model sizes.
APA
Mazzawi, H., Gonzalvo, J., Wunder, M., Jerome, S., & Dherin, B. (2024). Deep Fusion: Efficient Network Training via Pre-trained Initializations. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:35225-35239. Available from https://proceedings.mlr.press/v235/mazzawi24a.html.
