Data Efficient Neural Scaling Law via Model Reusing

Peihao Wang, Rameswar Panda, Zhangyang Wang
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:36193-36204, 2023.

Abstract

The number of parameters in large transformers has been observed to grow exponentially. Despite notable performance improvements, concerns have been raised that such a growing model size will run out of data in the near future. As manifested in the neural scaling law, modern learning backbones are not data-efficient. To maintain the utility of the model capacity, training data should be increased proportionally. In this paper, we study the neural scaling law under the previously overlooked data scarcity regime, focusing on the more challenging situation where we need to train a gigantic model with a disproportionately limited supply of available training data. We find that the existing power laws underestimate the data inefficiency of large transformers. Their performance will drop significantly if the training set is insufficient. Fortunately, we discover another blessing - such a data-inefficient scaling law can be restored through a model reusing approach that warm-starts the training of a large model by initializing it using smaller models. Our empirical study shows that model reusing can effectively reproduce the power law under the data scarcity regime. When progressively applying model reusing to expand the model size, we also observe consistent performance improvement in large transformers. We release our code at: https://github.com/VITA-Group/Data-Efficient-Scaling.
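A minimal, hypothetical sketch of the warm-starting idea the abstract describes: initialize a larger PyTorch transformer by copying each parameter of a smaller, already-trained model into the overlapping slice of its larger counterpart, leaving the remaining entries at their random initialization. The expand_state_dict helper and the toy encoder widths below are illustrative assumptions, not the paper's actual model-reusing operator (see the released code for that).

import torch.nn as nn

def expand_state_dict(small_sd, large_sd):
    # Start from the large model's own (random) initialization and overwrite
    # the overlapping slice of every parameter that exists in both models.
    warm_sd = {name: tensor.clone() for name, tensor in large_sd.items()}
    for name, small_tensor in small_sd.items():
        if name not in warm_sd or small_tensor.dim() != warm_sd[name].dim():
            continue
        overlap = tuple(slice(0, min(s, l))
                        for s, l in zip(small_tensor.shape, warm_sd[name].shape))
        warm_sd[name][overlap] = small_tensor[overlap]
    return warm_sd

# Toy usage: warm-start a wider encoder from a narrower (pretrained) one.
small = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=128, nhead=4), num_layers=2)
large = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=256, nhead=8), num_layers=2)
large.load_state_dict(expand_state_dict(small.state_dict(), large.state_dict()))

Note that parameters whose layout is not a simple slice (e.g., the fused query/key/value projection in attention) would need more careful handling than this sketch provides.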

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-wang23aa, title = {Data Efficient Neural Scaling Law via Model Reusing}, author = {Wang, Peihao and Panda, Rameswar and Wang, Zhangyang}, booktitle = {Proceedings of the 40th International Conference on Machine Learning}, pages = {36193--36204}, year = {2023}, editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan}, volume = {202}, series = {Proceedings of Machine Learning Research}, month = {23--29 Jul}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v202/wang23aa/wang23aa.pdf}, url = {https://proceedings.mlr.press/v202/wang23aa.html}, abstract = {The number of parameters in large transformers has been observed to grow exponentially. Despite notable performance improvements, concerns have been raised that such a growing model size will run out of data in the near future. As manifested in the neural scaling law, modern learning backbones are not data-efficient. To maintain the utility of the model capacity, training data should be increased proportionally. In this paper, we study the neural scaling law under the previously overlooked data scarcity regime, focusing on the more challenging situation where we need to train a gigantic model with a disproportionately limited supply of available training data. We find that the existing power laws underestimate the data inefficiency of large transformers. Their performance will drop significantly if the training set is insufficient. Fortunately, we discover another blessing - such a data-inefficient scaling law can be restored through a model reusing approach that warm-starts the training of a large model by initializing it using smaller models. Our empirical study shows that model reusing can effectively reproduce the power law under the data scarcity regime. When progressively applying model reusing to expand the model size, we also observe consistent performance improvement in large transformers. We release our code at: https://github.com/VITA-Group/Data-Efficient-Scaling.} }
Endnote
%0 Conference Paper %T Data Efficient Neural Scaling Law via Model Reusing %A Peihao Wang %A Rameswar Panda %A Zhangyang Wang %B Proceedings of the 40th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2023 %E Andreas Krause %E Emma Brunskill %E Kyunghyun Cho %E Barbara Engelhardt %E Sivan Sabato %E Jonathan Scarlett %F pmlr-v202-wang23aa %I PMLR %P 36193--36204 %U https://proceedings.mlr.press/v202/wang23aa.html %V 202 %X The number of parameters in large transformers has been observed to grow exponentially. Despite notable performance improvements, concerns have been raised that such a growing model size will run out of data in the near future. As manifested in the neural scaling law, modern learning backbones are not data-efficient. To maintain the utility of the model capacity, training data should be increased proportionally. In this paper, we study the neural scaling law under the previously overlooked data scarcity regime, focusing on the more challenging situation where we need to train a gigantic model with a disproportionately limited supply of available training data. We find that the existing power laws underestimate the data inefficiency of large transformers. Their performance will drop significantly if the training set is insufficient. Fortunately, we discover another blessing - such a data-inefficient scaling law can be restored through a model reusing approach that warm-starts the training of a large model by initializing it using smaller models. Our empirical study shows that model reusing can effectively reproduce the power law under the data scarcity regime. When progressively applying model reusing to expand the model size, we also observe consistent performance improvement in large transformers. We release our code at: https://github.com/VITA-Group/Data-Efficient-Scaling.
APA
Wang, P., Panda, R. & Wang, Z. (2023). Data Efficient Neural Scaling Law via Model Reusing. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:36193-36204. Available from https://proceedings.mlr.press/v202/wang23aa.html.
