Online Learning and Information Exponents: The Importance of Batch size & Time/Complexity Tradeoffs

Luca Arnaboldi, Yatin Dandi, Florent Krzakala, Bruno Loureiro, Luca Pesce, Ludovic Stephan
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:1730-1762, 2024.

Abstract

We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as characterized by the information exponents. We show that performing gradient updates with large batches $n_b \lesssim d^{\frac{\ell}{2}}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned and $d$ is the input dimension. However, batch sizes $n_b \gg d^{\frac{\ell}{2}}$ are detrimental for improving the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, Correlation loss SGD, which suppresses the auto-correlation terms in the loss function. We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments.
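The sketch below is not the paper's code; it is a minimal Python illustration, under stated assumptions, of the kind of training protocol the abstract describes: one-pass (online) SGD with a correlation-type loss, a single student neuron instead of the full two-layer network, a specific link function with information exponent $\ell = 2$, and illustrative hyperparameters. The overlap $m$ printed during training is the low-dimensional summary statistic that the ODE description tracks.

# Minimal sketch (not the authors' code): one-pass SGD with a correlation-type
# loss on a single-index target y = g(<w*, x>) with isotropic Gaussian covariates.
# All names and hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

d = 500                      # input dimension
n_b = 128                    # batch size (fresh samples each step: one-pass / online SGD)
lr = 0.5                     # learning rate
steps = 2000

g = lambda z: z ** 2 - 1     # example link function with information exponent ell = 2

w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)          # teacher direction
w = rng.standard_normal(d)
w /= np.linalg.norm(w)                    # student direction (spherical initialization)

for t in range(steps):
    X = rng.standard_normal((n_b, d))     # fresh batch: data is never reused
    y = g(X @ w_star)                     # teacher labels
    pre = X @ w                           # student pre-activations
    # Correlation loss: L = -E[y * g(pre)]; its gradient drops the g(pre)^2
    # auto-correlation term that a square loss would carry.
    grad = -(X.T @ (y * 2 * pre)) / n_b   # d/dw of -y * g(pre), using g'(z) = 2z
    w -= (lr / d) * grad
    w /= np.linalg.norm(w)                # keep the student on the sphere

    if t % 500 == 0:
        m = float(w @ w_star)             # overlap: the summary statistic tracked by the ODEs
        print(f"step {t:5d}  overlap m = {m:+.3f}")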

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-arnaboldi24a,
  title     = {Online Learning and Information Exponents: The Importance of Batch size & {T}ime/{C}omplexity Tradeoffs},
  author    = {Arnaboldi, Luca and Dandi, Yatin and Krzakala, Florent and Loureiro, Bruno and Pesce, Luca and Stephan, Ludovic},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {1730--1762},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/arnaboldi24a/arnaboldi24a.pdf},
  url       = {https://proceedings.mlr.press/v235/arnaboldi24a.html},
  abstract  = {We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as characterized by the information exponents. We show that performing gradient updates with large batches $n_b \lesssim d^{\frac{\ell}{2}}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned and $d$ is the input dimension. However, larger batch sizes than $n_b \gg d^{\frac{\ell}{2}}$ are detrimental for improving the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, Correlation loss SGD, which suppresses the auto-correlation terms in the loss function. We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments.}
}
Endnote
%0 Conference Paper
%T Online Learning and Information Exponents: The Importance of Batch size & Time/Complexity Tradeoffs
%A Luca Arnaboldi
%A Yatin Dandi
%A Florent Krzakala
%A Bruno Loureiro
%A Luca Pesce
%A Ludovic Stephan
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-arnaboldi24a
%I PMLR
%P 1730--1762
%U https://proceedings.mlr.press/v235/arnaboldi24a.html
%V 235
%X We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as characterized by the information exponents. We show that performing gradient updates with large batches $n_b \lesssim d^{\frac{\ell}{2}}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned and $d$ is the input dimension. However, larger batch sizes than $n_b \gg d^{\frac{\ell}{2}}$ are detrimental for improving the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, Correlation loss SGD, which suppresses the auto-correlation terms in the loss function. We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments.
APA
Arnaboldi, L., Dandi, Y., Krzakala, F., Loureiro, B., Pesce, L. & Stephan, L. (2024). Online Learning and Information Exponents: The Importance of Batch size & Time/Complexity Tradeoffs. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:1730-1762. Available from https://proceedings.mlr.press/v235/arnaboldi24a.html.