Online Learning and Information Exponents: The Importance of Batch size & Time/Complexity Tradeoffs
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:1730-1762, 2024.
Abstract
We study the impact of the batch size n_b on the iteration time T of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as quantified by the information exponent. We show that performing gradient updates with large batches n_b \lesssim d^{\frac{\ell}{2}} minimizes the training time without changing the total sample complexity, where \ell is the information exponent of the target to be learned and d is the input dimension. However, batch sizes larger than n_b \gg d^{\frac{\ell}{2}} are detrimental for further improving the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, Correlation loss SGD, which suppresses the auto-correlation terms in the loss function. We show that the training progress can be tracked by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments.
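To make the two protocols concrete, below is a minimal sketch (not the authors' code) of one-pass SGD on a two-layer network for a single-index target with information exponent \ell = 2, comparing the usual squared loss with a correlation loss that drops the f(x)^2 auto-correlation term, as described in the abstract. The dimensions, batch size, learning rate, Hermite link function, and all variable names are illustrative assumptions.

```python
# Hypothetical sketch: one-pass SGD on a two-layer network for a
# single-index target, with either the squared loss or a correlation
# loss that keeps only the -y*f(x) term (dropping the f(x)^2 term).
import numpy as np

rng = np.random.default_rng(0)
d, p, n_b, lr, steps = 256, 64, 1024, 0.5, 2000   # illustrative values

# Teacher direction and He_2 link (information exponent ell = 2).
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
target = lambda z: (z**2 - 1) / np.sqrt(2)        # second Hermite polynomial

# Student: f(x) = (1/p) * sum_j a_j * tanh(<w_j, x> / sqrt(d))
W = rng.standard_normal((p, d))                   # rows of norm ~ sqrt(d)
a = np.ones(p)

def forward(X, W, a):
    pre = X @ W.T / np.sqrt(d)                    # (n_b, p) pre-activations
    return np.tanh(pre) @ a / p, pre

def sgd_step(W, a, X, y, correlation_loss):
    f, pre = forward(X, W, a)
    # Squared loss: residual is (f - y); correlation loss keeps only the
    # -y*f term, so its "residual" is simply -y.
    resid = -y if correlation_loss else (f - y)
    dpre = (resid[:, None] * a[None, :] / p) * (1 - np.tanh(pre) ** 2)
    grad_W = dpre.T @ X / (np.sqrt(d) * n_b)
    return W - lr * grad_W

def overlap(W, w_star):
    # Largest alignment of a hidden unit with the teacher direction.
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return np.abs(Wn @ w_star).max()

for use_corr in (False, True):
    W_run = W.copy()
    for t in range(steps):
        X = rng.standard_normal((n_b, d))         # fresh batch every step (one-pass)
        y = target(X @ w_star)                    # single-index labels
        W_run = sgd_step(W_run, a, X, y, use_corr)
    print(f"correlation_loss={use_corr}: overlap={overlap(W_run, w_star):.3f}")
```

The overlap printed at the end is one possible low-dimensional summary statistic of the kind tracked by the ODE description mentioned in the abstract; the specific hyperparameters above are not taken from the paper.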