Error dynamics of mini-batch gradient descent with random reshuffling for least squares regression

Jackie Lok, Rishi Sonthalia, Elizaveta Rebrova
Proceedings of The 36th International Conference on Algorithmic Learning Theory, PMLR 272:736-770, 2025.

Abstract

We study the discrete dynamics of mini-batch gradient descent with random reshuffling for least squares regression. We show that the training and generalization errors depend on a sample cross-covariance matrix Z between the original features X and a set of new features X̃ in which each feature is modified by the mini-batches that appear before it during the learning process in an averaged way. Using this representation, we establish that the dynamics of mini-batch and full-batch gradient descent agree up to leading order with respect to the step size using the linear scaling rule. However, mini-batch gradient descent with random reshuffling exhibits a subtle dependence on the step size that a gradient flow analysis cannot detect, such as converging to a limit that depends on the step size. By comparing Z, a non-commutative polynomial of random matrices, with the sample covariance matrix of X asymptotically, we demonstrate that batching affects the dynamics by resulting in a form of shrinkage on the spectrum.
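
As a rough illustration of the setting the abstract describes (not code from the paper), the sketch below runs mini-batch gradient descent with random reshuffling on a synthetic least squares problem and compares it against full-batch gradient descent whose step size is scaled by the number of mini-batches per epoch, following the linear scaling rule. The problem sizes, step size, and function names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic least squares instance: y = X @ w_star + noise.
# All sizes below are illustrative, not the paper's setup.
n, d = 256, 32
batch_size, step_size, n_epochs = 16, 0.02, 100
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

def minibatch_gd_rr(X, y, batch_size, step_size, n_epochs):
    # Random reshuffling: each epoch draws a fresh permutation of the
    # samples, then sweeps once through the disjoint mini-batches.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the per-batch loss ||Xb @ w - yb||^2 / (2 |batch|).
            w -= step_size * Xb.T @ (Xb @ w - yb) / len(idx)
    return w

def fullbatch_gd(X, y, batch_size, step_size, n_epochs):
    # Full-batch baseline under the linear scaling rule: one step per
    # epoch, with the step size multiplied by the number of mini-batches.
    n, d = X.shape
    k = n // batch_size
    w = np.zeros(d)
    for _ in range(n_epochs):
        w -= k * step_size * X.T @ (X @ w - y) / n
    return w

w_mb = minibatch_gd_rr(X, y, batch_size, step_size, n_epochs)
w_fb = fullbatch_gd(X, y, batch_size, step_size, n_epochs)
print("distance between iterates:", np.linalg.norm(w_mb - w_fb))
print("mini-batch train MSE:", np.mean((X @ w_mb - y) ** 2))
print("full-batch train MSE:", np.mean((X @ w_fb - y) ** 2))

Under the linear scaling rule the two trajectories agree to leading order in the step size; the paper's point is that the mini-batch iterates retain a subtler step-size dependence (e.g., in their limit) that this first-order matching, and a gradient flow analysis, would miss.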

Cite this Paper


BibTeX
@InProceedings{pmlr-v272-lok25a,
  title     = {Error dynamics of mini-batch gradient descent with random reshuffling for least squares regression},
  author    = {Lok, Jackie and Sonthalia, Rishi and Rebrova, Elizaveta},
  booktitle = {Proceedings of The 36th International Conference on Algorithmic Learning Theory},
  pages     = {736--770},
  year      = {2025},
  editor    = {Kamath, Gautam and Loh, Po-Ling},
  volume    = {272},
  series    = {Proceedings of Machine Learning Research},
  month     = {24--27 Feb},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v272/main/assets/lok25a/lok25a.pdf},
  url       = {https://proceedings.mlr.press/v272/lok25a.html},
  abstract  = {We study the discrete dynamics of mini-batch gradient descent with random reshuffling for least squares regression. We show that the training and generalization errors depend on a sample cross-covariance matrix $Z$ between the original features $X$ and a set of new features $\widetilde{X}$ in which each feature is modified by the mini-batches that appear before it during the learning process in an averaged way. Using this representation, we establish that the dynamics of mini-batch and full-batch gradient descent agree up to leading order with respect to the step size using the linear scaling rule. However, mini-batch gradient descent with random reshuffling exhibits a subtle dependence on the step size that a gradient flow analysis cannot detect, such as converging to a limit that depends on the step size. By comparing $Z$, a non-commutative polynomial of random matrices, with the sample covariance matrix of $X$ asymptotically, we demonstrate that batching affects the dynamics by resulting in a form of shrinkage on the spectrum.}
}
Endnote
%0 Conference Paper
%T Error dynamics of mini-batch gradient descent with random reshuffling for least squares regression
%A Jackie Lok
%A Rishi Sonthalia
%A Elizaveta Rebrova
%B Proceedings of The 36th International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2025
%E Gautam Kamath
%E Po-Ling Loh
%F pmlr-v272-lok25a
%I PMLR
%P 736--770
%U https://proceedings.mlr.press/v272/lok25a.html
%V 272
%X We study the discrete dynamics of mini-batch gradient descent with random reshuffling for least squares regression. We show that the training and generalization errors depend on a sample cross-covariance matrix $Z$ between the original features $X$ and a set of new features $\widetilde{X}$ in which each feature is modified by the mini-batches that appear before it during the learning process in an averaged way. Using this representation, we establish that the dynamics of mini-batch and full-batch gradient descent agree up to leading order with respect to the step size using the linear scaling rule. However, mini-batch gradient descent with random reshuffling exhibits a subtle dependence on the step size that a gradient flow analysis cannot detect, such as converging to a limit that depends on the step size. By comparing $Z$, a non-commutative polynomial of random matrices, with the sample covariance matrix of $X$ asymptotically, we demonstrate that batching affects the dynamics by resulting in a form of shrinkage on the spectrum.
APA
Lok, J., Sonthalia, R., & Rebrova, E. (2025). Error dynamics of mini-batch gradient descent with random reshuffling for least squares regression. Proceedings of The 36th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 272:736-770. Available from https://proceedings.mlr.press/v272/lok25a.html.
