Beyond Implicit Bias: The Insignificance of SGD Noise in Online Learning

Nikhil Vyas, Depen Morwani, Rosie Zhao, Gal Kaplun, Sham M. Kakade, Boaz Barak
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:49698-49716, 2024.

Abstract

The success of SGD in deep learning has been ascribed by prior works to the implicit bias induced by finite batch sizes ("SGD noise"). While prior works focused on offline learning (i.e., multiple-epoch training), we study the impact of SGD noise on online (i.e., single epoch) learning. Through an extensive empirical analysis of image and language data, we demonstrate that small batch sizes do not confer any implicit bias advantages in online learning. In contrast to offline learning, the benefits of SGD noise in online learning are strictly computational, facilitating more cost-effective gradient steps. This suggests that SGD in the online regime can be construed as taking noisy steps along the "golden path" of the noiseless gradient descent algorithm. We study this hypothesis and provide supporting evidence in loss and function space. Our findings challenge the prevailing understanding of SGD and offer novel insights into its role in online learning.
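To make the comparison concrete, below is a minimal, illustrative sketch of the kind of experiment the abstract describes: a single pass over a data stream (online learning), training once with a small batch and once with a large, near-noiseless batch, then checking how close the two solutions end up. The toy linear-regression setup, NumPy implementation, and all hyperparameters here are my own assumptions for illustration, not the paper's actual image/language experiments.

```python
# Sketch: online (single-epoch) training with small vs. large batch size.
# Toy linear regression with NumPy; every sample is seen exactly once.
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 50_000                       # dimension, length of the data stream
w_star = rng.normal(size=d)             # ground-truth weights generating the stream
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)

def online_sgd(batch_size, lr):
    """One pass over the stream: fresh data at every step, no repeated epochs."""
    w = np.zeros(d)
    for start in range(0, n, batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        grad = xb.T @ (xb @ w - yb) / len(yb)   # mean-squared-error gradient
        w -= lr * grad
    return w

w_small = online_sgd(batch_size=8,    lr=0.01)   # "noisy" small-batch SGD
w_large = online_sgd(batch_size=2048, lr=0.01)   # near-noiseless "golden path"

# Compare the two endpoints to each other and to the ground truth.
print("||w_small - w_large|| =", np.linalg.norm(w_small - w_large))
print("||w_small - w*||      =", np.linalg.norm(w_small - w_star))
print("||w_large - w*||      =", np.linalg.norm(w_large - w_star))
```

In this online regime the small-batch run has no additional epochs over which an implicit bias could accumulate, which is the intuition behind comparing it directly against the large-batch trajectory.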

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-vyas24a,
  title     = {Beyond Implicit Bias: The Insignificance of {SGD} Noise in Online Learning},
  author    = {Vyas, Nikhil and Morwani, Depen and Zhao, Rosie and Kaplun, Gal and Kakade, Sham M. and Barak, Boaz},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {49698--49716},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/vyas24a/vyas24a.pdf},
  url       = {https://proceedings.mlr.press/v235/vyas24a.html},
  abstract  = {The success of SGD in deep learning has been ascribed by prior works to the implicit bias induced by finite batch sizes (``SGD noise''). While prior works focused on offline learning (i.e., multiple-epoch training), we study the impact of SGD noise on online (i.e., single epoch) learning. Through an extensive empirical analysis of image and language data, we demonstrate that small batch sizes do not confer any implicit bias advantages in online learning. In contrast to offline learning, the benefits of SGD noise in online learning are strictly computational, facilitating more cost-effective gradient steps. This suggests that SGD in the online regime can be construed as taking noisy steps along the ``golden path'' of the noiseless gradient descent algorithm. We study this hypothesis and provide supporting evidence in loss and function space. Our findings challenge the prevailing understanding of SGD and offer novel insights into its role in online learning.}
}
Endnote
%0 Conference Paper
%T Beyond Implicit Bias: The Insignificance of SGD Noise in Online Learning
%A Nikhil Vyas
%A Depen Morwani
%A Rosie Zhao
%A Gal Kaplun
%A Sham M. Kakade
%A Boaz Barak
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-vyas24a
%I PMLR
%P 49698--49716
%U https://proceedings.mlr.press/v235/vyas24a.html
%V 235
%X The success of SGD in deep learning has been ascribed by prior works to the implicit bias induced by finite batch sizes ("SGD noise"). While prior works focused on offline learning (i.e., multiple-epoch training), we study the impact of SGD noise on online (i.e., single epoch) learning. Through an extensive empirical analysis of image and language data, we demonstrate that small batch sizes do not confer any implicit bias advantages in online learning. In contrast to offline learning, the benefits of SGD noise in online learning are strictly computational, facilitating more cost-effective gradient steps. This suggests that SGD in the online regime can be construed as taking noisy steps along the "golden path" of the noiseless gradient descent algorithm. We study this hypothesis and provide supporting evidence in loss and function space. Our findings challenge the prevailing understanding of SGD and offer novel insights into its role in online learning.
APA
Vyas, N., Morwani, D., Zhao, R., Kaplun, G., Kakade, S.M. & Barak, B. (2024). Beyond Implicit Bias: The Insignificance of SGD Noise in Online Learning. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:49698-49716. Available from https://proceedings.mlr.press/v235/vyas24a.html.