Why is parameter averaging beneficial in SGD? An objective smoothing perspective

Atsushi Nitanda, Ryuhei Kikuchi, Shugo Maeda, Denny Wu
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:3565-3573, 2024.

Abstract

It is often observed that stochastic gradient descent (SGD) and its variants implicitly select a solution with good generalization performance; such implicit bias is often characterized in terms of the sharpness of the minima. Kleinberg et al. (2018) connected this bias with the smoothing effect of SGD, which eliminates sharp local minima via convolution with the stochastic gradient noise. We follow this line of research and study the commonly used averaged SGD algorithm, which has been empirically observed in Izmailov et al. (2018) to prefer flat minima and therefore achieve better generalization. We prove that in certain problem settings, averaged SGD can efficiently optimize the smoothed objective, which avoids sharp local minima. In experiments, we verify our theory and show that parameter averaging with an appropriate step size indeed leads to a significant improvement in the performance of SGD.
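
For readers who want a concrete picture of the algorithm the abstract refers to, the following is a minimal sketch of SGD with tail parameter averaging (a running mean of the iterates, in the spirit of the stochastic weight averaging of Izmailov et al., 2018). It is an illustrative PyTorch sketch under generic assumptions, not the paper's implementation: the model, loss, data loader, step size, and averaging schedule (`avg_start`) are hypothetical placeholders.

```python
# Minimal sketch: SGD with a running (tail) average of the iterates.
# Placeholders: model, loss_fn, data_loader, lr, epochs, avg_start.
import copy
import torch

def averaged_sgd(model, loss_fn, data_loader, lr=0.1, epochs=10, avg_start=5):
    """Run plain SGD and maintain a running average of the parameters."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    avg_model = copy.deepcopy(model)   # holds the averaged parameters
    n_averaged = 0

    for epoch in range(epochs):
        for x, y in data_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

        # Start averaging only after a warm-up phase (tail averaging).
        if epoch >= avg_start:
            n_averaged += 1
            with torch.no_grad():
                for p_avg, p in zip(avg_model.parameters(), model.parameters()):
                    # Incremental mean: p_avg <- p_avg + (p - p_avg) / n
                    p_avg += (p - p_avg) / n_averaged

    # The averaged iterate is the object the paper analyzes: it is argued to
    # (approximately) optimize a smoothed version of the training objective.
    return avg_model
```

In practice, PyTorch's `torch.optim.swa_utils.AveragedModel` maintains the same kind of running average; the loop above is only meant to make the averaging step explicit.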

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-nitanda24a,
  title     = {Why is parameter averaging beneficial in {SGD}? An objective smoothing perspective},
  author    = {Nitanda, Atsushi and Kikuchi, Ryuhei and Maeda, Shugo and Wu, Denny},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages     = {3565--3573},
  year      = {2024},
  editor    = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume    = {238},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--04 May},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v238/nitanda24a/nitanda24a.pdf},
  url       = {https://proceedings.mlr.press/v238/nitanda24a.html},
  abstract  = {It is often observed that stochastic gradient descent (SGD) and its variants implicitly select a solution with good generalization performance; such implicit bias is often characterized in terms of the sharpness of the minima. Kleinberg et al. (2018) connected this bias with the smoothing effect of SGD which eliminates sharp local minima by the convolution using the stochastic gradient noise. We follow this line of research and study the commonly-used averaged SGD algorithm, which has been empirically observed in Izmailov et al. (2018) to prefer a flat minimum and therefore achieves better generalization. We prove that in certain problem settings, averaged SGD can efficiently optimize the smoothed objective which avoids sharp local minima. In experiments, we verify our theory and show that parameter averaging with an appropriate step size indeed leads to significant improvement in the performance of SGD.}
}
Endnote
%0 Conference Paper
%T Why is parameter averaging beneficial in SGD? An objective smoothing perspective
%A Atsushi Nitanda
%A Ryuhei Kikuchi
%A Shugo Maeda
%A Denny Wu
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li
%F pmlr-v238-nitanda24a
%I PMLR
%P 3565--3573
%U https://proceedings.mlr.press/v238/nitanda24a.html
%V 238
%X It is often observed that stochastic gradient descent (SGD) and its variants implicitly select a solution with good generalization performance; such implicit bias is often characterized in terms of the sharpness of the minima. Kleinberg et al. (2018) connected this bias with the smoothing effect of SGD which eliminates sharp local minima by the convolution using the stochastic gradient noise. We follow this line of research and study the commonly-used averaged SGD algorithm, which has been empirically observed in Izmailov et al. (2018) to prefer a flat minimum and therefore achieves better generalization. We prove that in certain problem settings, averaged SGD can efficiently optimize the smoothed objective which avoids sharp local minima. In experiments, we verify our theory and show that parameter averaging with an appropriate step size indeed leads to significant improvement in the performance of SGD.
APA
Nitanda, A., Kikuchi, R., Maeda, S. & Wu, D. (2024). Why is parameter averaging beneficial in SGD? An objective smoothing perspective. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:3565-3573. Available from https://proceedings.mlr.press/v238/nitanda24a.html.
