Generalization Analysis of Stochastic Weight Averaging with General Sampling

Peng Wang, Li Shen, Zerui Tao, Shuaida He, Dacheng Tao
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:51442-51464, 2024.

Abstract

The stochastic weight averaging (SWA) method has empirically proven its advantages over stochastic gradient descent (SGD). Despite its widespread use, theoretical investigations have been limited, particularly beyond the ideal setting of convex objectives and sampling with replacement. However, non-convex objectives and sampling without replacement are very common in real-world applications. The main challenges under these settings are two-fold: (i) SWA brings all the historical gradient information into the analysis, whereas the uniform-stability analysis of SGD only requires bounding the current gradient. (ii) The $(1+\alpha\beta)$-expansion property makes the bound at each gradient step depend on the previous step, so the bounds on the historical gradients in SWA become nested and the theoretical analysis even harder. To address these challenges, we use mathematical induction to obtain a recursive representation that bounds the gradient at each step. Based on this, we establish stability bounds supporting sampling with and without replacement in the non-convex setting. Furthermore, the derived generalization bounds of SWA are sharper than those of SGD. Finally, experimental results on several benchmarks verify our theoretical results.
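
For intuition, the following is a minimal sketch (not from the paper) of the SWA update on a toy least-squares problem, contrasting sampling with and without replacement. The toy objective, variable names, and hyperparameters are illustrative assumptions only; the paper's analysis also covers the non-convex case.

    # Minimal SWA sketch on a toy least-squares problem (illustrative, not the paper's code).
    # SGD iterates are produced under either sampling scheme, and the SWA solution is the
    # running average of all iterates, so every historical step enters the analysis.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 256, 10
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.1 * rng.normal(size=n)

    def grad(w, i):
        # per-example gradient of 0.5 * (x_i^T w - y_i)^2
        return (X[i] @ w - y[i]) * X[i]

    def sgd_swa(T=2000, lr=0.01, with_replacement=True):
        w = np.zeros(d)          # SGD iterate w_t
        w_swa = np.zeros(d)      # running average (1/t) * sum_{s<=t} w_s
        order = rng.permutation(n)
        for t in range(1, T + 1):
            if with_replacement:
                i = rng.integers(n)              # sampling with replacement
            else:
                if (t - 1) % n == 0:             # reshuffle at each epoch boundary
                    order = rng.permutation(n)
                i = order[(t - 1) % n]           # sampling without replacement
            w = w - lr * grad(w, i)
            w_swa += (w - w_swa) / t             # incremental weight averaging
        return w, w_swa

    w_last, w_avg = sgd_swa(with_replacement=False)
    print("last-iterate error:", np.linalg.norm(w_last - w_true))
    print("SWA error:         ", np.linalg.norm(w_avg - w_true))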

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-wang24bl,
  title     = {Generalization Analysis of Stochastic Weight Averaging with General Sampling},
  author    = {Wang, Peng and Shen, Li and Tao, Zerui and He, Shuaida and Tao, Dacheng},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {51442--51464},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/wang24bl/wang24bl.pdf},
  url       = {https://proceedings.mlr.press/v235/wang24bl.html},
  abstract  = {The stochastic weight averaging (SWA) method has empirically proven its advantages over stochastic gradient descent (SGD). Despite its widespread use, theoretical investigations have been limited, particularly beyond the ideal setting of convex objectives and sampling with replacement. However, non-convex objectives and sampling without replacement are very common in real-world applications. The main challenges under these settings are two-fold: (i) SWA brings all the historical gradient information into the analysis, whereas the uniform-stability analysis of SGD only requires bounding the current gradient. (ii) The $(1+\alpha\beta)$-expansion property makes the bound at each gradient step depend on the previous step, so the bounds on the historical gradients in SWA become nested and the theoretical analysis even harder. To address these challenges, we use mathematical induction to obtain a recursive representation that bounds the gradient at each step. Based on this, we establish stability bounds supporting sampling with and without replacement in the non-convex setting. Furthermore, the derived generalization bounds of SWA are sharper than those of SGD. Finally, experimental results on several benchmarks verify our theoretical results.}
}
Endnote
%0 Conference Paper
%T Generalization Analysis of Stochastic Weight Averaging with General Sampling
%A Peng Wang
%A Li Shen
%A Zerui Tao
%A Shuaida He
%A Dacheng Tao
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-wang24bl
%I PMLR
%P 51442--51464
%U https://proceedings.mlr.press/v235/wang24bl.html
%V 235
%X The stochastic weight averaging (SWA) method has empirically proven its advantages over stochastic gradient descent (SGD). Despite its widespread use, theoretical investigations have been limited, particularly beyond the ideal setting of convex objectives and sampling with replacement. However, non-convex objectives and sampling without replacement are very common in real-world applications. The main challenges under these settings are two-fold: (i) SWA brings all the historical gradient information into the analysis, whereas the uniform-stability analysis of SGD only requires bounding the current gradient. (ii) The $(1+\alpha\beta)$-expansion property makes the bound at each gradient step depend on the previous step, so the bounds on the historical gradients in SWA become nested and the theoretical analysis even harder. To address these challenges, we use mathematical induction to obtain a recursive representation that bounds the gradient at each step. Based on this, we establish stability bounds supporting sampling with and without replacement in the non-convex setting. Furthermore, the derived generalization bounds of SWA are sharper than those of SGD. Finally, experimental results on several benchmarks verify our theoretical results.
APA
Wang, P., Shen, L., Tao, Z., He, S. & Tao, D. (2024). Generalization Analysis of Stochastic Weight Averaging with General Sampling. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:51442-51464. Available from https://proceedings.mlr.press/v235/wang24bl.html.
