Demystifying SGD with Doubly Stochastic Gradients

Kyurae Kim, Joohwan Ko, Yian Ma, Jacob R. Gardner
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:24210-24247, 2024.

Abstract

Optimization objectives in the form of a sum of intractable expectations are rising in importance (e.g., diffusion models, variational autoencoders, and many more), a setting also known as "finite sum with infinite data." For these problems, a popular strategy is to employ SGD with doubly stochastic gradients (doubly SGD): the expectations are estimated using the gradient estimator of each component, while the sum is estimated by subsampling over these estimators. Despite its popularity, little is known about the convergence properties of doubly SGD, except under strong assumptions such as bounded variance. In this work, we establish the convergence of doubly SGD with independent minibatching and random reshuffling under general conditions, which encompass dependent component gradient estimators. In particular, for dependent estimators, our analysis allows a fine-grained analysis of the effect of correlations. As a result, under a per-iteration computational budget of $b \times m$, where $b$ is the minibatch size and $m$ is the number of Monte Carlo samples, our analysis suggests where one should invest most of the budget in general. Furthermore, we prove that random reshuffling (RR) improves the complexity dependence on the subsampling noise.
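
To make the setting concrete, the sketch below (not taken from the paper) shows one form of a doubly stochastic gradient step under stated assumptions: the objective is a finite sum of n intractable expectations, independent minibatching picks b components per iteration, and each selected component's gradient is estimated with m Monte Carlo samples, so the per-iteration budget is b × m. The quadratic components, the helper names (doubly_stochastic_grad, make_estimator), and the choices b=10, m=4 are hypothetical illustrations, not the authors' code.

import numpy as np

def doubly_stochastic_grad(x, grad_estimators, b, m, rng):
    """One doubly stochastic gradient: subsample b of the n components
    (independent minibatching), estimate each selected component's gradient
    with m Monte Carlo samples, and average the results."""
    n = len(grad_estimators)
    idx = rng.choice(n, size=b, replace=False)   # subsample the finite sum
    g = np.zeros_like(x)
    for i in idx:
        # grad_estimators[i](x, m, rng) returns an m-sample Monte Carlo
        # estimate of the i-th component's gradient (an intractable expectation)
        g += grad_estimators[i](x, m, rng)
    return g / b

# Hypothetical component: f_i(x) = E_z[0.5 * ||x - (mu_i + z)||^2], z ~ N(0, I),
# whose exact gradient is x - mu_i; the estimator below is unbiased for it.
def make_estimator(mu_i):
    def estimator(x, m, rng):
        z = rng.standard_normal((m, x.size))     # m Monte Carlo draws
        return np.mean(x - (mu_i + z), axis=0)   # average per-sample gradients
    return estimator

rng = np.random.default_rng(0)
mus = [rng.standard_normal(5) for _ in range(100)]
estimators = [make_estimator(mu) for mu in mus]

# Plain SGD loop driven by doubly stochastic gradients, budget b * m per step.
x, step = np.zeros(5), 0.1
for t in range(1000):
    x -= step * doubly_stochastic_grad(x, estimators, b=10, m=4, rng=rng)

The trade-off the paper analyzes lives in the last line: with the budget b * m fixed, one can shift samples between the subsampling side (b) and the Monte Carlo side (m), and the convergence results indicate how that split should generally be made.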

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-kim24r,
  title     = {Demystifying {SGD} with Doubly Stochastic Gradients},
  author    = {Kim, Kyurae and Ko, Joohwan and Ma, Yian and Gardner, Jacob R.},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {24210--24247},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/kim24r/kim24r.pdf},
  url       = {https://proceedings.mlr.press/v235/kim24r.html},
  abstract  = {Optimization objectives in the form of a sum of intractable expectations are rising in importance (e.g., diffusion models, variational autoencoders, and many more), a setting also known as "finite sum with infinite data." For these problems, a popular strategy is to employ SGD with doubly stochastic gradients (doubly SGD): the expectations are estimated using the gradient estimator of each component, while the sum is estimated by subsampling over these estimators. Despite its popularity, little is known about the convergence properties of doubly SGD, except under strong assumptions such as bounded variance. In this work, we establish the convergence of doubly SGD with independent minibatching and random reshuffling under general conditions, which encompass dependent component gradient estimators. In particular, for dependent estimators, our analysis allows a fine-grained analysis of the effect of correlations. As a result, under a per-iteration computational budget of $b \times m$, where $b$ is the minibatch size and $m$ is the number of Monte Carlo samples, our analysis suggests where one should invest most of the budget in general. Furthermore, we prove that random reshuffling (RR) improves the complexity dependence on the subsampling noise.}
}
Endnote
%0 Conference Paper
%T Demystifying SGD with Doubly Stochastic Gradients
%A Kyurae Kim
%A Joohwan Ko
%A Yian Ma
%A Jacob R. Gardner
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-kim24r
%I PMLR
%P 24210--24247
%U https://proceedings.mlr.press/v235/kim24r.html
%V 235
%X Optimization objectives in the form of a sum of intractable expectations are rising in importance (e.g., diffusion models, variational autoencoders, and many more), a setting also known as "finite sum with infinite data." For these problems, a popular strategy is to employ SGD with doubly stochastic gradients (doubly SGD): the expectations are estimated using the gradient estimator of each component, while the sum is estimated by subsampling over these estimators. Despite its popularity, little is known about the convergence properties of doubly SGD, except under strong assumptions such as bounded variance. In this work, we establish the convergence of doubly SGD with independent minibatching and random reshuffling under general conditions, which encompass dependent component gradient estimators. In particular, for dependent estimators, our analysis allows a fine-grained analysis of the effect of correlations. As a result, under a per-iteration computational budget of $b \times m$, where $b$ is the minibatch size and $m$ is the number of Monte Carlo samples, our analysis suggests where one should invest most of the budget in general. Furthermore, we prove that random reshuffling (RR) improves the complexity dependence on the subsampling noise.
APA
Kim, K., Ko, J., Ma, Y., & Gardner, J.R. (2024). Demystifying SGD with Doubly Stochastic Gradients. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:24210-24247. Available from https://proceedings.mlr.press/v235/kim24r.html.

Related Material