Random Reshuffling Dominates Stochastic Gradient Descent

Zijian Liu
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:4859-4882, 2026.

Abstract

Stochastic Gradient Descent ({\textsf{SGD}}) is one of the most classical optimization algorithms with favorable theoretical guarantees, yet the practical implementation of {\textsf{SGD}} differs subtly from its well-known form and is often referred to as Shuffling Stochastic Gradient Descent ({\textsf{Shuffling SGD}}). A particularly popular strategy in {\textsf{Shuffling SGD}} is Random Reshuffling ({\textsf{RR}}), which has achieved great empirical success across numerous experiments. Despite its strong performance, {\textsf{RR}} has long been considered a heuristic due to a lack of theoretical support. Over the last decade, people have finally established provable convergence rates for {\textsf{RR}}, thus justifying its observed superiority. However, for smooth convex optimization, two clouds over the convergence theory of {\textsf{RR}} remain to this day. More precisely, according to the current theory, {\textsf{Shuffling SGD}} under {\textsf{RR}} converges only when the stepsize is smaller than a threshold proportional to $1/n$, where $n$ is the number of summands in the objective (or the number of data points). Consequently, the optimally tuned theoretical rate of {\textsf{Shuffling SGD}} under {\textsf{RR}} is strictly worse than that of {\textsf{SGD}} when the number of epochs is smaller than another threshold proportional to $n$. These two restrictions heavily limit the applicability of existing theories and leave a critical mismatch with practice. In this work, for the first time, we prove that {\textsf{RR}} dominates {\textsf{SGD}} in smooth convex optimization under any reasonable stepsize after any finite number of epochs, thereby addressing a longstanding open question.

Cite this Paper


BibTeX
@InProceedings{pmlr-v336-liu26d, title = {Random Reshuffling Dominates Stochastic Gradient Descent}, author = {Liu, Zijian}, booktitle = {Proceedings of Thirty Ninth Conference on Learning Theory}, pages = {4859--4882}, year = {2026}, editor = {Hanneke, Steve and Lattimore, Tor}, volume = {336}, series = {Proceedings of Machine Learning Research}, month = {29 Jun--03 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v336/main/assets/liu26d/liu26d.pdf}, url = {https://proceedings.mlr.press/v336/liu26d.html}, abstract = {Stochastic Gradient Descent ({\textsf{SGD}}) is one of the most classical optimization algorithms with favorable theoretical guarantees, yet the practical implementation of {\textsf{SGD}} differs subtly from its well-known form and is often referred to as Shuffling Stochastic Gradient Descent ({\textsf{Shuffling SGD}}). A particularly popular strategy in {\textsf{Shuffling SGD}} is Random Reshuffling ({\textsf{RR}}), which has achieved great empirical success across numerous experiments. Despite its strong performance, {\textsf{RR}} has long been considered a heuristic due to a lack of theoretical support. Over the last decade, people have finally established provable convergence rates for {\textsf{RR}}, thus justifying its observed superiority. However, for smooth convex optimization, two clouds over the convergence theory of {\textsf{RR}} remain to this day. More precisely, according to the current theory, {\textsf{Shuffling SGD}} under {\textsf{RR}} converges only when the stepsize is smaller than a threshold proportional to $1/n$, where $n$ is the number of summands in the objective (or the number of data points). Consequently, the optimally tuned theoretical rate of {\textsf{Shuffling SGD}} under {\textsf{RR}} is strictly worse than that of {\textsf{SGD}} when the number of epochs is smaller than another threshold proportional to $n$. These two restrictions heavily limit the applicability of existing theories and leave a critical mismatch with practice. In this work, for the first time, we prove that {\textsf{RR}} dominates {\textsf{SGD}} in smooth convex optimization under any reasonable stepsize after any finite number of epochs, thereby addressing a longstanding open question.} }
Endnote
%0 Conference Paper %T Random Reshuffling Dominates Stochastic Gradient Descent %A Zijian Liu %B Proceedings of Thirty Ninth Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2026 %E Steve Hanneke %E Tor Lattimore %F pmlr-v336-liu26d %I PMLR %P 4859--4882 %U https://proceedings.mlr.press/v336/liu26d.html %V 336 %X Stochastic Gradient Descent ({\textsf{SGD}}) is one of the most classical optimization algorithms with favorable theoretical guarantees, yet the practical implementation of {\textsf{SGD}} differs subtly from its well-known form and is often referred to as Shuffling Stochastic Gradient Descent ({\textsf{Shuffling SGD}}). A particularly popular strategy in {\textsf{Shuffling SGD}} is Random Reshuffling ({\textsf{RR}}), which has achieved great empirical success across numerous experiments. Despite its strong performance, {\textsf{RR}} has long been considered a heuristic due to a lack of theoretical support. Over the last decade, people have finally established provable convergence rates for {\textsf{RR}}, thus justifying its observed superiority. However, for smooth convex optimization, two clouds over the convergence theory of {\textsf{RR}} remain to this day. More precisely, according to the current theory, {\textsf{Shuffling SGD}} under {\textsf{RR}} converges only when the stepsize is smaller than a threshold proportional to $1/n$, where $n$ is the number of summands in the objective (or the number of data points). Consequently, the optimally tuned theoretical rate of {\textsf{Shuffling SGD}} under {\textsf{RR}} is strictly worse than that of {\textsf{SGD}} when the number of epochs is smaller than another threshold proportional to $n$. These two restrictions heavily limit the applicability of existing theories and leave a critical mismatch with practice. In this work, for the first time, we prove that {\textsf{RR}} dominates {\textsf{SGD}} in smooth convex optimization under any reasonable stepsize after any finite number of epochs, thereby addressing a longstanding open question.
APA
Liu, Z.. (2026). Random Reshuffling Dominates Stochastic Gradient Descent. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:4859-4882 Available from https://proceedings.mlr.press/v336/liu26d.html.

Related Material