Random Reshuffling Dominates Stochastic Gradient Descent

Zijian Liu

Random Reshuffling Dominates Stochastic Gradient Descent

Zijian Liu

Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:4859-4882, 2026.

Abstract

Stochastic Gradient Descent ({\textsf{SGD}}) is one of the most classical optimization algorithms with favorable theoretical guarantees, yet the practical implementation of {\textsf{SGD}} differs subtly from its well-known form and is often referred to as Shuffling Stochastic Gradient Descent ({\textsf{Shuffling SGD}}). A particularly popular strategy in {\textsf{Shuffling SGD}} is Random Reshuffling ({\textsf{RR}}), which has achieved great empirical success across numerous experiments. Despite its strong performance, {\textsf{RR}} has long been considered a heuristic due to a lack of theoretical support. Over the last decade, people have finally established provable convergence rates for {\textsf{RR}}, thus justifying its observed superiority. However, for smooth convex optimization, two clouds over the convergence theory of {\textsf{RR}} remain to this day. More precisely, according to the current theory, {\textsf{Shuffling SGD}} under {\textsf{RR}} converges only when the stepsize is smaller than a threshold proportional to $1/n$, where $n$ is the number of summands in the objective (or the number of data points). Consequently, the optimally tuned theoretical rate of {\textsf{Shuffling SGD}} under {\textsf{RR}} is strictly worse than that of {\textsf{SGD}} when the number of epochs is smaller than another threshold proportional to $n$. These two restrictions heavily limit the applicability of existing theories and leave a critical mismatch with practice. In this work, for the first time, we prove that {\textsf{RR}} dominates {\textsf{SGD}} in smooth convex optimization under any reasonable stepsize after any finite number of epochs, thereby addressing a longstanding open question.

Cite this Paper

BibTeX

@InProceedings{pmlr-v336-liu26d,
  title = 	 {Random Reshuffling Dominates Stochastic Gradient Descent},
  author =       {Liu, Zijian},
  booktitle = 	 {Proceedings of Thirty Ninth Conference on Learning Theory},
  pages = 	 {4859--4882},
  year = 	 {2026},
  editor = 	 {Hanneke, Steve and Lattimore, Tor},
  volume = 	 {336},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {29 Jun--03 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v336/main/assets/liu26d/liu26d.pdf},
  url = 	 {https://proceedings.mlr.press/v336/liu26d.html},
  abstract = 	 {Stochastic Gradient Descent ({\textsf{SGD}}) is one of the most classical optimization algorithms with favorable theoretical guarantees, yet the practical implementation of {\textsf{SGD}} differs subtly from its well-known form and is often referred to as Shuffling Stochastic Gradient Descent ({\textsf{Shuffling SGD}}). A particularly popular strategy in {\textsf{Shuffling SGD}} is Random Reshuffling ({\textsf{RR}}), which has achieved great empirical success across numerous experiments. Despite its strong performance, {\textsf{RR}} has long been considered a heuristic due to a lack of theoretical support. Over the last decade, people have finally established provable convergence rates for {\textsf{RR}}, thus justifying its observed superiority. However, for smooth convex optimization, two clouds over the convergence theory of {\textsf{RR}} remain to this day. More precisely, according to the current theory, {\textsf{Shuffling SGD}} under {\textsf{RR}} converges only when the stepsize is smaller than a threshold proportional to $1/n$, where $n$ is the number of summands in the objective (or the number of data points). Consequently, the optimally tuned theoretical rate of {\textsf{Shuffling SGD}} under {\textsf{RR}} is strictly worse than that of {\textsf{SGD}} when the number of epochs is smaller than another threshold proportional to $n$. These two restrictions heavily limit the applicability of existing theories and leave a critical mismatch with practice. In this work, for the first time, we prove that {\textsf{RR}} dominates {\textsf{SGD}} in smooth convex optimization under any reasonable stepsize after any finite number of epochs, thereby addressing a longstanding open question.}
}

Endnote

%0 Conference Paper
%T Random Reshuffling Dominates Stochastic Gradient Descent
%A Zijian Liu
%B Proceedings of Thirty Ninth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2026
%E Steve Hanneke
%E Tor Lattimore	
%F pmlr-v336-liu26d
%I PMLR
%P 4859--4882
%U https://proceedings.mlr.press/v336/liu26d.html
%V 336
%X Stochastic Gradient Descent ({\textsf{SGD}}) is one of the most classical optimization algorithms with favorable theoretical guarantees, yet the practical implementation of {\textsf{SGD}} differs subtly from its well-known form and is often referred to as Shuffling Stochastic Gradient Descent ({\textsf{Shuffling SGD}}). A particularly popular strategy in {\textsf{Shuffling SGD}} is Random Reshuffling ({\textsf{RR}}), which has achieved great empirical success across numerous experiments. Despite its strong performance, {\textsf{RR}} has long been considered a heuristic due to a lack of theoretical support. Over the last decade, people have finally established provable convergence rates for {\textsf{RR}}, thus justifying its observed superiority. However, for smooth convex optimization, two clouds over the convergence theory of {\textsf{RR}} remain to this day. More precisely, according to the current theory, {\textsf{Shuffling SGD}} under {\textsf{RR}} converges only when the stepsize is smaller than a threshold proportional to $1/n$, where $n$ is the number of summands in the objective (or the number of data points). Consequently, the optimally tuned theoretical rate of {\textsf{Shuffling SGD}} under {\textsf{RR}} is strictly worse than that of {\textsf{SGD}} when the number of epochs is smaller than another threshold proportional to $n$. These two restrictions heavily limit the applicability of existing theories and leave a critical mismatch with practice. In this work, for the first time, we prove that {\textsf{RR}} dominates {\textsf{SGD}} in smooth convex optimization under any reasonable stepsize after any finite number of epochs, thereby addressing a longstanding open question.

APA

Liu, Z.. (2026). Random Reshuffling Dominates Stochastic Gradient Descent. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:4859-4882 Available from https://proceedings.mlr.press/v336/liu26d.html.

Related Material

Download PDF