[edit]
Random Reshuffling Dominates Stochastic Gradient Descent
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:4859-4882, 2026.
Abstract
Stochastic Gradient Descent ({\textsf{SGD}}) is one of the most classical optimization algorithms with favorable theoretical guarantees, yet the practical implementation of {\textsf{SGD}} differs subtly from its well-known form and is often referred to as Shuffling Stochastic Gradient Descent ({\textsf{Shuffling SGD}}). A particularly popular strategy in {\textsf{Shuffling SGD}} is Random Reshuffling ({\textsf{RR}}), which has achieved great empirical success across numerous experiments. Despite its strong performance, {\textsf{RR}} has long been considered a heuristic due to a lack of theoretical support. Over the last decade, people have finally established provable convergence rates for {\textsf{RR}}, thus justifying its observed superiority. However, for smooth convex optimization, two clouds over the convergence theory of {\textsf{RR}} remain to this day. More precisely, according to the current theory, {\textsf{Shuffling SGD}} under {\textsf{RR}} converges only when the stepsize is smaller than a threshold proportional to $1/n$, where $n$ is the number of summands in the objective (or the number of data points). Consequently, the optimally tuned theoretical rate of {\textsf{Shuffling SGD}} under {\textsf{RR}} is strictly worse than that of {\textsf{SGD}} when the number of epochs is smaller than another threshold proportional to $n$. These two restrictions heavily limit the applicability of existing theories and leave a critical mismatch with practice. In this work, for the first time, we prove that {\textsf{RR}} dominates {\textsf{SGD}} in smooth convex optimization under any reasonable stepsize after any finite number of epochs, thereby addressing a longstanding open question.