Random Shuffling Beats SGD after Finite Epochs
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:2624-2633, 2019.
Abstract
A long-standing problem in stochastic optimization is proving that RandomShuffle, the without-replacement version of SGD, converges faster than the usual with-replacement SGD. Building upon Gürbüzbalaban et al. (2015), we present the first (to our knowledge) non-asymptotic results for this problem by proving that after a reasonable number of epochs RandomShuffle converges faster than SGD. Specifically, we prove that for strongly convex, second-order smooth functions, the iterates of RandomShuffle converge to the optimal solution as $\mathcal{O}(1/T^2 + n^3/T^3)$, where $n$ is the number of components in the objective and $T$ is the number of iterations. This result implies that after $\mathcal{O}(\sqrt{n})$ epochs, RandomShuffle is strictly better than SGD (which converges as $\mathcal{O}(1/T)$). The key step toward showing this better dependence on $T$ is the introduction of $n$ into the bound; as our analysis shows, a dependence on $n$ is in general unavoidable without further changes. To understand how RandomShuffle works in practice, we further explore two empirically useful settings: data sparsity and overparameterization. For sparse data, RandomShuffle has the rate $\mathcal{O}(1/T^2)$, again strictly better than SGD. Under a setting closely related to overparameterization, RandomShuffle is shown to converge faster than SGD after any arbitrary number of iterations. Finally, we extend the analysis of RandomShuffle to smooth nonconvex and convex functions.
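The algorithmic distinction at issue is simply how each epoch samples component gradients: with replacement (i.i.d. index draws) versus without replacement (a fresh random permutation per epoch). The following minimal sketch, not from the paper, illustrates the two schemes on a least-squares objective; the data here are generated so the system is consistent, which loosely mirrors the overparameterization setting mentioned in the abstract. All names (`run`, `grad`, the step size) are illustrative choices, not the paper's.

```python
import numpy as np

# Strongly convex least-squares objective
#   f(x) = (1/n) * sum_i (a_i . x - b_i)^2 / 2,
# with b chosen so that every component gradient vanishes at the optimum
# (a consistent / interpolation-style setup).
rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star                      # consistent system: f is minimized at x_star

def grad(x, i):
    # Gradient of the i-th component (a_i . x - b_i)^2 / 2.
    return (A[i] @ x - b[i]) * A[i]

def run(num_epochs, shuffle, step=0.05):
    x = np.zeros(d)
    for _ in range(num_epochs):
        if shuffle:
            # RandomShuffle: each epoch visits every index once, in random order.
            order = rng.permutation(n)
        else:
            # Plain SGD: n i.i.d. index draws with replacement per epoch.
            order = rng.integers(0, n, size=n)
        for i in order:
            x = x - step * grad(x, i)
    return np.linalg.norm(x - x_star)

err_sgd = run(200, shuffle=False)   # with replacement
err_rs = run(200, shuffle=True)     # without replacement (RandomShuffle)
```

In this interpolation-style setup both variants converge with a constant step size; the paper's rate separation ($\mathcal{O}(1/T^2 + n^3/T^3)$ versus $\mathcal{O}(1/T)$) concerns the general strongly convex case, where per-component gradients do not vanish at the optimum.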