Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints

Wenlong Mou, Liwei Wang, Xiyu Zhai, Kai Zheng
Proceedings of the 31st Conference On Learning Theory, PMLR 75:605-638, 2018.

Abstract

We study the generalization errors of \emph{non-convex} regularized ERM procedures trained with Stochastic Gradient Langevin Dynamics (SGLD). Two theories are developed through non-asymptotic, discrete-time analysis, based on algorithmic stability and PAC-Bayesian theory, respectively. The stability-based theory yields a bound of $O\left(\frac{1}{n}L\sqrt{\beta T_N}\right)$, where $L$ is the Lipschitz parameter, $\beta$ is the inverse temperature, and $T_N$ is the sum of the step sizes. For the PAC-Bayesian theory, although the bound has a slower $O(1/\sqrt{n})$ rate, the contribution of each step decays exponentially over time, and the uniform Lipschitz constant is replaced by the actual gradient norms along the optimization trajectory. Our bounds have reasonable dependence on the aggregated step sizes and do not explicitly depend on the dimension, norms, or other capacity measures of the parameter space. They characterize how the noise in the algorithm itself controls the statistical learning behavior in non-convex problems, without appealing to uniform convergence over the hypothesis space, which sheds light on the effect of the training algorithm on the generalization error of deep neural networks.
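As background (the recursion below is the standard SGLD update written in generic notation, not quoted from the paper): with step sizes $\eta_k$, inverse temperature $\beta$, and regularized empirical risk $F_S$, each SGLD iteration takes a stochastic gradient step and injects isotropic Gaussian noise,
\[
w_{k+1} \;=\; w_k \;-\; \eta_k\, g_k \;+\; \sqrt{\tfrac{2\eta_k}{\beta}}\,\xi_k, \qquad \mathbb{E}[g_k \mid w_k] = \nabla F_S(w_k), \qquad \xi_k \sim \mathcal{N}(0, I_d).
\]
After $N$ iterations, the aggregated step size $T_N = \sum_{k=1}^{N} \eta_k$ is the quantity entering the stability bound $O\left(\frac{1}{n}L\sqrt{\beta T_N}\right)$ stated above.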

Cite this Paper


BibTeX
@InProceedings{pmlr-v75-mou18a,
  title = {Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints},
  author = {Mou, Wenlong and Wang, Liwei and Zhai, Xiyu and Zheng, Kai},
  booktitle = {Proceedings of the 31st Conference On Learning Theory},
  pages = {605--638},
  year = {2018},
  editor = {Bubeck, Sébastien and Perchet, Vianney and Rigollet, Philippe},
  volume = {75},
  series = {Proceedings of Machine Learning Research},
  month = {06--09 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v75/mou18a/mou18a.pdf},
  url = {https://proceedings.mlr.press/v75/mou18a.html},
  abstract = {We study the generalization errors of \emph{non-convex} regularized ERM procedures using Stochastic Gradient Langevin Dynamics (SGLD). Two theories are proposed with non-asymptotic discrete-time analysis, using stability and PAC-Bayesian theory respectively. The stability-based theory obtains a bound of $O\left(\frac{1}{n}L\sqrt{\beta T_N}\right)$, where $L$ is Lipschitz parameter, $\beta$ is inverse temperature, and $T_N$ is the sum of step sizes. For PAC-Bayesian theory, though the bound has a slower $O(1/\sqrt{n})$ rate, the contribution of each step decays exponentially through time, and the uniform Lipschitz constant is also replaced by actual norms of gradients along the optimization trajectory. Our bounds have reasonable dependence on aggregated step sizes, and do not explicitly depend on dimensions, norms or other capacity measures of the parameter. The bounds characterize how the noises in the algorithm itself controls the statistical learning behavior in non-convex problems, without uniform convergence in the hypothesis space, which sheds light on the effect of training algorithms on the generalization error for deep neural networks.}
}
Endnote
%0 Conference Paper
%T Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints
%A Wenlong Mou
%A Liwei Wang
%A Xiyu Zhai
%A Kai Zheng
%B Proceedings of the 31st Conference On Learning Theory
%C Proceedings of Machine Learning Research
%D 2018
%E Sébastien Bubeck
%E Vianney Perchet
%E Philippe Rigollet
%F pmlr-v75-mou18a
%I PMLR
%P 605--638
%U https://proceedings.mlr.press/v75/mou18a.html
%V 75
%X We study the generalization errors of \emph{non-convex} regularized ERM procedures using Stochastic Gradient Langevin Dynamics (SGLD). Two theories are proposed with non-asymptotic discrete-time analysis, using stability and PAC-Bayesian theory respectively. The stability-based theory obtains a bound of $O\left(\frac{1}{n}L\sqrt{\beta T_N}\right)$, where $L$ is Lipschitz parameter, $\beta$ is inverse temperature, and $T_N$ is the sum of step sizes. For PAC-Bayesian theory, though the bound has a slower $O(1/\sqrt{n})$ rate, the contribution of each step decays exponentially through time, and the uniform Lipschitz constant is also replaced by actual norms of gradients along the optimization trajectory. Our bounds have reasonable dependence on aggregated step sizes, and do not explicitly depend on dimensions, norms or other capacity measures of the parameter. The bounds characterize how the noises in the algorithm itself controls the statistical learning behavior in non-convex problems, without uniform convergence in the hypothesis space, which sheds light on the effect of training algorithms on the generalization error for deep neural networks.
APA
Mou, W., Wang, L., Zhai, X. & Zheng, K. (2018). Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints. Proceedings of the 31st Conference On Learning Theory, in Proceedings of Machine Learning Research 75:605-638. Available from https://proceedings.mlr.press/v75/mou18a.html.