Why Do You Grok? A Theoretical Analysis on Grokking Modular Addition

Mohamad Amin Mohamadi, Zhiyuan Li, Lei Wu, Danica J. Sutherland
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:35934-35967, 2024.

Abstract

We present a theoretical explanation of the “grokking” phenomenon (Power et al., 2022), where a model generalizes long after overfitting, for the originally-studied problem of modular addition. First, we show that early in gradient descent, so that the “kernel regime” approximately holds, no permutation-equivariant model can achieve small population error on modular addition unless it sees at least a constant fraction of all possible data points. Eventually, however, models escape the kernel regime. We show that one-hidden-layer quadratic networks that achieve zero training loss with bounded $\ell_\infty$ norm generalize well with substantially fewer training points, and further show such networks exist and can be found by gradient descent with small $\ell_\infty$ regularization. We further provide empirical evidence that these networks leave the kernel regime only after initially overfitting. Taken together, our results strongly support the case for grokking as a consequence of the transition from kernel-like behavior to limiting behavior of gradient descent on deep networks.
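To make the setting concrete, below is a minimal NumPy sketch (not the authors' implementation) of the kind of experiment the abstract describes: a one-hidden-layer network with quadratic activations trained by full-batch gradient descent on modular addition, with a small weight-decay term standing in for the paper's $\ell_\infty$ regularization. The modulus p, width m, training fraction, and all hyperparameters below are illustrative choices and may need tuning before grokking-like delayed generalization appears; tracking train versus test accuracy over steps is what reveals it.

# Minimal sketch (assumptions: hyperparameters, weight decay in place of the
# paper's l_inf regularizer). One-hidden-layer quadratic network on (a+b) mod p.
import numpy as np

p = 59          # modulus
m = 256         # hidden width
frac = 0.5      # fraction of the p*p pairs used for training
lr, lam, steps = 0.5, 1e-4, 20000

rng = np.random.default_rng(0)

# All (a, b) pairs as concatenated one-hot vectors; labels are (a+b) mod p.
a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
a, b = a.ravel(), b.ravel()
X = np.zeros((p * p, 2 * p))
X[np.arange(p * p), a] = 1.0
X[np.arange(p * p), p + b] = 1.0
y = (a + b) % p
Y = np.eye(p)[y]

# Random train/test split over the p*p examples.
perm = rng.permutation(p * p)
tr, te = perm[: int(frac * p * p)], perm[int(frac * p * p):]

W1 = rng.normal(0, 1 / np.sqrt(2 * p), (m, 2 * p))
W2 = rng.normal(0, 1 / np.sqrt(m), (p, m))

def forward(X, W1, W2):
    Z = X @ W1.T           # pre-activations
    H = Z ** 2             # quadratic activation
    return Z, H, H @ W2.T  # logits

for step in range(steps):
    Z, H, logits = forward(X[tr], W1, W2)
    P = np.exp(logits - logits.max(1, keepdims=True))
    P /= P.sum(1, keepdims=True)
    dL = (P - Y[tr]) / len(tr)        # softmax cross-entropy gradient
    dW2 = dL.T @ H + lam * W2         # weight decay as a stand-in regularizer
    dZ = 2 * Z * (dL @ W2)
    dW1 = dZ.T @ X[tr] + lam * W1
    W2 -= lr * dW2
    W1 -= lr * dW1
    if step % 1000 == 0:
        acc_tr = (forward(X[tr], W1, W2)[2].argmax(1) == y[tr]).mean()
        acc_te = (forward(X[te], W1, W2)[2].argmax(1) == y[te]).mean()
        print(f"step {step:6d}  train acc {acc_tr:.3f}  test acc {acc_te:.3f}")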

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-mohamadi24a,
  title     = {Why Do You Grok? {A} Theoretical Analysis on Grokking Modular Addition},
  author    = {Mohamadi, Mohamad Amin and Li, Zhiyuan and Wu, Lei and Sutherland, Danica J.},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {35934--35967},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/mohamadi24a/mohamadi24a.pdf},
  url       = {https://proceedings.mlr.press/v235/mohamadi24a.html},
  abstract  = {We present a theoretical explanation of the “grokking” phenomenon (Power et al., 2022), where a model generalizes long after overfitting, for the originally-studied problem of modular addition. First, we show that early in gradient descent, so that the “kernel regime” approximately holds, no permutation-equivariant model can achieve small population error on modular addition unless it sees at least a constant fraction of all possible data points. Eventually, however, models escape the kernel regime. We show that one-hidden-layer quadratic networks that achieve zero training loss with bounded $\ell_\infty$ norm generalize well with substantially fewer training points, and further show such networks exist and can be found by gradient descent with small $\ell_\infty$ regularization. We further provide empirical evidence that these networks leave the kernel regime only after initially overfitting. Taken together, our results strongly support the case for grokking as a consequence of the transition from kernel-like behavior to limiting behavior of gradient descent on deep networks.}
}
Endnote
%0 Conference Paper
%T Why Do You Grok? A Theoretical Analysis on Grokking Modular Addition
%A Mohamad Amin Mohamadi
%A Zhiyuan Li
%A Lei Wu
%A Danica J. Sutherland
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-mohamadi24a
%I PMLR
%P 35934--35967
%U https://proceedings.mlr.press/v235/mohamadi24a.html
%V 235
%X We present a theoretical explanation of the “grokking” phenomenon (Power et al., 2022), where a model generalizes long after overfitting, for the originally-studied problem of modular addition. First, we show that early in gradient descent, so that the “kernel regime” approximately holds, no permutation-equivariant model can achieve small population error on modular addition unless it sees at least a constant fraction of all possible data points. Eventually, however, models escape the kernel regime. We show that one-hidden-layer quadratic networks that achieve zero training loss with bounded $\ell_\infty$ norm generalize well with substantially fewer training points, and further show such networks exist and can be found by gradient descent with small $\ell_\infty$ regularization. We further provide empirical evidence that these networks leave the kernel regime only after initially overfitting. Taken together, our results strongly support the case for grokking as a consequence of the transition from kernel-like behavior to limiting behavior of gradient descent on deep networks.
APA
Mohamadi, M. A., Li, Z., Wu, L., & Sutherland, D. J. (2024). Why Do You Grok? A Theoretical Analysis on Grokking Modular Addition. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:35934-35967. Available from https://proceedings.mlr.press/v235/mohamadi24a.html.
