Why Do You Grok? A Theoretical Analysis on Grokking Modular Addition
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:35934-35967, 2024.
Abstract
We present a theoretical explanation of the “grokking” phenomenon (Power et al., 2022), where a model generalizes long after overfitting, for the originally studied problem of modular addition. First, we show that early in gradient descent, when the “kernel regime” approximately holds, no permutation-equivariant model can achieve small population error on modular addition unless it sees at least a constant fraction of all possible data points. Eventually, however, models escape the kernel regime. We show that one-hidden-layer quadratic networks that achieve zero training loss with bounded $\ell_\infty$ norm generalize well with substantially fewer training points, and further show that such networks exist and can be found by gradient descent with small $\ell_\infty$ regularization. We also provide empirical evidence that these networks leave the kernel regime only after initially overfitting. Taken together, our results strongly support the case for grokking as a consequence of the transition from kernel-like behavior to the limiting behavior of gradient descent on deep networks.
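The setup described in the abstract is straightforward to instantiate. The sketch below is not the authors' code; it is a minimal illustration of the training regime the abstract names: modular addition with one-hot inputs, a one-hidden-layer network with quadratic activations, and full-batch gradient descent with a small penalty approximating the $\ell_\infty$ norm of the weights. The modulus `p`, hidden width, learning rate, regularization strength, training fraction, and the use of cross-entropy loss are all illustrative assumptions, not the paper's exact hyperparameters.

```python
# Minimal sketch (assumptions noted above) of grokking on modular addition:
# one-hidden-layer network with quadratic activations, full-batch gradient
# descent, and a small penalty approximating l_infinity regularization.
import torch

p = 23            # modulus (assumption)
width = 256       # hidden width (assumption)
frac_train = 0.4  # fraction of the p*p pairs used for training (assumption)

# All (a, b) pairs as concatenated one-hot vectors; labels are (a + b) mod p.
a, b = torch.meshgrid(torch.arange(p), torch.arange(p), indexing="ij")
a, b = a.flatten(), b.flatten()
X = torch.cat([torch.nn.functional.one_hot(a, p),
               torch.nn.functional.one_hot(b, p)], dim=1).float()
y = (a + b) % p

perm = torch.randperm(p * p)
n_train = int(frac_train * p * p)
train, test = perm[:n_train], perm[n_train:]

W1 = (0.1 * torch.randn(2 * p, width)).requires_grad_()
W2 = (0.1 * torch.randn(width, p)).requires_grad_()
opt = torch.optim.SGD([W1, W2], lr=0.05)  # full batch, so this is plain GD

def forward(x):
    # Quadratic activation: square the pre-activations of the hidden layer.
    return ((x @ W1) ** 2) @ W2

reg = 1e-3  # strength of the (approximate) l_infinity penalty (assumption)
for step in range(20001):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(forward(X[train]), y[train])
    # Approximate l_infinity regularization by penalizing the largest
    # absolute weight; autograd handles the max's subgradient.
    linf = torch.max(W1.abs().max(), W2.abs().max())
    (loss + reg * linf).backward()
    opt.step()
    if step % 2000 == 0:
        with torch.no_grad():
            acc = (forward(X[test]).argmax(1) == y[test]).float().mean()
        print(f"step {step}: train loss {loss.item():.4f}, test acc {acc.item():.3f}")
```

With a training fraction well below 1, a run of this kind typically fits the training set quickly while test accuracy lags, which is the qualitative behavior (overfitting followed by delayed generalization) that the abstract attributes to the transition out of the kernel regime.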