Grokking at the Edge of Linear Separability

Alon Beck, Noam Itzhak Levi, Yohai Bar-Sinai
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:3307-3334, 2025.

Abstract

We investigate the phenomenon of grokking – delayed generalization accompanied by non-monotonic test loss behavior – in a simple binary logistic classification task, for which "memorizing" and "generalizing" solutions can be strictly defined. Surprisingly, we find that grokking arises naturally even in this minimal model when the parameters of the problem are close to a critical point, and provide both empirical and analytical insights into its mechanism. Concretely, by appealing to the implicit bias of gradient descent, we show that logistic regression can exhibit grokking when the training dataset is nearly linearly separable from the origin and there is strong noise in the perpendicular directions. The underlying reason is that near the critical point, "flat" directions in the loss landscape with nearly zero gradient cause training dynamics to linger for arbitrarily long times near quasi-stable solutions before eventually reaching the global minimum. Finally, we highlight similarities between our findings and the recent literature, strengthening the conjecture that grokking generally occurs in proximity to the interpolation threshold, reminiscent of critical phenomena often observed in physical systems.
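To make the setup concrete, the following is a minimal, self-contained sketch (not the authors' code) of the kind of experiment the abstract describes: a linear logistic classifier trained by plain gradient descent on data with one weakly informative coordinate and strong Gaussian noise in the perpendicular coordinates. All constants (dimension, sample sizes, noise scale, learning rate, step count) are illustrative assumptions, and whether delayed generalization actually appears depends on how close the training set is to the separability threshold.

    import numpy as np
    from scipy.special import expit  # numerically stable sigmoid

    # Illustrative sketch of the abstract's setup: one informative direction,
    # strong noise in the perpendicular directions, logistic regression trained
    # by gradient descent while tracking train and test loss.
    rng = np.random.default_rng(0)
    d, n_train, n_test = 200, 40, 2000   # assumed dimensions and sample sizes
    signal, noise = 1.0, 5.0             # assumed signal / perpendicular-noise scales

    def make_data(n):
        y = rng.choice([-1.0, 1.0], size=n)
        x = noise * rng.standard_normal((n, d))
        x[:, 0] = signal * y + 0.1 * rng.standard_normal(n)  # nearly separable from the origin
        return x, y

    x_tr, y_tr = make_data(n_train)
    x_te, y_te = make_data(n_test)

    def logistic_loss(w, x, y):
        # mean of log(1 + exp(-y * <w, x>))
        return np.mean(np.logaddexp(0.0, -y * (x @ w)))

    w = np.zeros(d)
    lr = 0.1
    for t in range(200_001):
        m = y_tr * (x_tr @ w)
        grad = -(x_tr.T @ (y_tr * expit(-m))) / n_train  # gradient of the logistic loss
        w -= lr * grad
        if t % 20_000 == 0:
            print(f"step {t:7d}  train {logistic_loss(w, x_tr, y_tr):.4f}  "
                  f"test {logistic_loss(w, x_te, y_te):.4f}")

Tracking both losses over many steps is the point of the sketch: in the regime the paper studies, the training loss drops quickly toward an interpolating solution, while the test loss improves only much later, after the dynamics escape the quasi-stable "memorizing" solution.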

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-beck25a,
  title     = {Grokking at the Edge of Linear Separability},
  author    = {Beck, Alon and Levi, Noam Itzhak and Bar-Sinai, Yohai},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {3307--3334},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/beck25a/beck25a.pdf},
  url       = {https://proceedings.mlr.press/v267/beck25a.html}
}
Endnote
%0 Conference Paper
%T Grokking at the Edge of Linear Separability
%A Alon Beck
%A Noam Itzhak Levi
%A Yohai Bar-Sinai
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-beck25a
%I PMLR
%P 3307--3334
%U https://proceedings.mlr.press/v267/beck25a.html
%V 267
APA
Beck, A., Levi, N.I. & Bar-Sinai, Y. (2025). Grokking at the Edge of Linear Separability. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:3307-3334. Available from https://proceedings.mlr.press/v267/beck25a.html.