What can grokking teach us about learning under non-stationarity?
Proceedings of The 4th Conference on Lifelong Learning Agents, PMLR 330:635-656, 2026.
Abstract
In continual learning problems, it is often necessary to overwrite components of a neural network’s learned representation in response to changes in the data stream; however, neural networks often exhibit \textit{primacy bias}, whereby early training data hinders the network’s ability to generalize on later tasks. While feature-learning dynamics in non-stationary learning problems are not well studied, the emergence of feature learning is known to drive the phenomenon of \textit{grokking}, wherein neural networks initially memorize their training data and only later exhibit perfect generalization. This work conjectures that the same feature-learning dynamics which facilitate generalization in grokking also underlie the ability to overwrite previously \textit{learned} features, and that methods which accelerate grokking by facilitating feature-learning dynamics are promising candidates for addressing primacy bias in non-stationary learning problems. We then propose a straightforward method to induce feature-learning dynamics as needed throughout training by increasing the \textit{effective} learning rate, i.e., the ratio between parameter and update norms. We show that this approach both facilitates feature learning and improves generalization in a variety of settings, including grokking, warm-starting neural network training, and reinforcement learning tasks.
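As a rough illustration of the kind of intervention the abstract describes, the sketch below computes one common notion of effective learning rate and raises it by shrinking the parameter norm, so that subsequent optimizer updates are larger relative to the parameters. The function names, the shrink factor, and the use of PyTorch are illustrative assumptions, not the paper's implementation; the paper's exact definition and procedure may differ.

```python
import torch


def effective_lr(params, updates):
    # One common notion of effective learning rate: the size of an update
    # relative to the size of the parameters it is applied to.
    # (Assumed definition for illustration; the paper's may differ.)
    upd_norm = torch.sqrt(sum(u.pow(2).sum() for u in updates))
    par_norm = torch.sqrt(sum(p.pow(2).sum() for p in params))
    return (upd_norm / par_norm).item()


@torch.no_grad()
def raise_effective_lr(model, shrink=0.5):
    # Shrinking the parameter norm makes subsequent updates larger relative
    # to the parameters, i.e. it raises the effective learning rate without
    # changing the nominal learning rate. `shrink` is a hypothetical
    # hyperparameter chosen here only for illustration.
    for p in model.parameters():
        p.mul_(shrink)
```

In a continual or reinforcement learning loop, one could imagine calling such a rescaling step when the data distribution shifts (e.g. at a task boundary) and then continuing training as usual; whether and how often to apply it is a design choice the paper itself would settle.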