Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models

Istabrak Abbes; Gopeshh Subbaraj; Matthew Riemer; Nizar Islah; Tsuguchika Tabaru; Hiroaki Kingetsu; Sarath Chandar; Irina Rish

Proceedings of The 4th Conference on Lifelong Learning Agents, PMLR 330:465-486, 2026.

Abstract

Training large language models (LLMs) typically involves pretraining on massive corpora, only to restart the process entirely when new data becomes available. A more efficient and resource-conserving approach would be continual pretraining, where models are updated with new data rather than retraining from scratch. However, the introduction of new data often causes distribution shifts, leading to performance degradation on previously learned tasks. In this paper, we take a deeper look at two popular proposals for addressing this distribution shift within the continual learning literature: experience replay and gradient alignment. We consider continual pre-training of models within the Llama family of architectures at a large scale across languages with 100 billion tokens of training data in each language, finding that both replay and gradient alignment lead to more stable learning without forgetting. This conclusion holds both as we vary the model scale and as we vary the number and diversity of tasks. Moreover, we are the first to demonstrate the effectiveness of gradient alignment techniques in the context of LLM pretraining and propose an efficient implementation of meta-experience replay (MER) (Riemer et al., 2019) that imbues experience replay with the benefits of gradient alignment despite negligible compute and memory overhead. Our scaling analysis across model sizes and replay rates indicates that small rates of replaying old examples are definitely a more valuable use of compute than investing in model size, but that it is more compute efficient to scale the size of the model than invest in high rates of replaying old examples.

Cite this Paper

BibTeX

@InProceedings{pmlr-v330-abbes26a,
  title = 	 {Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models},
  author =       {Abbes, Istabrak and Subbaraj, Gopeshh and Riemer, Matthew and Islah, Nizar and Tabaru, Tsuguchika and Kingetsu, Hiroaki and Chandar, Sarath and Rish, Irina},
  booktitle = 	 {Proceedings of The 4th Conference on Lifelong Learning Agents},
  pages = 	 {465--486},
  year = 	 {2026},
  editor = 	 {Chandar, Sarath and Pascanu, Razvan and Eaton, Eric and Liu, Bing and Mahmood, Rupam and Rannen-Triki, Amal},
  volume = 	 {330},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {11--14 Aug},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v330/main/assets/abbes26a/abbes26a.pdf},
  url = 	 {https://proceedings.mlr.press/v330/abbes26a.html},
  abstract = 	 {Training large language models (LLMs) typically involves pretraining on massive corpora, only to restart the process entirely when new data becomes available.  A more efficient and resource-conserving approach would be continual pretraining, where models are updated with new data rather than retraining from scratch. However, the introduction of new data often causes distribution shifts, leading to performance degradation on previously learned tasks.  In this paper, we take a deeper look at two popular proposals for addressing this distribution shift within the continual learning literature: experience replay and gradient alignment. We consider continual pre-training of models within the Llama family of architectures at a large scale across languages with 100 billion tokens of training data in each language, finding that both replay and gradient alignment lead to more stable learning without forgetting. This conclusion holds both as we vary the model scale and as we vary the number and diversity of tasks. Moreover, we are the first to demonstrate the effectiveness of gradient alignment techniques in the context of LLM pretraining and propose an efficient implementation of meta-experience replay (MER) (Riemer et al., 2019) that imbues experience replay with the benefits of gradient alignment despite negligible compute and memory overhead. Our scaling analysis across model sizes and replay rates indicates that small rates of replaying old examples are definitely a more valuable use of compute than investing in model size, but that it is more compute efficient to scale the size of the model than invest in high rates of replaying old examples.}
}

Endnote

%0 Conference Paper
%T Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models
%A Istabrak Abbes
%A Gopeshh Subbaraj
%A Matthew Riemer
%A Nizar Islah
%A Tsuguchika Tabaru
%A Hiroaki Kingetsu
%A Sarath Chandar
%A Irina Rish
%B Proceedings of The 4th Conference on Lifelong Learning Agents
%C Proceedings of Machine Learning Research
%D 2026
%E Sarath Chandar
%E Razvan Pascanu
%E Eric Eaton
%E Bing Liu
%E Rupam Mahmood
%E Amal Rannen-Triki
%F pmlr-v330-abbes26a
%I PMLR
%P 465--486
%U https://proceedings.mlr.press/v330/abbes26a.html
%V 330
%X Training large language models (LLMs) typically involves pretraining on massive corpora, only to restart the process entirely when new data becomes available.  A more efficient and resource-conserving approach would be continual pretraining, where models are updated with new data rather than retraining from scratch. However, the introduction of new data often causes distribution shifts, leading to performance degradation on previously learned tasks.  In this paper, we take a deeper look at two popular proposals for addressing this distribution shift within the continual learning literature: experience replay and gradient alignment. We consider continual pre-training of models within the Llama family of architectures at a large scale across languages with 100 billion tokens of training data in each language, finding that both replay and gradient alignment lead to more stable learning without forgetting. This conclusion holds both as we vary the model scale and as we vary the number and diversity of tasks. Moreover, we are the first to demonstrate the effectiveness of gradient alignment techniques in the context of LLM pretraining and propose an efficient implementation of meta-experience replay (MER) (Riemer et al., 2019) that imbues experience replay with the benefits of gradient alignment despite negligible compute and memory overhead. Our scaling analysis across model sizes and replay rates indicates that small rates of replaying old examples are definitely a more valuable use of compute than investing in model size, but that it is more compute efficient to scale the size of the model than invest in high rates of replaying old examples.

APA

Abbes, I., Subbaraj, G., Riemer, M., Islah, N., Tabaru, T., Kingetsu, H., Chandar, S. & Rish, I. (2026). Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models. Proceedings of The 4th Conference on Lifelong Learning Agents, in Proceedings of Machine Learning Research 330:465-486. Available from https://proceedings.mlr.press/v330/abbes26a.html.
