Maximizing Intermediate Checkpoint Value in LLM Pretraining with Bayesian Optimization

Deyuan Liu, Zecheng Wang, Bingning Wang, Weipeng Chen, Chunshan Li, Zhiying Tu, Dianhui Chu, Dianbo Sui
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:39713-39741, 2025.

Abstract

The rapid proliferation of large language models (LLMs), such as GPT-4 and Gemini, underscores the intense resource demands of their training, which incurs substantial computational and environmental costs. In this paper, we introduce a novel checkpoint merging strategy that makes efficient use of the intermediate checkpoints produced during LLM pretraining. The method merges intermediate checkpoints that share a training trajectory and searches an extensive space for the best merging weight via Bayesian optimization. Through various experiments, we demonstrate that: (1) our method can augment pretraining, offering substantial benefits at minimal additional cost; and (2) although it requires a held-out dataset, it still generalizes robustly across diverse domains, a pivotal property in pretraining.
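To make the idea concrete, the sketch below shows one way such a merge could be wired up. It is an illustrative sketch only, not the authors' implementation: it assumes the merge is a convex combination of two checkpoints' parameters governed by a single weight alpha, uses scikit-optimize's gp_minimize as the Bayesian-optimization backend, and relies on a hypothetical eval_held_out_loss callback that scores a merged state dict on the held-out data.

# Illustrative sketch (assumptions noted above, not the paper's exact procedure):
# merge two intermediate checkpoints parameter-wise and let Bayesian
# optimization pick the merging weight that minimizes held-out loss.
from skopt import gp_minimize  # Gaussian-process Bayesian optimization


def merge_checkpoints(state_a, state_b, alpha):
    # Convex combination of two state dicts (e.g., loaded with torch.load).
    return {k: alpha * state_a[k] + (1.0 - alpha) * state_b[k] for k in state_a}


def find_best_merge(state_a, state_b, eval_held_out_loss, n_calls=20):
    # eval_held_out_loss is a hypothetical callback: state dict -> scalar loss.
    def objective(params):
        alpha = params[0]
        return eval_held_out_loss(merge_checkpoints(state_a, state_b, alpha))

    result = gp_minimize(
        objective,
        dimensions=[(0.0, 1.0)],  # search space for the merging weight
        n_calls=n_calls,          # number of merge-and-evaluate trials
        random_state=0,
    )
    best_alpha = result.x[0]
    return merge_checkpoints(state_a, state_b, best_alpha), best_alpha

In this sketch, each trial only evaluates a merged model on the held-out set, so the search costs a handful of forward passes rather than any additional pretraining steps.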

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-liu25bv,
  title     = {Maximizing Intermediate Checkpoint Value in {LLM} Pretraining with {B}ayesian Optimization},
  author    = {Liu, Deyuan and Wang, Zecheng and Wang, Bingning and Chen, Weipeng and Li, Chunshan and Tu, Zhiying and Chu, Dianhui and Sui, Dianbo},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {39713--39741},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/liu25bv/liu25bv.pdf},
  url       = {https://proceedings.mlr.press/v267/liu25bv.html},
  abstract  = {The rapid proliferation of large language models (LLMs), such as GPT-4 and Gemini, underscores the intense demand for resources during their training processes, posing significant challenges due to substantial computational and environmental costs. In this paper, we introduce a novel checkpoint merging strategy aimed at making efficient use of intermediate checkpoints during LLM pretraining. This method utilizes intermediate checkpoints with shared training trajectories, and is rooted in an extensive search space exploration for the best merging weight via Bayesian optimization. Through various experiments, we demonstrate that: (1) Our proposed methodology exhibits the capacity to augment pretraining, presenting an opportunity akin to obtaining substantial benefits at minimal cost; (2) Our proposed methodology, despite requiring a given held-out dataset, still demonstrates robust generalization capabilities across diverse domains, a pivotal aspect in pretraining.}
}
Endnote
%0 Conference Paper
%T Maximizing Intermediate Checkpoint Value in LLM Pretraining with Bayesian Optimization
%A Deyuan Liu
%A Zecheng Wang
%A Bingning Wang
%A Weipeng Chen
%A Chunshan Li
%A Zhiying Tu
%A Dianhui Chu
%A Dianbo Sui
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-liu25bv
%I PMLR
%P 39713--39741
%U https://proceedings.mlr.press/v267/liu25bv.html
%V 267
%X The rapid proliferation of large language models (LLMs), such as GPT-4 and Gemini, underscores the intense demand for resources during their training processes, posing significant challenges due to substantial computational and environmental costs. In this paper, we introduce a novel checkpoint merging strategy aimed at making efficient use of intermediate checkpoints during LLM pretraining. This method utilizes intermediate checkpoints with shared training trajectories, and is rooted in an extensive search space exploration for the best merging weight via Bayesian optimization. Through various experiments, we demonstrate that: (1) Our proposed methodology exhibits the capacity to augment pretraining, presenting an opportunity akin to obtaining substantial benefits at minimal cost; (2) Our proposed methodology, despite requiring a given held-out dataset, still demonstrates robust generalization capabilities across diverse domains, a pivotal aspect in pretraining.
APA
Liu, D., Wang, Z., Wang, B., Chen, W., Li, C., Tu, Z., Chu, D. & Sui, D. (2025). Maximizing Intermediate Checkpoint Value in LLM Pretraining with Bayesian Optimization. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:39713-39741. Available from https://proceedings.mlr.press/v267/liu25bv.html.