PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization

Xinyi Wan, Penghui Qi, Guangxing Huang, Min Lin, Jialin Li
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:61942-61955, 2025.

Abstract

Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption, as the number of in-flight microbatches grows with the degree of PP. In this paper, we focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP. Through empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitations. Our experiments show that per-device activation memory effectively decreases with the total number of pipeline stages, making PP a stronger alternative to TP, offering up to a 19% acceleration with even lower memory consumption.
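
The core mechanism, offloading activations to host memory while overlapping the transfers with pipeline compute, can be illustrated with a minimal, hypothetical PyTorch sketch. The names here (e.g. ActivationOffloader) are illustrative only and do not reflect the paper's implementation: each microbatch's activations are copied to pinned host memory on a side CUDA stream right after its forward pass, and prefetched back to the GPU just before the matching backward pass, so that the copies hide behind compute.

    import torch

    class ActivationOffloader:
        # Hypothetical helper sketching per-microbatch activation offload for one
        # pipeline stage; not the PipeOffload implementation.
        def __init__(self):
            self.copy_stream = torch.cuda.Stream()  # side stream so copies overlap compute
            self.host_buffers = {}                  # microbatch id -> pinned CPU tensor

        def offload(self, mb_id, activation):
            # Device-to-host copy on the side stream; pinned memory enables an async transfer.
            # A real implementation must also keep `activation` alive until the copy
            # completes (e.g. via record_stream).
            self.copy_stream.wait_stream(torch.cuda.current_stream())
            with torch.cuda.stream(self.copy_stream):
                buf = torch.empty(activation.shape, dtype=activation.dtype,
                                  device="cpu", pin_memory=True)
                buf.copy_(activation, non_blocking=True)
            self.host_buffers[mb_id] = buf

        def reload(self, mb_id, device="cuda"):
            # Host-to-device prefetch, issued before the matching backward pass.
            with torch.cuda.stream(self.copy_stream):
                activation = self.host_buffers.pop(mb_id).to(device, non_blocking=True)
            torch.cuda.current_stream().wait_stream(self.copy_stream)
            return activation

Under the selective offload strategy described in the abstract, only a subset of activations (those that stay in flight the longest) would take this path when full offload cannot be hidden behind compute, which is what yields the better-than-linear reduction in peak activation memory.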

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-wan25c,
  title     = {{P}ipe{O}ffload: Improving Scalability of Pipeline Parallelism with Memory Optimization},
  author    = {Wan, Xinyi and Qi, Penghui and Huang, Guangxing and Lin, Min and Li, Jialin},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {61942--61955},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/wan25c/wan25c.pdf},
  url       = {https://proceedings.mlr.press/v267/wan25c.html},
  abstract  = {Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption, as the number of in-flight microbatches grows with the degree of PP. In this paper, we focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP. Through empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitations. Our experiments show that per-device activation memory effectively decreases with the total number of pipeline stages, making PP a stronger alternative to TP, offering up to a 19% acceleration with even lower memory consumption.}
}
Endnote
%0 Conference Paper
%T PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization
%A Xinyi Wan
%A Penghui Qi
%A Guangxing Huang
%A Min Lin
%A Jialin Li
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-wan25c
%I PMLR
%P 61942--61955
%U https://proceedings.mlr.press/v267/wan25c.html
%V 267
%X Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption, as the number of in-flight microbatches grows with the degree of PP. In this paper, we focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP. Through empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitations. Our experiments show that per-device activation memory effectively decreases with the total number of pipeline stages, making PP a stronger alternative to TP, offering up to a 19% acceleration with even lower memory consumption.
APA
Wan, X., Qi, P., Huang, G., Lin, M. & Li, J. (2025). PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:61942-61955. Available from https://proceedings.mlr.press/v267/wan25c.html.
