Efficient Long Context Fine-tuning with Chunk Flow

Xiulong Yuan, Hongtao Xu, Wenting Shen, Ang Wang, Xiafei Qiu, Jie Zhang, Yuqiong Liu, Bowen Yu, Junyang Lin, Mingzhen Li, Weile Jia, Yong Li, Wei Lin
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:73732-73742, 2025.

Abstract

Long context fine-tuning of large language models (LLMs) involves training on datasets that are predominantly composed of short sequences and a small proportion of longer sequences. However, existing approaches overlook this long-tail distribution and employ training strategies designed specifically for long sequences. Moreover, these approaches fail to address the challenges posed by variable sequence lengths during distributed training, such as load imbalance in data parallelism and severe pipeline bubbles in pipeline parallelism. These issues lead to suboptimal training performance and poor GPU resource utilization. To tackle these problems, we propose a chunk-centric training method named ChunkFlow. ChunkFlow reorganizes input sequences into uniformly sized chunks by consolidating short sequences and splitting longer ones. This approach achieves optimal computational efficiency and balance among training inputs. Additionally, ChunkFlow incorporates a state-aware chunk scheduling mechanism to ensure that the peak memory usage during training is primarily determined by the chunk size rather than the maximum sequence length in the dataset. Integrating this scheduling mechanism with existing pipeline scheduling algorithms further enhances the performance of distributed training. Experimental results demonstrate that, compared with Megatron-LM, ChunkFlow can be up to 4.53x faster in the long context fine-tuning of LLMs. Furthermore, we believe that ChunkFlow serves as an effective solution for a broader range of scenarios, such as long context continual pre-training, where datasets contain variable-length sequences.
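To make the chunk-reorganization idea concrete, the sketch below shows one plausible way to split long sequences into chunk-sized pieces and consolidate short sequences together, as the abstract describes. This is our illustration, not the authors' released code: `pack_into_chunks` and `chunk_size` are hypothetical names, and the sketch covers only the packing step, not ChunkFlow's state-aware scheduling.

```python
# Illustrative sketch only (not the authors' implementation): reorganize
# variable-length sequences into uniformly sized chunks by splitting long
# sequences and consolidating short ones.

def pack_into_chunks(seq_lengths, chunk_size):
    """Pack sequences into chunks holding at most `chunk_size` tokens.

    seq_lengths: mapping of sequence_id -> length in tokens.
    Returns a list of chunks; each chunk records (sequence_id, start, end)
    segments so that split sequences can later be processed in order.
    """
    # Step 1: split every sequence longer than chunk_size into pieces.
    pieces = []
    for seq_id, length in seq_lengths.items():
        for start in range(0, length, chunk_size):
            pieces.append((seq_id, start, min(start + chunk_size, length)))

    # Step 2: first-fit-decreasing packing, so short sequences (and tail
    # pieces of long ones) are consolidated into the same chunk.
    pieces.sort(key=lambda p: p[2] - p[1], reverse=True)
    chunks = []  # each chunk: {"segments": [(seq_id, start, end)], "used": int}
    for seq_id, start, end in pieces:
        size = end - start
        for chunk in chunks:
            if chunk["used"] + size <= chunk_size:
                break
        else:
            chunk = {"segments": [], "used": 0}
            chunks.append(chunk)
        chunk["segments"].append((seq_id, start, end))
        chunk["used"] += size
    return chunks

# Example: three short sequences and one long one, packed into 8-token chunks.
for i, c in enumerate(pack_into_chunks({"a": 3, "b": 2, "c": 3, "d": 14}, 8)):
    print(f"chunk {i}: {c['segments']} ({c['used']}/8 tokens)")
```

Note that the pieces of a split sequence must still be processed in order, with intermediate state carried across chunks; this is the role of ChunkFlow's state-aware chunk scheduling, which keeps peak memory tied to the chunk size rather than the longest sequence.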

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-yuan25m,
  title     = {Efficient Long Context Fine-tuning with Chunk Flow},
  author    = {Yuan, Xiulong and Xu, Hongtao and Shen, Wenting and Wang, Ang and Qiu, Xiafei and Zhang, Jie and Liu, Yuqiong and Yu, Bowen and Lin, Junyang and Li, Mingzhen and Jia, Weile and Li, Yong and Lin, Wei},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {73732--73742},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/yuan25m/yuan25m.pdf},
  url       = {https://proceedings.mlr.press/v267/yuan25m.html},
  abstract  = {Long context fine-tuning of large language models (LLMs) involves training on datasets that are predominantly composed of short sequences and a small proportion of longer sequences. However, existing approaches overlook this long-tail distribution and employ training strategies designed specifically for long sequences. Moreover, these approaches fail to address the challenges posed by variable sequence lengths during distributed training, such as load imbalance in data parallelism and severe pipeline bubbles in pipeline parallelism. These issues lead to suboptimal training performance and poor GPU resource utilization. To tackle these problems, we propose a chunk-centric training method named ChunkFlow. ChunkFlow reorganizes input sequences into uniformly sized chunks by consolidating short sequences and splitting longer ones. This approach achieves optimal computational efficiency and balance among training inputs. Additionally, ChunkFlow incorporates a state-aware chunk scheduling mechanism to ensure that the peak memory usage during training is primarily determined by the chunk size rather than the maximum sequence length in the dataset. Integrating this scheduling mechanism with existing pipeline scheduling algorithms further enhances the performance of distributed training. Experimental results demonstrate that, compared with Megatron-LM, ChunkFlow can be up to 4.53x faster in the long context fine-tuning of LLMs. Furthermore, we believe that ChunkFlow serves as an effective solution for a broader range of scenarios, such as long context continual pre-training, where datasets contain variable-length sequences.}
}
EndNote
%0 Conference Paper
%T Efficient Long Context Fine-tuning with Chunk Flow
%A Xiulong Yuan
%A Hongtao Xu
%A Wenting Shen
%A Ang Wang
%A Xiafei Qiu
%A Jie Zhang
%A Yuqiong Liu
%A Bowen Yu
%A Junyang Lin
%A Mingzhen Li
%A Weile Jia
%A Yong Li
%A Wei Lin
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-yuan25m
%I PMLR
%P 73732--73742
%U https://proceedings.mlr.press/v267/yuan25m.html
%V 267
%X Long context fine-tuning of large language models (LLMs) involves training on datasets that are predominantly composed of short sequences and a small proportion of longer sequences. However, existing approaches overlook this long-tail distribution and employ training strategies designed specifically for long sequences. Moreover, these approaches fail to address the challenges posed by variable sequence lengths during distributed training, such as load imbalance in data parallelism and severe pipeline bubbles in pipeline parallelism. These issues lead to suboptimal training performance and poor GPU resource utilization. To tackle these problems, we propose a chunk-centric training method named ChunkFlow. ChunkFlow reorganizes input sequences into uniformly sized chunks by consolidating short sequences and splitting longer ones. This approach achieves optimal computational efficiency and balance among training inputs. Additionally, ChunkFlow incorporates a state-aware chunk scheduling mechanism to ensure that the peak memory usage during training is primarily determined by the chunk size rather than the maximum sequence length in the dataset. Integrating this scheduling mechanism with existing pipeline scheduling algorithms further enhances the performance of distributed training. Experimental results demonstrate that, compared with Megatron-LM, ChunkFlow can be up to 4.53x faster in the long context fine-tuning of LLMs. Furthermore, we believe that ChunkFlow serves as an effective solution for a broader range of scenarios, such as long context continual pre-training, where datasets contain variable-length sequences.
APA
Yuan, X., Xu, H., Shen, W., Wang, A., Qiu, X., Zhang, J., Liu, Y., Yu, B., Lin, J., Li, M., Jia, W., Li, Y. & Lin, W. (2025). Efficient Long Context Fine-tuning with Chunk Flow. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:73732-73742. Available from https://proceedings.mlr.press/v267/yuan25m.html.