NExtLong: Toward Effective Long-Context Training without Long Documents

Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, Songlin Hu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:18566-18584, 2025.

Abstract

Large language models (LLMs) with extended context windows have made significant strides, yet training them remains challenging due to the scarcity of long documents. Existing methods synthesize long-context data but lack a clear mechanism for reinforcing long-range dependency modeling. To address this limitation, we propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension. NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora. This approach compels the model to discriminate long-range dependent context from distracting content, enhancing its ability to model long-range dependencies. Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks over existing long-context synthesis approaches and leading models trained on non-synthetic long documents. These findings highlight NExtLong’s ability to reduce reliance on non-synthetic long documents, making it an effective framework for developing advanced long-context LLMs.
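
To make the extension step concrete, below is a minimal sketch of the meta-chunking and hard-negative interleaving idea described in the abstract. It is not the authors' released implementation: the toy hashing embedder stands in for whatever retrieval model the paper uses, and the function names (embed, meta_chunks, synthesize_long_document), chunk size, and number of negatives per chunk are illustrative assumptions.

import numpy as np

def embed(text, dim=256):
    # Toy hashed bag-of-words embedding; a real pipeline would use a trained
    # text encoder to retrieve hard negatives from the pretraining corpus.
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def meta_chunks(document, chunk_tokens=128):
    # Decompose a document into fixed-size meta-chunks (size is illustrative).
    tokens = document.split()
    return [" ".join(tokens[i:i + chunk_tokens])
            for i in range(0, len(tokens), chunk_tokens)]

def synthesize_long_document(document, corpus_chunks, negatives_per_chunk=4):
    # Interleave each meta-chunk with hard negative distractors: corpus chunks
    # that look similar to the meta-chunk but come from other documents, so that
    # dependent meta-chunks end up separated by long stretches of distracting text.
    corpus_emb = np.stack([embed(c) for c in corpus_chunks])
    pieces = []
    for mc in meta_chunks(document):
        sims = corpus_emb @ embed(mc)                   # cosine similarity (unit vectors)
        hardest = np.argsort(-sims)[:negatives_per_chunk]
        pieces.append(mc)
        pieces.extend(corpus_chunks[i] for i in hardest)
    return "\n\n".join(pieces)

A real pipeline would additionally filter out distractor chunks drawn from the same source document and cap the synthesized sequence at the target context window before training.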

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-gao25n,
  title = {{NE}xt{L}ong: Toward Effective Long-Context Training without Long Documents},
  author = {Gao, Chaochen and Wu, Xing and Lin, Zijia and Zhang, Debing and Hu, Songlin},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {18566--18584},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/gao25n/gao25n.pdf},
  url = {https://proceedings.mlr.press/v267/gao25n.html},
  abstract = {Large language models (LLMs) with extended context windows have made significant strides, yet training them remains challenging due to the scarcity of long documents. Existing methods synthesize long-context data but lack a clear mechanism for reinforcing long-range dependency modeling. To address this limitation, we propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension. NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora. This approach compels the model to discriminate long-range dependent context from distracting content, enhancing its ability to model long-range dependencies. Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks over existing long-context synthesis approaches and leading models trained on non-synthetic long documents. These findings highlight NExtLong’s ability to reduce reliance on non-synthetic long documents, making it an effective framework for developing advanced long-context LLMs.}
}
Endnote
%0 Conference Paper
%T NExtLong: Toward Effective Long-Context Training without Long Documents
%A Chaochen Gao
%A Xing Wu
%A Zijia Lin
%A Debing Zhang
%A Songlin Hu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-gao25n
%I PMLR
%P 18566--18584
%U https://proceedings.mlr.press/v267/gao25n.html
%V 267
%X Large language models (LLMs) with extended context windows have made significant strides, yet training them remains challenging due to the scarcity of long documents. Existing methods synthesize long-context data but lack a clear mechanism for reinforcing long-range dependency modeling. To address this limitation, we propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension. NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora. This approach compels the model to discriminate long-range dependent context from distracting content, enhancing its ability to model long-range dependencies. Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks over existing long-context synthesis approaches and leading models trained on non-synthetic long documents. These findings highlight NExtLong’s ability to reduce reliance on non-synthetic long documents, making it an effective framework for developing advanced long-context LLMs.
APA
Gao, C., Wu, X., Lin, Z., Zhang, D. & Hu, S. (2025). NExtLong: Toward Effective Long-Context Training without Long Documents. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:18566-18584. Available from https://proceedings.mlr.press/v267/gao25n.html.
