Training-Free Long-Context Scaling of Large Language Models

Chenxin An; Fei Huang; Jun Zhang; Shansan Gong; Xipeng Qiu; Chang Zhou; Lingpeng Kong

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:1493-1510, 2024.

Abstract

The ability of Large Language Models (LLMs) to process and generate coherent text is markedly weakened when the number of input tokens exceeds their pretraining length. Given the expensive overhead of finetuning large-scale models with longer sequences, we propose a training-free approach named Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of up to 100k tokens. By decomposing the attention computation for long sequences into chunk-based modules, DCA effectively captures the relative positional information of tokens within the same chunk (Intra-Chunk) and across distinct chunks (Inter-Chunk), and integrates seamlessly with Flash Attention. In addition to its impressive extrapolation capability, DCA achieves performance on practical long-context tasks that is comparable to or even better than that of models built through continual training. All code and data used in this work are released at https://github.com/HKUNLP/ChunkLlama.
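The chunk-based idea in the abstract can be illustrated with a small sketch. This is a simplified toy under stated assumptions, not the paper's exact formulation, which the abstract does not spell out. The function name `relative_positions`, its parameters, and the rule of pinning the query to a fixed position `max_pos` for cross-chunk pairs are all hypothetical choices made for illustration. The point it shows is that every relative position stays within a bounded range, however long the input grows.

```python
def relative_positions(seq_len, chunk_size, max_pos):
    """Toy relative-position matrix for causal, chunked attention.

    rel[i][j] is the relative position used when query i attends to key j,
    or None where j > i (masked by causality).
    All names and rules here are illustrative assumptions.
    """
    if max_pos < chunk_size - 1:
        raise ValueError("max_pos must be at least chunk_size - 1")

    rel = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if j > i:
                # Causal mask: a query cannot attend to future keys.
                row.append(None)
            elif i // chunk_size == j // chunk_size:
                # Intra-chunk: both tokens use their index within the chunk,
                # so the distance is the ordinary i - j.
                row.append(i % chunk_size - j % chunk_size)
            else:
                # Inter-chunk (assumed rule): the query is pinned to max_pos,
                # the key keeps its index within its own chunk.
                row.append(max_pos - j % chunk_size)
        rel.append(row)
    return rel


if __name__ == "__main__":
    for row in relative_positions(seq_len=8, chunk_size=4, max_pos=5):
        print(row)
```

In this toy, intra-chunk distances fall in `[0, chunk_size - 1]` and inter-chunk distances fall in `[max_pos - chunk_size + 1, max_pos]`, so no value exceeds `max_pos` regardless of `seq_len`.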

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-an24b,
  title     = {Training-Free Long-Context Scaling of Large Language Models},
  author    = {An, Chenxin and Huang, Fei and Zhang, Jun and Gong, Shansan and Qiu, Xipeng and Zhou, Chang and Kong, Lingpeng},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {1493--1510},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/an24b/an24b.pdf},
  url       = {https://proceedings.mlr.press/v235/an24b.html},
  abstract  = {The ability of Large Language Models (LLMs) to process and generate coherent text is markedly weakened when the number of input tokens exceeds their pretraining length. Given the expensive overhead of finetuning large-scale models with longer sequences, we propose a training-free approach named Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of up to 100k tokens. By decomposing the attention computation for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens within the same chunk (Intra-Chunk) and across distinct chunks (Inter-Chunk), as well as integrates seamlessly with Flash Attention. In addition to its impressive extrapolation capability, DCA achieves performance on practical long-context tasks that is comparable to or even better than that of models built through continual training. All code and data used in this work are released at https://github.com/HKUNLP/ChunkLlama.}
}

Endnote

%0 Conference Paper
%T Training-Free Long-Context Scaling of Large Language Models
%A Chenxin An
%A Fei Huang
%A Jun Zhang
%A Shansan Gong
%A Xipeng Qiu
%A Chang Zhou
%A Lingpeng Kong
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-an24b
%I PMLR
%P 1493--1510
%U https://proceedings.mlr.press/v235/an24b.html
%V 235
%X The ability of Large Language Models (LLMs) to process and generate coherent text is markedly weakened when the number of input tokens exceeds their pretraining length. Given the expensive overhead of finetuning large-scale models with longer sequences, we propose a training-free approach named Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of up to 100k tokens. By decomposing the attention computation for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens within the same chunk (Intra-Chunk) and across distinct chunks (Inter-Chunk), as well as integrates seamlessly with Flash Attention. In addition to its impressive extrapolation capability, DCA achieves performance on practical long-context tasks that is comparable to or even better than that of models built through continual training. All code and data used in this work are released at https://github.com/HKUNLP/ChunkLlama.

APA


An, C., Huang, F., Zhang, J., Gong, S., Qiu, X., Zhou, C. & Kong, L. (2024). Training-Free Long-Context Scaling of Large Language Models. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:1493-1510. Available from https://proceedings.mlr.press/v235/an24b.html.
