Understanding Synthetic Context Extension via Retrieval Heads

Xinyu Zhao, Fangcong Yin, Greg Durrett
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:77885-77910, 2025.

Abstract

Long-context LLMs are increasingly in demand for applications such as retrieval-augmented generation. To defray the cost of pretraining LLMs over long contexts, recent work takes an approach of synthetic context extension: fine-tuning LLMs with synthetically-generated long-context data. However, it remains unclear how and why this synthetic context extension imparts abilities for downstream long-context tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of "needle" concepts to be retrieved and diversity of the surrounding "haystack" context, from using LLMs to construct synthetic documents to using templated relations and creating symbolic datasets. Although models trained on synthetic data underperform models trained on the real data, the impacts of both training settings can be understood via a shared feature of the attention computation, retrieval heads (Wu et al., 2024). The retrieval heads learned from synthetic data have high overlap with retrieval heads learned on real data. Furthermore, there is a strong correlation between the recall of heads learned and the downstream performance of a model, allowing us to interpret and predict the performance of models trained in different settings. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world LLM capabilities over long contexts.
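
As a rough illustration of the retrieval-head analysis the abstract refers to (Wu et al., 2024), the sketch below computes a per-head retrieval score: the fraction of needle tokens a head "copies" by placing its argmax attention on the matching needle position while the model generates them. This is a minimal sketch assuming per-head attention weights have already been extracted; the function and variable names are hypothetical and not taken from the paper's code.

import numpy as np

def retrieval_score(attention, needle_span, generated_ids, context_ids):
    # attention     : [num_gen_steps, context_len] attention weights for one head
    # needle_span   : (start, end) indices of the needle inside the context
    # generated_ids : token ids produced by the model, one per generation step
    # context_ids   : token ids of the full input context
    #
    # A "copy" event, in the spirit of Wu et al. (2024): the generated token
    # appears in the needle, and the head's strongest attention at that step
    # lands on an occurrence of that token inside the needle span.
    start, end = needle_span
    needle_ids = context_ids[start:end]
    copied = 0
    for step, tok in enumerate(generated_ids):
        if tok not in needle_ids:
            continue
        argmax_pos = int(np.argmax(attention[step]))
        if start <= argmax_pos < end and context_ids[argmax_pos] == tok:
            copied += 1
    # Score = fraction of needle tokens the head copied during generation.
    return copied / max(len(needle_ids), 1)

# Toy usage with random attention weights (illustration only).
rng = np.random.default_rng(0)
ctx = rng.integers(0, 100, size=50)
gen = ctx[20:25]                       # pretend the model reproduced the needle
att = rng.random((len(gen), len(ctx)))
print(retrieval_score(att, (20, 25), gen, ctx))

Heads whose score stays high across examples are the "retrieval heads"; the paper's analysis compares which heads emerge under synthetic versus real fine-tuning data and relates their recall to downstream task performance.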

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhao25ad,
  title     = {Understanding Synthetic Context Extension via Retrieval Heads},
  author    = {Zhao, Xinyu and Yin, Fangcong and Durrett, Greg},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {77885--77910},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhao25ad/zhao25ad.pdf},
  url       = {https://proceedings.mlr.press/v267/zhao25ad.html},
  abstract  = {Long-context LLMs are increasingly in demand for applications such as retrieval-augmented generation. To defray the cost of pretraining LLMs over long contexts, recent work takes an approach of synthetic context extension: fine-tuning LLMs with synthetically-generated long-context data. However, it remains unclear how and why this synthetic context extension imparts abilities for downstream long-context tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of "needle" concepts to be retrieved and diversity of the surrounding "haystack" context, from using LLMs to construct synthetic documents to using templated relations and creating symbolic datasets. Although models trained on synthetic data underperform models trained on the real data, the impacts of both training settings can be understood via a shared feature of the attention computation, retrieval heads (Wu et al., 2024). The retrieval heads learned from synthetic data have high overlap with retrieval heads learned on real data. Furthermore, there is a strong correlation between the recall of heads learned and the downstream performance of a model, allowing us to interpret and predict the performance of models trained in different settings. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world LLM capabilities over long contexts.}
}
Endnote
%0 Conference Paper
%T Understanding Synthetic Context Extension via Retrieval Heads
%A Xinyu Zhao
%A Fangcong Yin
%A Greg Durrett
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhao25ad
%I PMLR
%P 77885--77910
%U https://proceedings.mlr.press/v267/zhao25ad.html
%V 267
%X Long-context LLMs are increasingly in demand for applications such as retrieval-augmented generation. To defray the cost of pretraining LLMs over long contexts, recent work takes an approach of synthetic context extension: fine-tuning LLMs with synthetically-generated long-context data. However, it remains unclear how and why this synthetic context extension imparts abilities for downstream long-context tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of "needle" concepts to be retrieved and diversity of the surrounding "haystack" context, from using LLMs to construct synthetic documents to using templated relations and creating symbolic datasets. Although models trained on synthetic data underperform models trained on the real data, the impacts of both training settings can be understood via a shared feature of the attention computation, retrieval heads (Wu et al., 2024). The retrieval heads learned from synthetic data have high overlap with retrieval heads learned on real data. Furthermore, there is a strong correlation between the recall of heads learned and the downstream performance of a model, allowing us to interpret and predict the performance of models trained in different settings. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world LLM capabilities over long contexts.
APA
Zhao, X., Yin, F. & Durrett, G. (2025). Understanding Synthetic Context Extension via Retrieval Heads. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:77885-77910. Available from https://proceedings.mlr.press/v267/zhao25ad.html.