Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT

Jon Saad-Falcon; Daniel Y Fu; Simran Arora; Neel Guha; Christopher Re

Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT

Jon Saad-Falcon, Daniel Y Fu, Simran Arora, Neel Guha, Christopher Re

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42918-42946, 2024.

Abstract

Retrieval pipelines are an integral component of many machine learning systems. However, they perform poorly in domains where documents are long (e.g., 10K tokens or more) and where identifying the relevant document requires synthesizing information across the entire text. Developing long-context retrieval encoders suitable for these domains raises three challenges: (1) how to evaluate long-context retrieval performance, (2) how to pretrain a base language model to represent both short contexts (corresponding to queries) and long contexts (corresponding to documents), and (3) how to finetune this model for retrieval under the batch size limitations imposed by GPU memory constraints. To address these challenges, we first introduce LoCoV1, a 12 task benchmark constructed to measure long-context retrieval where chunking is not possible or not effective. We next present the M2-BERT retrieval encoder, an 80M parameter state-space encoder model built from the Monarch Mixer architecture, capable of scaling to documents up to 32K tokens long. We describe a pretraining data mixture which allows this encoder to process both short and long context sequences, and a finetuning approach that adapts this base model to retrieval with only single-sample batches. Finally, we validate the M2-BERT retrieval encoder on LoCoV1, finding that it outperforms competitive Transformer-based models by at least 22.2 points, despite containing 90× fewer parameters.

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-saad-falcon24a,
  title = 	 {Benchmarking and Building Long-Context Retrieval Models with {L}o{C}o and M2-{BERT}},
  author =       {Saad-Falcon, Jon and Fu, Daniel Y and Arora, Simran and Guha, Neel and Re, Christopher},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {42918--42946},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/saad-falcon24a/saad-falcon24a.pdf},
  url = 	 {https://proceedings.mlr.press/v235/saad-falcon24a.html},
  abstract = 	 {Retrieval pipelines are an integral component of many machine learning systems. However, they perform poorly in domains where documents are long (e.g., 10K tokens or more) and where identifying the relevant document requires synthesizing information across the entire text. Developing long-context retrieval encoders suitable for these domains raises three challenges: (1) how to evaluate long-context retrieval performance, (2) how to pretrain a base language model to represent both short contexts (corresponding to queries) and long contexts (corresponding to documents), and (3) how to finetune this model for retrieval under the batch size limitations imposed by GPU memory constraints. To address these challenges, we first introduce LoCoV1, a 12 task benchmark constructed to measure long-context retrieval where chunking is not possible or not effective. We next present the M2-BERT retrieval encoder, an 80M parameter state-space encoder model built from the Monarch Mixer architecture, capable of scaling to documents up to 32K tokens long. We describe a pretraining data mixture which allows this encoder to process both short and long context sequences, and a finetuning approach that adapts this base model to retrieval with only single-sample batches. Finally, we validate the M2-BERT retrieval encoder on LoCoV1, finding that it outperforms competitive Transformer-based models by at least 22.2 points, despite containing 90× fewer parameters.}
}

Endnote

%0 Conference Paper
%T Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT
%A Jon Saad-Falcon
%A Daniel Y Fu
%A Simran Arora
%A Neel Guha
%A Christopher Re
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp	
%F pmlr-v235-saad-falcon24a
%I PMLR
%P 42918--42946
%U https://proceedings.mlr.press/v235/saad-falcon24a.html
%V 235
%X Retrieval pipelines are an integral component of many machine learning systems. However, they perform poorly in domains where documents are long (e.g., 10K tokens or more) and where identifying the relevant document requires synthesizing information across the entire text. Developing long-context retrieval encoders suitable for these domains raises three challenges: (1) how to evaluate long-context retrieval performance, (2) how to pretrain a base language model to represent both short contexts (corresponding to queries) and long contexts (corresponding to documents), and (3) how to finetune this model for retrieval under the batch size limitations imposed by GPU memory constraints. To address these challenges, we first introduce LoCoV1, a 12 task benchmark constructed to measure long-context retrieval where chunking is not possible or not effective. We next present the M2-BERT retrieval encoder, an 80M parameter state-space encoder model built from the Monarch Mixer architecture, capable of scaling to documents up to 32K tokens long. We describe a pretraining data mixture which allows this encoder to process both short and long context sequences, and a finetuning approach that adapts this base model to retrieval with only single-sample batches. Finally, we validate the M2-BERT retrieval encoder on LoCoV1, finding that it outperforms competitive Transformer-based models by at least 22.2 points, despite containing 90× fewer parameters.

APA


Saad-Falcon, J., Fu, D.Y., Arora, S., Guha, N. & Re, C.. (2024). Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:42918-42946 Available from https://proceedings.mlr.press/v235/saad-falcon24a.html.

Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT

Abstract

Cite this Paper

Related Material