Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation

Thomas Merth, Qichen Fu, Mohammad Rastegari, Mahyar Najibi
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:35507-35527, 2024.

Abstract

Despite the successes of large language models (LLMs), they exhibit significant drawbacks, particularly when processing long contexts. Their inference cost scales quadratically with respect to sequence length, making it expensive to deploy in some real-world text processing applications, such as retrieval-augmented generation (RAG). Additionally, LLMs exhibit the "distraction phenomenon", where irrelevant context in the prompt degrades output quality. To address these drawbacks, we propose a novel RAG prompting methodology, superposition prompting, which can be directly applied to pre-trained transformer-based LLMs without the need for fine-tuning. At a high level, superposition prompting allows the LLM to process input documents in parallel prompt paths, discarding paths once they are deemed irrelevant. We demonstrate the capability of our method to simultaneously enhance time efficiency across a variety of question-answering benchmarks using multiple pre-trained LLMs. Furthermore, our technique significantly improves accuracy when the retrieved context is large relative to the context the model was trained on. For example, our approach facilitates a $93\times$ reduction in compute time while improving accuracy by $43\%$ on the NaturalQuestions-Open dataset with the MPT-7B instruction-tuned model over naive RAG.
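
The mechanism described above is: each retrieved document is placed on its own prompt path (shared preamble + one document + the query), the paths are handled independently, and paths deemed irrelevant are discarded before generation. The sketch below is a minimal, hypothetical Python illustration of that control flow, not the authors' implementation: the PromptPath structure, the word-overlap score_path heuristic (a stand-in for an LLM-derived path relevance signal), and the keep_top_k cutoff are all assumptions made purely to give a runnable example.

# Minimal sketch of the parallel-path idea from the abstract: one path per
# retrieved document, independent scoring, pruning, then a single merged prompt.
from dataclasses import dataclass

@dataclass
class PromptPath:
    preamble: str   # shared instruction prefix, identical across paths
    document: str   # one retrieved document per path
    query: str      # the user question, appended to every path

def score_path(path: PromptPath) -> float:
    """Hypothetical relevance score; a toy stand-in for a model-based path score."""
    doc_tokens = set(path.document.lower().split())
    query_tokens = set(path.query.lower().split())
    return len(doc_tokens & query_tokens) / max(len(query_tokens), 1)

def superposition_prompt(preamble: str, documents: list[str], query: str,
                         keep_top_k: int = 2) -> str:
    # Build one independent path per document (these could be processed in parallel).
    paths = [PromptPath(preamble, doc, query) for doc in documents]
    # Discard paths deemed irrelevant; keep only the top-k by score.
    survivors = sorted(paths, key=score_path, reverse=True)[:keep_top_k]
    # Merge the surviving documents into a single prompt for generation.
    context = "\n\n".join(p.document for p in survivors)
    return f"{preamble}\n\n{context}\n\nQuestion: {query}\nAnswer:"

if __name__ == "__main__":
    docs = [
        "The Eiffel Tower is located in Paris and was completed in 1889.",
        "Bananas are rich in potassium and grow in tropical climates.",
        "Paris is the capital of France and home to the Eiffel Tower.",
    ]
    print(superposition_prompt("Answer using the documents.", docs,
                               "Where is the Eiffel Tower?"))

(In the actual method, the prompt paths themselves are processed by the LLM in parallel and pruned; the heuristic above only mimics the pruning step of that pipeline.)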

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-merth24a,
  title     = {Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation},
  author    = {Merth, Thomas and Fu, Qichen and Rastegari, Mohammad and Najibi, Mahyar},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {35507--35527},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/merth24a/merth24a.pdf},
  url       = {https://proceedings.mlr.press/v235/merth24a.html},
  abstract  = {Despite the successes of large language models (LLMs), they exhibit significant drawbacks, particularly when processing long contexts. Their inference cost scales quadratically with respect to sequence length, making it expensive to deploy in some real-world text processing applications, such as retrieval-augmented generation (RAG). Additionally, LLMs exhibit the "distraction phenomenon", where irrelevant context in the prompt degrades output quality. To address these drawbacks, we propose a novel RAG prompting methodology, superposition prompting, which can be directly applied to pre-trained transformer-based LLMs without the need for fine-tuning. At a high level, superposition prompting allows the LLM to process input documents in parallel prompt paths, discarding paths once they are deemed irrelevant. We demonstrate the capability of our method to simultaneously enhance time efficiency across a variety of question-answering benchmarks using multiple pre-trained LLMs. Furthermore, our technique significantly improves accuracy when the retrieved context is large relative to the context the model was trained on. For example, our approach facilitates a $93\times$ reduction in compute time while improving accuracy by $43\%$ on the NaturalQuestions-Open dataset with the MPT-7B instruction-tuned model over naive RAG.}
}
Endnote
%0 Conference Paper
%T Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation
%A Thomas Merth
%A Qichen Fu
%A Mohammad Rastegari
%A Mahyar Najibi
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-merth24a
%I PMLR
%P 35507--35527
%U https://proceedings.mlr.press/v235/merth24a.html
%V 235
%X Despite the successes of large language models (LLMs), they exhibit significant drawbacks, particularly when processing long contexts. Their inference cost scales quadratically with respect to sequence length, making it expensive to deploy in some real-world text processing applications, such as retrieval-augmented generation (RAG). Additionally, LLMs exhibit the "distraction phenomenon", where irrelevant context in the prompt degrades output quality. To address these drawbacks, we propose a novel RAG prompting methodology, superposition prompting, which can be directly applied to pre-trained transformer-based LLMs without the need for fine-tuning. At a high level, superposition prompting allows the LLM to process input documents in parallel prompt paths, discarding paths once they are deemed irrelevant. We demonstrate the capability of our method to simultaneously enhance time efficiency across a variety of question-answering benchmarks using multiple pre-trained LLMs. Furthermore, our technique significantly improves accuracy when the retrieved context is large relative to the context the model was trained on. For example, our approach facilitates a $93\times$ reduction in compute time while improving accuracy by $43\%$ on the NaturalQuestions-Open dataset with the MPT-7B instruction-tuned model over naive RAG.
APA
Merth, T., Fu, Q., Rastegari, M. & Najibi, M. (2024). Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:35507-35527. Available from https://proceedings.mlr.press/v235/merth24a.html.