Bifurcated Attention for Single-Context Large-Batch Sampling

Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, Qing Sun, Jun Wang, Jiacheng Guo, Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, Bing Xiang
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:1971-1991, 2024.

Abstract

We present bifurcated attention, a method for language model inference in single-context batch sampling settings, i.e., generating many completions from one shared prompt. The approach reduces redundant memory IO, a significant source of latency at high batch sizes and long context lengths, by splitting the attention computation during incremental decoding into two distinct GEMM operations: one over the KV cache from the prefill, which is shared across the batch, and one over the KV cache produced during decoding. The computation remains exact and requires the same FLOPs as standard attention, but with reduced memory IO. Bifurcated attention is also compatible with multi-query attention, which itself reduces KV-cache memory IO, further enabling higher batch sizes and longer context lengths. The resulting efficiency lowers latency and improves suitability for real-time applications, e.g., enabling massively parallel answer generation without substantially increasing latency, which enhances performance when combined with post-processing techniques such as reranking.
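
The core computation can be illustrated with a minimal PyTorch sketch (our own illustration, not the authors' implementation): during incremental decoding, the prefill keys and values are stored once for the whole batch, attention scores are computed as two GEMMs against the shared context cache and the per-sample decode cache, and the two parts are merged in a single softmax. Tensor names and shapes such as k_ctx and k_dec are assumptions made for exposition.

# Minimal sketch of bifurcated attention (assumed shapes, not the paper's code).
import math
import torch

def bifurcated_attention(q, k_ctx, v_ctx, k_dec, v_dec):
    """
    q:      [b, h, 1, d]    query for the current decoding step of each sample
    k_ctx:  [h, m_c, d]     shared prefill keys, stored once for the whole batch
    v_ctx:  [h, m_c, d]     shared prefill values
    k_dec:  [b, h, m_d, d]  per-sample keys generated so far during decoding
    v_dec:  [b, h, m_d, d]  per-sample values generated so far during decoding
    """
    d = q.shape[-1]
    # GEMM 1: scores against the shared context cache (broadcast over the batch)
    s_ctx = torch.einsum("bhqd,hkd->bhqk", q, k_ctx) / math.sqrt(d)
    # GEMM 2: scores against the per-sample decode cache
    s_dec = torch.einsum("bhqd,bhkd->bhqk", q, k_dec) / math.sqrt(d)
    # One softmax over both score blocks, then split the weights back apart
    w = torch.softmax(torch.cat([s_ctx, s_dec], dim=-1), dim=-1)
    w_ctx, w_dec = w.split([k_ctx.shape[-2], k_dec.shape[-2]], dim=-1)
    # Weighted sums: the context half reads v_ctx only once for the whole batch
    out = torch.einsum("bhqk,hkd->bhqd", w_ctx, v_ctx)
    out += torch.einsum("bhqk,bhkd->bhqd", w_dec, v_dec)
    return out  # [b, h, 1, d], matching standard attention over the full cache

Because the context cache is read once per decoding step rather than once per sample, the memory IO for the shared prefix no longer scales with the batch size, which is where the latency savings come from.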

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-athiwaratkun24a,
  title     = {Bifurcated Attention for Single-Context Large-Batch Sampling},
  author    = {Athiwaratkun, Ben and Gonugondla, Sujan Kumar and Gouda, Sanjay Krishna and Qian, Haifeng and Ding, Hantian and Sun, Qing and Wang, Jun and Guo, Jiacheng and Chen, Liangfu and Bhatia, Parminder and Nallapati, Ramesh and Sengupta, Sudipta and Xiang, Bing},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {1971--1991},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/athiwaratkun24a/athiwaratkun24a.pdf},
  url       = {https://proceedings.mlr.press/v235/athiwaratkun24a.html}
}
Endnote
%0 Conference Paper
%T Bifurcated Attention for Single-Context Large-Batch Sampling
%A Ben Athiwaratkun
%A Sujan Kumar Gonugondla
%A Sanjay Krishna Gouda
%A Haifeng Qian
%A Hantian Ding
%A Qing Sun
%A Jun Wang
%A Jiacheng Guo
%A Liangfu Chen
%A Parminder Bhatia
%A Ramesh Nallapati
%A Sudipta Sengupta
%A Bing Xiang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-athiwaratkun24a
%I PMLR
%P 1971--1991
%U https://proceedings.mlr.press/v235/athiwaratkun24a.html
%V 235
APA
Athiwaratkun, B., Gonugondla, S.K., Gouda, S.K., Qian, H., Ding, H., Sun, Q., Wang, J., Guo, J., Chen, L., Bhatia, P., Nallapati, R., Sengupta, S. & Xiang, B. (2024). Bifurcated Attention for Single-Context Large-Batch Sampling. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:1971-1991. Available from https://proceedings.mlr.press/v235/athiwaratkun24a.html.