EL-Attention: Memory Efficient Lossless Attention for Generation

Yu Yan, Jiusheng Chen, Weizhen Qi, Nikhil Bhendawade, Yeyun Gong, Nan Duan, Ruofei Zhang
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11648-11658, 2021.

Abstract

The Transformer model with multi-head attention requires caching intermediate results for efficient inference in generation tasks. However, the cache brings new memory-related costs and prevents leveraging a larger batch size for faster speed. We propose memory-efficient lossless attention (called EL-attention) to address this issue. It avoids the heavy operations needed to build multi-head keys and values, so no cache is required for them. EL-attention constructs an ensemble of attention results by expanding the query while keeping the key and value shared. It produces the same result as multi-head attention with less GPU memory and faster inference speed. We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks. The results show that EL-attention speeds up existing models by 1.6x to 5.3x without accuracy loss.
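
The equivalence claimed in the abstract can be illustrated with a small numerical check. The sketch below is not the authors' implementation; it is a minimal NumPy illustration (single decoding step, projection biases omitted) of the idea described above: the per-head key projection is folded into an expanded query and the per-head value projection into a fused output projection, so the unprojected hidden states X are shared across heads instead of being projected and cached per head.

# Minimal NumPy sketch (not the authors' code) of the EL-attention idea for
# one decoding step: expand the query per head and share the raw hidden
# states X as key/value, instead of building and caching per-head
# K_i = X @ Wk_i and V_i = X @ Wv_i.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 16, 4, 10
d_head = d_model // n_heads

X = rng.standard_normal((seq_len, d_model))   # encoder states (shared key/value source)
q = rng.standard_normal((1, d_model))         # current decoder query

# Per-head projection matrices (biases omitted for clarity).
Wq = rng.standard_normal((n_heads, d_model, d_head))
Wk = rng.standard_normal((n_heads, d_model, d_head))
Wv = rng.standard_normal((n_heads, d_model, d_head))
Wo = rng.standard_normal((n_heads, d_head, d_model))  # blocks of the output projection

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Standard multi-head attention: build K_i, V_i per head (what gets cached).
out_mha = np.zeros((1, d_model))
for i in range(n_heads):
    K, V = X @ Wk[i], X @ Wv[i]
    scores = (q @ Wq[i]) @ K.T / np.sqrt(d_head)
    out_mha += softmax(scores) @ V @ Wo[i]

# EL-attention: expand the query, keep X shared; no per-head K/V is built.
out_el = np.zeros((1, d_model))
for i in range(n_heads):
    q_expanded = (q @ Wq[i]) @ Wk[i].T            # fold the key projection into the query
    scores = q_expanded @ X.T / np.sqrt(d_head)   # attend directly over the raw states X
    ctx = softmax(scores) @ X                     # shared "value" is X itself
    out_el += ctx @ (Wv[i] @ Wo[i])               # fold the value projection into the output

assert np.allclose(out_mha, out_el)               # same result, no per-head K/V cache

Because the softmax scores are identical in both branches, the outputs match exactly; the memory saving comes from never materializing or caching the per-head K and V tensors.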

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-yan21a,
  title     = {EL-Attention: Memory Efficient Lossless Attention for Generation},
  author    = {Yan, Yu and Chen, Jiusheng and Qi, Weizhen and Bhendawade, Nikhil and Gong, Yeyun and Duan, Nan and Zhang, Ruofei},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {11648--11658},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/yan21a/yan21a.pdf},
  url       = {https://proceedings.mlr.press/v139/yan21a.html},
  abstract  = {Transformer model with multi-head attention requires caching intermediate results for efficient inference in generation tasks. However, cache brings new memory-related costs and prevents leveraging larger batch size for faster speed. We propose memory-efficient lossless attention (called EL-attention) to address this issue. It avoids heavy operations for building multi-head keys and values, cache for them is not needed. EL-attention constructs an ensemble of attention results by expanding query while keeping key and value shared. It produces the same result as multi-head attention with less GPU memory and faster inference speed. We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks. The results show EL-attention speeds up existing models by 1.6x to 5.3x without accuracy loss.}
}
Endnote
%0 Conference Paper
%T EL-Attention: Memory Efficient Lossless Attention for Generation
%A Yu Yan
%A Jiusheng Chen
%A Weizhen Qi
%A Nikhil Bhendawade
%A Yeyun Gong
%A Nan Duan
%A Ruofei Zhang
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-yan21a
%I PMLR
%P 11648--11658
%U https://proceedings.mlr.press/v139/yan21a.html
%V 139
%X Transformer model with multi-head attention requires caching intermediate results for efficient inference in generation tasks. However, cache brings new memory-related costs and prevents leveraging larger batch size for faster speed. We propose memory-efficient lossless attention (called EL-attention) to address this issue. It avoids heavy operations for building multi-head keys and values, cache for them is not needed. EL-attention constructs an ensemble of attention results by expanding query while keeping key and value shared. It produces the same result as multi-head attention with less GPU memory and faster inference speed. We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks. The results show EL-attention speeds up existing models by 1.6x to 5.3x without accuracy loss.
APA
Yan, Y., Chen, J., Qi, W., Bhendawade, N., Gong, Y., Duan, N., & Zhang, R. (2021). EL-Attention: Memory Efficient Lossless Attention for Generation. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:11648-11658. Available from https://proceedings.mlr.press/v139/yan21a.html.