What Dense Graph Do You Need for Self-Attention?

Yuxin Wang, Chu-Tak Lee, Qipeng Guo, Zhangyue Yin, Yunhua Zhou, Xuanjing Huang, Xipeng Qiu
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:22752-22768, 2022.

Abstract

Transformers have made progress on a wide range of tasks, but suffer from quadratic computational and memory complexity. Recent works propose sparse transformers that restrict attention to sparse graphs, reducing complexity while maintaining strong performance. While effective, the crucial question of how dense a graph needs to be to perform well has not been fully explored. In this paper, we propose Normalized Information Payload (NIP), a graph scoring function that measures information transfer on a graph and provides an analysis tool for trade-offs between performance and complexity. Guided by this theoretical analysis, we present Hypercube Transformer, a sparse transformer that models token interactions on a hypercube and achieves comparable or even better results than the vanilla transformer while yielding $O(N\log N)$ complexity for sequence length $N$. Experiments on tasks requiring various sequence lengths validate our graph scoring function.
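
As an illustration of the $O(N\log N)$ sparsity the abstract refers to, the sketch below builds a hypercube-style attention mask in which two tokens may attend to each other when their binary indices differ in exactly one bit, so each of the N tokens has about log2(N) neighbours. This is a minimal, hypothetical sketch assuming a power-of-two sequence length and NumPy; it is not the authors' released implementation, and the function name hypercube_attention_mask is ours.

```python
import numpy as np

def hypercube_attention_mask(n: int) -> np.ndarray:
    """Boolean attention mask for a hypercube graph over n tokens.

    Tokens i and j may attend to each other when their binary indices
    differ in exactly one bit, so each token has log2(n) neighbours and
    the mask allows O(N log N) pairs instead of O(N^2).
    Illustrative sketch only; assumes n is a power of two.
    """
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    dims = n.bit_length() - 1          # number of hypercube dimensions = log2(n)
    mask = np.eye(n, dtype=bool)       # every token attends to itself
    for d in range(dims):
        idx = np.arange(n)
        neighbours = idx ^ (1 << d)    # flip bit d to get the neighbour index
        mask[idx, neighbours] = True
    return mask

if __name__ == "__main__":
    m = hypercube_attention_mask(16)
    # 16 tokens * (log2(16) neighbours + self) = 16 * 5 = 80 allowed pairs,
    # versus 16 * 16 = 256 pairs for full attention.
    print(m.sum())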

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-wang22l,
  title     = {What Dense Graph Do You Need for Self-Attention?},
  author    = {Wang, Yuxin and Lee, Chu-Tak and Guo, Qipeng and Yin, Zhangyue and Zhou, Yunhua and Huang, Xuanjing and Qiu, Xipeng},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {22752--22768},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/wang22l/wang22l.pdf},
  url       = {https://proceedings.mlr.press/v162/wang22l.html},
  abstract  = {Transformers have made progress in miscellaneous tasks, but suffer from quadratic computational and memory complexities. Recent works propose sparse transformers with attention on sparse graphs to reduce complexity and remain strong performance. While effective, the crucial parts of how dense a graph needs to be to perform well are not fully explored. In this paper, we propose Normalized Information Payload (NIP), a graph scoring function measuring information transfer on graph, which provides an analysis tool for trade-offs between performance and complexity. Guided by this theoretical analysis, we present Hypercube Transformer, a sparse transformer that models token interactions in a hypercube and shows comparable or even better results with vanilla transformer while yielding $O(N\log N)$ complexity with sequence length $N$. Experiments on tasks requiring various sequence lengths lay validation for our graph function well.}
}
Endnote
%0 Conference Paper
%T What Dense Graph Do You Need for Self-Attention?
%A Yuxin Wang
%A Chu-Tak Lee
%A Qipeng Guo
%A Zhangyue Yin
%A Yunhua Zhou
%A Xuanjing Huang
%A Xipeng Qiu
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-wang22l
%I PMLR
%P 22752--22768
%U https://proceedings.mlr.press/v162/wang22l.html
%V 162
%X Transformers have made progress in miscellaneous tasks, but suffer from quadratic computational and memory complexities. Recent works propose sparse transformers with attention on sparse graphs to reduce complexity and remain strong performance. While effective, the crucial parts of how dense a graph needs to be to perform well are not fully explored. In this paper, we propose Normalized Information Payload (NIP), a graph scoring function measuring information transfer on graph, which provides an analysis tool for trade-offs between performance and complexity. Guided by this theoretical analysis, we present Hypercube Transformer, a sparse transformer that models token interactions in a hypercube and shows comparable or even better results with vanilla transformer while yielding $O(N\log N)$ complexity with sequence length $N$. Experiments on tasks requiring various sequence lengths lay validation for our graph function well.
APA
Wang, Y., Lee, C., Guo, Q., Yin, Z., Zhou, Y., Huang, X. & Qiu, X. (2022). What Dense Graph Do You Need for Self-Attention?. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:22752-22768. Available from https://proceedings.mlr.press/v162/wang22l.html.