SparseBERT: Rethinking the Importance Analysis in Self-attention

Han Shi, Jiahui Gao, Xiaozhe Ren, Hang Xu, Xiaodan Liang, Zhenguo Li, James Tin-Yau Kwok
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:9547-9557, 2021.

Abstract

Transformer-based models are widely used in natural language processing (NLP). Their core component, self-attention, has attracted widespread interest. A direct way to understand the self-attention mechanism is to visualize the attention maps of a pre-trained model, and based on the patterns observed, a series of efficient Transformers with different sparse attention masks have been proposed. From a theoretical perspective, the universal approximability of Transformer-based models has also recently been proved. However, the above understanding and analysis of self-attention are based on a pre-trained model. To rethink the importance analysis in self-attention, we study the significance of different positions in the attention matrix during pre-training. A surprising result is that the diagonal elements of the attention map are the least important compared with the other attention positions. We provide a proof showing that these diagonal elements can indeed be removed without deteriorating model performance. Furthermore, we propose a Differentiable Attention Mask (DAM) algorithm, which further guides the design of SparseBERT. Extensive experiments verify these findings and illustrate the effectiveness of the proposed algorithm.
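
To make the masking operation concrete, below is a minimal, self-contained NumPy sketch of single-head scaled dot-product attention with a fixed sparse attention mask; the particular mask shown drops the diagonal of the attention map, i.e., it prevents each token from attending to itself. The function names, the single-head setup, and the use of a large negative constant before the softmax are illustrative assumptions, not the paper's actual implementation (which pre-trains full BERT models).

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V, mask):
    # Scaled dot-product attention with a fixed binary mask.
    # mask[i, j] == 1 keeps position (i, j); mask[i, j] == 0 drives the
    # corresponding score to a large negative value before the softmax,
    # so token i places (almost) no attention weight on token j.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                # (n, n) attention scores
    scores = np.where(mask == 1, scores, -1e9)   # suppress masked positions
    attn = softmax(scores, axis=-1)              # (n, n) attention map
    return attn @ V

# Mask that removes the diagonal of the attention map: each token may
# attend to every other token but not to itself.
n, d = 6, 8
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
no_diag_mask = 1 - np.eye(n, dtype=int)
out = masked_self_attention(Q, K, V, no_diag_mask)
print(out.shape)  # (6, 8)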

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-shi21a,
  title     = {SparseBERT: Rethinking the Importance Analysis in Self-attention},
  author    = {Shi, Han and Gao, Jiahui and Ren, Xiaozhe and Xu, Hang and Liang, Xiaodan and Li, Zhenguo and Kwok, James Tin-Yau},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {9547--9557},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/shi21a/shi21a.pdf},
  url       = {https://proceedings.mlr.press/v139/shi21a.html}
}
Endnote
%0 Conference Paper
%T SparseBERT: Rethinking the Importance Analysis in Self-attention
%A Han Shi
%A Jiahui Gao
%A Xiaozhe Ren
%A Hang Xu
%A Xiaodan Liang
%A Zhenguo Li
%A James Tin-Yau Kwok
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-shi21a
%I PMLR
%P 9547--9557
%U https://proceedings.mlr.press/v139/shi21a.html
%V 139
APA
Shi, H., Gao, J., Ren, X., Xu, H., Liang, X., Li, Z., & Kwok, J. T. (2021). SparseBERT: Rethinking the Importance Analysis in Self-attention. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:9547-9557. Available from https://proceedings.mlr.press/v139/shi21a.html.