Inductive Biases and Variable Creation in Self-Attention Mechanisms

Benjamin L Edelman, Surbhi Goel, Sham Kakade, Cyril Zhang
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:5793-5831, 2022.

Abstract

Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules. Our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer networks "create sparse variables": a single self-attention head can represent a sparse function of the input sequence, with sample complexity scaling only logarithmically with the context length. To support our analysis, we present synthetic experiments to probe the sample complexity of learning sparse Boolean functions with Transformers.
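As a rough illustration of the synthetic setting described in the abstract, below is a minimal sketch (not the authors' code) of learning a k-sparse Boolean function, here a sparse parity, of a length-T binary sequence with a single self-attention head in PyTorch; the model structure, dimensions, and training hyperparameters are illustrative assumptions, not the paper's exact experimental setup.

# Minimal sketch (assumptions, not the authors' implementation): a single-head
# attention classifier trained on a k-sparse Boolean target over T input bits.
import torch
import torch.nn as nn

T, k, d = 64, 3, 32                           # context length, sparsity, embedding dim
relevant = torch.randperm(T)[:k]              # hidden subset of relevant positions

def sparse_parity(x):                         # target: parity of the k relevant bits
    return (x[:, relevant].sum(dim=1) % 2).float()

class OneHeadAttn(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(2, d)             # embeddings for bit tokens {0, 1}
        self.pos = nn.Parameter(torch.randn(T, d))  # learned positional embeddings
        self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.cls = nn.Parameter(torch.randn(1, 1, d))  # [CLS]-style query token
        self.head = nn.Linear(d, 1)

    def forward(self, x):                     # x: (batch, T) tensor of 0/1 tokens
        h = self.embed(x) + self.pos          # (batch, T, d)
        q = self.cls.expand(x.size(0), -1, -1)
        out, _ = self.attn(q, h, h)           # one head attends over the whole sequence
        return self.head(out[:, 0]).squeeze(-1)

model = OneHeadAttn()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for step in range(2000):                      # fresh random samples each step
    x = torch.randint(0, 2, (128, T))
    loss = loss_fn(model(x), sparse_parity(x))
    opt.zero_grad(); loss.backward(); opt.step()

Varying T while tracking the number of samples needed to generalize is one way to probe the logarithmic dependence on context length that the paper's main result predicts.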

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-edelman22a,
  title     = {Inductive Biases and Variable Creation in Self-Attention Mechanisms},
  author    = {Edelman, Benjamin L and Goel, Surbhi and Kakade, Sham and Zhang, Cyril},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {5793--5831},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/edelman22a/edelman22a.pdf},
  url       = {https://proceedings.mlr.press/v162/edelman22a.html},
  abstract  = {Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules. Our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer networks "create sparse variables": a single self-attention head can represent a sparse function of the input sequence, with sample complexity scaling only logarithmically with the context length. To support our analysis, we present synthetic experiments to probe the sample complexity of learning sparse Boolean functions with Transformers.}
}
Endnote
%0 Conference Paper
%T Inductive Biases and Variable Creation in Self-Attention Mechanisms
%A Benjamin L Edelman
%A Surbhi Goel
%A Sham Kakade
%A Cyril Zhang
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-edelman22a
%I PMLR
%P 5793--5831
%U https://proceedings.mlr.press/v162/edelman22a.html
%V 162
%X Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules. Our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer networks "create sparse variables": a single self-attention head can represent a sparse function of the input sequence, with sample complexity scaling only logarithmically with the context length. To support our analysis, we present synthetic experiments to probe the sample complexity of learning sparse Boolean functions with Transformers.
APA
Edelman, B.L., Goel, S., Kakade, S. & Zhang, C. (2022). Inductive Biases and Variable Creation in Self-Attention Mechanisms. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:5793-5831. Available from https://proceedings.mlr.press/v162/edelman22a.html.