Relative Positional Encoding for Transformers with Linear Complexity

Antoine Liutkus; Ondřej Cífka; Shih-Lun Wu; Umut Simsekli; Yi-Hsuan Yang; Gael Richard

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:7067-7079, 2021.

Abstract

Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers and consists in exploiting lags instead of absolute positions for inference. Still, RPE is not available for the recent linear-variants of the Transformer, because it requires the explicit computation of the attention matrix, which is precisely what is avoided by such methods. In this paper, we bridge this gap and present Stochastic Positional Encoding as a way to generate PE that can be used as a replacement to the classical additive (sinusoidal) PE and provably behaves like RPE. The main theoretical contribution is to make a connection between positional encoding and cross-covariance structures of correlated Gaussian processes. We illustrate the performance of our approach on the Long-Range Arena benchmark and on music generation.

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-liutkus21a,
  title = {Relative Positional Encoding for Transformers with Linear Complexity},
  author = {Liutkus, Antoine and C\'{\i}fka, Ond{\v{r}}ej and Wu, Shih-Lun and Simsekli, Umut and Yang, Yi-Hsuan and Richard, Gael},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages = {7067--7079},
  year = {2021},
  editor = {Meila, Marina and Zhang, Tong},
  volume = {139},
  series = {Proceedings of Machine Learning Research},
  month = {18--24 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v139/liutkus21a/liutkus21a.pdf},
  url = {https://proceedings.mlr.press/v139/liutkus21a.html},
  abstract = {Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers and consists in exploiting lags instead of absolute positions for inference. Still, RPE is not available for the recent linear-variants of the Transformer, because it requires the explicit computation of the attention matrix, which is precisely what is avoided by such methods. In this paper, we bridge this gap and present Stochastic Positional Encoding as a way to generate PE that can be used as a replacement to the classical additive (sinusoidal) PE and provably behaves like RPE. The main theoretical contribution is to make a connection between positional encoding and cross-covariance structures of correlated Gaussian processes. We illustrate the performance of our approach on the Long-Range Arena benchmark and on music generation.}
}

Endnote

%0 Conference Paper
%T Relative Positional Encoding for Transformers with Linear Complexity
%A Antoine Liutkus
%A Ondřej Cífka
%A Shih-Lun Wu
%A Umut Simsekli
%A Yi-Hsuan Yang
%A Gael Richard
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-liutkus21a
%I PMLR
%P 7067--7079
%U https://proceedings.mlr.press/v139/liutkus21a.html
%V 139
%X Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers and consists in exploiting lags instead of absolute positions for inference. Still, RPE is not available for the recent linear-variants of the Transformer, because it requires the explicit computation of the attention matrix, which is precisely what is avoided by such methods. In this paper, we bridge this gap and present Stochastic Positional Encoding as a way to generate PE that can be used as a replacement to the classical additive (sinusoidal) PE and provably behaves like RPE. The main theoretical contribution is to make a connection between positional encoding and cross-covariance structures of correlated Gaussian processes. We illustrate the performance of our approach on the Long-Range Arena benchmark and on music generation.

APA

Liutkus, A., Cífka, O., Wu, S., Simsekli, U., Yang, Y. & Richard, G. (2021). Relative Positional Encoding for Transformers with Linear Complexity. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:7067-7079. Available from https://proceedings.mlr.press/v139/liutkus21a.html.
