ESPFormer: Doubly-Stochastic Attention with Expected Sliced Transport Plans

Ashkan Shahbazi, Elaheh Akbari, Darian Salehi, Xinran Liu, Navid Naderializadeh, Soheil Kolouri
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:54186-54202, 2025.

Abstract

While self-attention has been instrumental in the success of Transformers, it can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow. Enforcing doubly-stochastic constraints in attention matrices has been shown to improve structure and balance in attention distributions. However, existing methods rely on iterative Sinkhorn normalization, which is computationally costly. In this paper, we introduce a novel, fully parallelizable doubly-stochastic attention mechanism based on sliced optimal transport, leveraging Expected Sliced Transport Plans (ESP). Unlike prior approaches, our method enforces doubly stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency. To ensure differentiability, we incorporate a temperature-based soft sorting technique, enabling seamless integration into deep learning models. Experiments across multiple benchmark datasets, including image classification, point cloud classification, sentiment analysis, and neural machine translation, demonstrate that our enhanced attention regularization consistently improves performance across diverse applications. Our implementation code can be found at https://github.com/dariansal/ESPFormer.
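To make the averaging-of-sliced-plans idea concrete, here is a minimal, hypothetical PyTorch sketch (not the authors' released implementation; the function name and the num_slices parameter are illustrative). For each random slicing direction, projecting queries and keys to one dimension and matching tokens by rank solves the 1D optimal transport problem with uniform weights, giving a permutation matrix per slice; averaging permutation matrices over slices yields a doubly-stochastic matrix (Birkhoff–von Neumann) with no Sinkhorn iterations. The hard argsort below is not differentiable; the paper's method replaces it with a temperature-based soft sorting, which this sketch omits.

```python
import torch

def sliced_plan_attention(Q, K, num_slices=64):
    """Hypothetical sketch: average 1D transport plans over random slices.

    Q, K: (n, d) query and key embeddings with the same number of tokens n.
    Returns an (n, n) doubly-stochastic matrix.
    """
    n, d = Q.shape
    plan = torch.zeros(n, n)
    for _ in range(num_slices):
        theta = torch.randn(d)
        theta = theta / theta.norm()                      # random slicing direction
        q_rank = torch.argsort(torch.argsort(Q @ theta))  # rank of each projected query
        k_rank = torch.argsort(torch.argsort(K @ theta))  # rank of each projected key
        # 1D optimal transport with uniform weights matches points by rank,
        # so the per-slice plan is a permutation matrix.
        P = (q_rank[:, None] == k_rank[None, :]).float()
        plan = plan + P
    # Average of permutation matrices: every row and column sums to 1.
    return plan / num_slices

# Usage: row and column sums are both (approximately) all-ones vectors.
Q, K = torch.randn(6, 16), torch.randn(6, 16)
A = sliced_plan_attention(Q, K)
print(A.sum(dim=0), A.sum(dim=1))
```

In an attention layer, such a matrix would play the role of the attention weights, already balanced across rows and columns by construction; the full method additionally makes the sorting step differentiable so the whole block can be trained end to end.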

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-shahbazi25a,
  title     = {{ESPF}ormer: Doubly-Stochastic Attention with Expected Sliced Transport Plans},
  author    = {Shahbazi, Ashkan and Akbari, Elaheh and Salehi, Darian and Liu, Xinran and Naderializadeh, Navid and Kolouri, Soheil},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {54186--54202},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/shahbazi25a/shahbazi25a.pdf},
  url       = {https://proceedings.mlr.press/v267/shahbazi25a.html},
  abstract  = {While self-attention has been instrumental in the success of Transformers, it can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow. Enforcing doubly-stochastic constraints in attention matrices has been shown to improve structure and balance in attention distributions. However, existing methods rely on iterative Sinkhorn normalization, which is computationally costly. In this paper, we introduce a novel, fully parallelizable doubly-stochastic attention mechanism based on sliced optimal transport, leveraging Expected Sliced Transport Plans (ESP). Unlike prior approaches, our method enforces doubly stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency. To ensure differentiability, we incorporate a temperature-based soft sorting technique, enabling seamless integration into deep learning models. Experiments across multiple benchmark datasets, including image classification, point cloud classification, sentiment analysis, and neural machine translation, demonstrate that our enhanced attention regularization consistently improves performance across diverse applications. Our implementation code can be found at https://github.com/dariansal/ESPFormer.}
}
Endnote
%0 Conference Paper
%T ESPFormer: Doubly-Stochastic Attention with Expected Sliced Transport Plans
%A Ashkan Shahbazi
%A Elaheh Akbari
%A Darian Salehi
%A Xinran Liu
%A Navid Naderializadeh
%A Soheil Kolouri
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-shahbazi25a
%I PMLR
%P 54186--54202
%U https://proceedings.mlr.press/v267/shahbazi25a.html
%V 267
%X While self-attention has been instrumental in the success of Transformers, it can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow. Enforcing doubly-stochastic constraints in attention matrices has been shown to improve structure and balance in attention distributions. However, existing methods rely on iterative Sinkhorn normalization, which is computationally costly. In this paper, we introduce a novel, fully parallelizable doubly-stochastic attention mechanism based on sliced optimal transport, leveraging Expected Sliced Transport Plans (ESP). Unlike prior approaches, our method enforces doubly stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency. To ensure differentiability, we incorporate a temperature-based soft sorting technique, enabling seamless integration into deep learning models. Experiments across multiple benchmark datasets, including image classification, point cloud classification, sentiment analysis, and neural machine translation, demonstrate that our enhanced attention regularization consistently improves performance across diverse applications. Our implementation code can be found at https://github.com/dariansal/ESPFormer.
APA
Shahbazi, A., Akbari, E., Salehi, D., Liu, X., Naderializadeh, N. & Kolouri, S. (2025). ESPFormer: Doubly-Stochastic Attention with Expected Sliced Transport Plans. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:54186-54202. Available from https://proceedings.mlr.press/v267/shahbazi25a.html.
