Fast Inference with Kronecker-Sparse Matrices

Antoine Gonon, Léon Zheng, Pascal Carrivain, Tung Quoc Le
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:20075-20102, 2025.

Abstract

Kronecker-sparse (KS) matrices—whose supports are Kronecker products of identity and all-ones blocks—underpin the structure of Butterfly and Monarch matrices and offer the promise of more efficient models. However, existing GPU kernels for KS matrix multiplication suffer from high data movement costs, with up to 50% of time spent on memory-bound tensor permutations. We propose a fused, output-stationary GPU kernel that eliminates these overheads, reducing global memory traffic threefold. Across 600 KS patterns, our kernel achieves in FP32 a median speedup of x1.4 and lowers energy consumption by 15%. A simple heuristic based on KS pattern parameters predicts when our method outperforms existing ones. We release all code at github.com/PascalCarrivain/ksmm, including a PyTorch-compatible KSLinear layer, and demonstrate in FP32 end-to-end latency reductions of up to 22% in ViT-S/16 and 16% in GPT-2 medium.
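As a quick illustration of the support structure the abstract describes, here is a minimal NumPy sketch. It assumes the common (a, b, c, d) parametrization used in the butterfly/Monarch literature (the helper name `ks_support` and the example pattern are illustrative, not from the paper):

```python
import numpy as np

def ks_support(a: int, b: int, c: int, d: int) -> np.ndarray:
    """Sketch of a Kronecker-sparse (KS) support pattern.

    Assumes the standard (a, b, c, d) parametrization: the support is
    I_a (x) 1_{b x c} (x) I_d, where I_n is the n x n identity and
    1_{b x c} is the b x c all-ones block.
    """
    return np.kron(np.kron(np.eye(a), np.ones((b, c))), np.eye(d))

# Example: the pattern (2, 2, 2, 1) yields a 4x4 block-diagonal support
# with two dense 2x2 blocks; a KS matrix stores only the a*b*c*d
# nonzeros allowed by this support instead of all (a*b*d)*(a*c*d) entries.
S = ks_support(2, 2, 2, 1)
```

Patterns like this are what make the multiplication memory-bound: the nonzeros are few but scattered in a structured way, which is why the kernel design (fused, output-stationary) matters.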

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-gonon25b,
  title     = {Fast Inference with {K}ronecker-Sparse Matrices},
  author    = {Gonon, Antoine and Zheng, L\'{e}on and Carrivain, Pascal and Le, Tung Quoc},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {20075--20102},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/gonon25b/gonon25b.pdf},
  url       = {https://proceedings.mlr.press/v267/gonon25b.html},
  abstract  = {Kronecker-sparse (KS) matrices—whose supports are Kronecker products of identity and all-ones blocks—underpin the structure of Butterfly and Monarch matrices and offer the promise of more efficient models. However, existing GPU kernels for KS matrix multiplication suffer from high data movement costs, with up to 50% of time spent on memory-bound tensor permutations. We propose a fused, output-stationary GPU kernel that eliminates these overheads, reducing global memory traffic threefold. Across 600 KS patterns, our kernel achieves in FP32 a median speedup of x1.4 and lowers energy consumption by 15%. A simple heuristic based on KS pattern parameters predicts when our method outperforms existing ones. We release all code at github.com/PascalCarrivain/ksmm, including a PyTorch-compatible KSLinear layer, and demonstrate in FP32 end-to-end latency reductions of up to 22% in ViT-S/16 and 16% in GPT-2 medium.}
}
Endnote
%0 Conference Paper
%T Fast Inference with Kronecker-Sparse Matrices
%A Antoine Gonon
%A Léon Zheng
%A Pascal Carrivain
%A Tung Quoc Le
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-gonon25b
%I PMLR
%P 20075--20102
%U https://proceedings.mlr.press/v267/gonon25b.html
%V 267
%X Kronecker-sparse (KS) matrices—whose supports are Kronecker products of identity and all-ones blocks—underpin the structure of Butterfly and Monarch matrices and offer the promise of more efficient models. However, existing GPU kernels for KS matrix multiplication suffer from high data movement costs, with up to 50% of time spent on memory-bound tensor permutations. We propose a fused, output-stationary GPU kernel that eliminates these overheads, reducing global memory traffic threefold. Across 600 KS patterns, our kernel achieves in FP32 a median speedup of x1.4 and lowers energy consumption by 15%. A simple heuristic based on KS pattern parameters predicts when our method outperforms existing ones. We release all code at github.com/PascalCarrivain/ksmm, including a PyTorch-compatible KSLinear layer, and demonstrate in FP32 end-to-end latency reductions of up to 22% in ViT-S/16 and 16% in GPT-2 medium.
APA
Gonon, A., Zheng, L., Carrivain, P. & Le, T.Q. (2025). Fast Inference with Kronecker-Sparse Matrices. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:20075-20102. Available from https://proceedings.mlr.press/v267/gonon25b.html.

Related Material