Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers

Abhimanyu Rajeshkumar Bambhaniya, Amir Yazdanbakhsh, Suvinay Subramanian, Sheng-Chun Kao, Shivani Agrawal, Utku Evci, Tushar Krishna
Conference on Parsimony and Learning, PMLR 280:1170-1190, 2025.

Abstract

N:M structured sparsity has garnered significant interest owing to its relatively modest overhead and improved efficiency. This form of sparsity also holds considerable appeal for reducing the memory footprint, thanks to its modest representation overhead. While there have been efforts to develop training recipes for N:M structured sparsity, they primarily focus on low-sparsity regions (50%), and the performance of models trained using these approaches tends to decline in high-sparsity regions (80%). In this work, we study the effectiveness of existing sparse training recipes in high-sparsity regions and argue that these methods fail to sustain model quality on par with low-sparsity regions. We demonstrate that a significant factor contributing to this disparity is the presence of elevated induced noise in the gradient magnitudes. To mitigate this undesirable effect, we employ decay mechanisms to progressively restrict the flow of gradients towards pruned elements. Our approach improves model quality by up to 2% and 5% in vision and language models, respectively, in the high-sparsity regime. We also evaluate the trade-off between model accuracy and training compute cost in terms of FLOPs. At iso-training FLOPs, our method outperforms conventional sparse training recipes, exhibiting an accuracy improvement of up to 2%.
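The core idea described in the abstract, restricting gradient flow to pruned elements with a progressively decaying factor, can be sketched as follows. This is a minimal PyTorch illustration under assumed details: the DecayedSparseLinear layer, the nm_mask helper, the 1:4 default pattern, and the linear set_decay schedule are hypothetical choices for illustration, not the authors' exact recipe.

# Illustrative sketch only; not the paper's exact method.
import torch
import torch.nn.functional as F


def nm_mask(weight: torch.Tensor, n: int = 1, m: int = 4) -> torch.Tensor:
    """Binary mask keeping the n largest-magnitude weights in each group of m.

    Assumes the number of weight elements is divisible by m.
    """
    groups = weight.reshape(-1, m)
    kept = groups.abs().topk(n, dim=1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(1, kept, 1.0)
    return mask.reshape(weight.shape)


class DecayedSparseLinear(torch.nn.Linear):
    """Linear layer whose pruned weights receive progressively attenuated gradients."""

    def __init__(self, in_features: int, out_features: int, n: int = 1, m: int = 4, bias: bool = True):
        super().__init__(in_features, out_features, bias=bias)
        self.n, self.m = n, m
        # decay = 1.0 -> full (dense) gradient flow to pruned weights;
        # decay = 0.0 -> gradients to pruned weights fully blocked.
        self.register_buffer("decay", torch.tensor(1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = nm_mask(self.weight.detach(), self.n, self.m)
        w_sparse = self.weight * mask
        # Straight-through construction: the forward value equals the N:M-masked
        # weights, while the backward gradient to pruned weights is scaled by `decay`.
        grad_mask = mask + self.decay * (1.0 - mask)
        w_eff = w_sparse.detach() + self.weight * grad_mask - (self.weight * grad_mask).detach()
        return F.linear(x, w_eff, self.bias)

    def set_decay(self, step: int, total_steps: int) -> None:
        """Hypothetical linear schedule: anneal decay from 1.0 to 0.0 over training."""
        self.decay.fill_(max(0.0, 1.0 - step / total_steps))

A training loop would call layer.set_decay(step, total_steps) at each step so that gradient flow toward pruned elements is gradually cut off as training progresses, rather than being blocked abruptly from the start.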

Cite this Paper


BibTeX
@InProceedings{pmlr-v280-bambhaniya25a,
  title     = {Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers},
  author    = {Bambhaniya, Abhimanyu Rajeshkumar and Yazdanbakhsh, Amir and Subramanian, Suvinay and Kao, Sheng-Chun and Agrawal, Shivani and Evci, Utku and Krishna, Tushar},
  booktitle = {Conference on Parsimony and Learning},
  pages     = {1170--1190},
  year      = {2025},
  editor    = {Chen, Beidi and Liu, Shijia and Pilanci, Mert and Su, Weijie and Sulam, Jeremias and Wang, Yuxiang and Zhu, Zhihui},
  volume    = {280},
  series    = {Proceedings of Machine Learning Research},
  month     = {24--27 Mar},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v280/main/assets/bambhaniya25a/bambhaniya25a.pdf},
  url       = {https://proceedings.mlr.press/v280/bambhaniya25a.html},
  abstract  = {N:M Structured sparsity has garnered significant interest as a result of relatively modest overhead and improved efficiency. Additionally, this form of sparsity holds considerable appeal for reducing the memory footprint owing to their modest representation overhead. There have been efforts to develop training recipes for N:M structured sparsity, they primarily focus on low-sparsity regions (50%). Nonetheless, performance of models trained using these approaches tends to decline when confronted with high-sparsity regions (80%). In this work, we study the effectiveness of existing sparse training recipes at high-sparsity regions and argue that these methods fail to sustain the model quality on par with low-sparsity regions. We demonstrate that the significant factor contributing to this disparity is the presence of elevated levels of induced noise in the gradient magnitudes. To mitigate this undesirable effect, we employ decay mechanisms to progressively restrict the flow of gradients towards pruned elements. Our approach improves the model quality by up to 2% and 5% in vision and language models at high sparsity regime, respectively. We also evaluate the trade-off between model accuracy and training compute cost in terms of FLOPs. At iso-training FLOPs, our method performs better than conventional sparse training recipes, exhibiting an accuracy improvement of up to 2%.}
}
Endnote
%0 Conference Paper
%T Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers
%A Abhimanyu Rajeshkumar Bambhaniya
%A Amir Yazdanbakhsh
%A Suvinay Subramanian
%A Sheng-Chun Kao
%A Shivani Agrawal
%A Utku Evci
%A Tushar Krishna
%B Conference on Parsimony and Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Beidi Chen
%E Shijia Liu
%E Mert Pilanci
%E Weijie Su
%E Jeremias Sulam
%E Yuxiang Wang
%E Zhihui Zhu
%F pmlr-v280-bambhaniya25a
%I PMLR
%P 1170--1190
%U https://proceedings.mlr.press/v280/bambhaniya25a.html
%V 280
%X N:M Structured sparsity has garnered significant interest as a result of relatively modest overhead and improved efficiency. Additionally, this form of sparsity holds considerable appeal for reducing the memory footprint owing to their modest representation overhead. There have been efforts to develop training recipes for N:M structured sparsity, they primarily focus on low-sparsity regions (50%). Nonetheless, performance of models trained using these approaches tends to decline when confronted with high-sparsity regions (80%). In this work, we study the effectiveness of existing sparse training recipes at high-sparsity regions and argue that these methods fail to sustain the model quality on par with low-sparsity regions. We demonstrate that the significant factor contributing to this disparity is the presence of elevated levels of induced noise in the gradient magnitudes. To mitigate this undesirable effect, we employ decay mechanisms to progressively restrict the flow of gradients towards pruned elements. Our approach improves the model quality by up to 2% and 5% in vision and language models at high sparsity regime, respectively. We also evaluate the trade-off between model accuracy and training compute cost in terms of FLOPs. At iso-training FLOPs, our method performs better than conventional sparse training recipes, exhibiting an accuracy improvement of up to 2%.
APA
Bambhaniya, A.R., Yazdanbakhsh, A., Subramanian, S., Kao, S., Agrawal, S., Evci, U. & Krishna, T. (2025). Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers. Conference on Parsimony and Learning, in Proceedings of Machine Learning Research 280:1170-1190. Available from https://proceedings.mlr.press/v280/bambhaniya25a.html.

Related Material