Sparse Structure Exploration and Re-optimization for Vision Transformer

Sangho An, Jinwoo Kim, Keonho Lee, Jingang Huh, Chanwoong Kwak, Yujin Lee, Moonsub Jin, Jangho Kim
Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, PMLR 286:111-131, 2025.

Abstract

Vision Transformers (ViTs) achieve outstanding performance by effectively capturing long-range dependencies between image patches (tokens). However, the high computational cost and memory requirements of ViTs present challenges for model compression and deployment on edge devices. In this study, we introduce a new framework, Sparse Structure Exploration and Re-optimization (SERo), specifically designed to maximize pruning efficiency in ViTs. Our approach focuses on (1) hardware-friendly pruning that fully compresses pruned parameters instead of zeroing them out, (2) separating the exploration and re-optimization phases in order to find the optimal structure among various possible sparse structures, and (3) using a simple gradient magnitude-based criterion for pruning a pre-trained model. SERo iteratively refines pruning masks to identify optimal sparse structures and then re-optimizes the pruned structure, reducing computational costs while maintaining model performance. Experimental results indicate that SERo surpasses existing pruning methods across various ViT models in both performance and computational efficiency. For example, SERo achieves a 69% reduction in computational cost and a 2.4x increase in processing speed for the DeiT-Base model, with only a 1.55% drop in accuracy. Implementation code: https://github.com/Ahnho/SERo/
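The gradient magnitude-based criterion and hardware-friendly compression mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the per-unit gradient layout, the unit names (`head0`, ...), and the helper functions are assumptions chosen purely for illustration:

```python
import numpy as np

def gradient_magnitude_scores(grads):
    """Importance per structural unit: mean absolute gradient magnitude.
    `grads` maps a unit name (e.g. an attention head) to its gradient array."""
    return {name: float(np.abs(g).mean()) for name, g in grads.items()}

def select_prune_mask(scores, prune_ratio):
    """Mark the lowest-scoring fraction of units for pruning (True = keep)."""
    names = sorted(scores, key=scores.get)  # ascending importance
    pruned = set(names[:int(len(names) * prune_ratio)])
    return {name: name not in pruned for name in scores}

def compress(weights, mask):
    """Hardware-friendly compression: drop pruned units entirely instead of
    zeroing them out, so the stored model actually shrinks."""
    return {name: w for name, w in weights.items() if mask[name]}

# Toy example: four attention heads with synthetic gradients of very
# different magnitudes, so the ranking is unambiguous.
rng = np.random.default_rng(0)
scales = [0.1, 1.0, 0.5, 0.01]
grads = {f"head{i}": rng.normal(scale=s, size=64) for i, s in enumerate(scales)}
weights = {f"head{i}": rng.normal(size=(64, 64)) for i in range(4)}

scores = gradient_magnitude_scores(grads)
mask = select_prune_mask(scores, prune_ratio=0.5)
compact = compress(weights, mask)
print(sorted(compact))  # the two heads with the largest mean |grad| survive
```

In the full SERo framework this mask selection would be applied iteratively during the exploration phase, with the surviving structure re-optimized afterwards; the sketch only shows a single scoring-and-compression step.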

Cite this Paper


BibTeX
@InProceedings{pmlr-v286-an25a,
  title     = {Sparse Structure Exploration and Re-optimization for Vision Transformer},
  author    = {An, Sangho and Kim, Jinwoo and Lee, Keonho and Huh, Jingang and Kwak, Chanwoong and Lee, Yujin and Jin, Moonsub and Kim, Jangho},
  booktitle = {Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence},
  pages     = {111--131},
  year      = {2025},
  editor    = {Chiappa, Silvia and Magliacane, Sara},
  volume    = {286},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--25 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v286/main/assets/an25a/an25a.pdf},
  url       = {https://proceedings.mlr.press/v286/an25a.html},
  abstract  = {Vision Transformers (ViTs) achieve outstanding performance by effectively capturing long-range dependencies between image patches (tokens). However, the high computational cost and memory requirements of ViTs present challenges for model compression and deployment on edge devices. In this study, we introduce a new framework, Sparse Structure Exploration and Re-optimization (SERo), specifically designed to maximize pruning efficiency in ViTs. Our approach focuses on (1) hardware-friendly pruning that fully compresses pruned parameters instead of zeroing them out, (2) separating the exploration and re-optimization phases in order to find the optimal structure among various possible sparse structures, and (3) using a simple gradient magnitude-based criterion for pruning a pre-trained model. SERo iteratively refines pruning masks to identify optimal sparse structures and then re-optimizes the pruned structure, reducing computational costs while maintaining model performance. Experimental results indicate that SERo surpasses existing pruning methods across various ViT models in both performance and computational efficiency. For example, SERo achieves a 69% reduction in computational cost and a 2.4x increase in processing speed for the DeiT-Base model, with only a 1.55% drop in accuracy. Implementation code: https://github.com/Ahnho/SERo/}
}
Endnote
%0 Conference Paper
%T Sparse Structure Exploration and Re-optimization for Vision Transformer
%A Sangho An
%A Jinwoo Kim
%A Keonho Lee
%A Jingang Huh
%A Chanwoong Kwak
%A Yujin Lee
%A Moonsub Jin
%A Jangho Kim
%B Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2025
%E Silvia Chiappa
%E Sara Magliacane
%F pmlr-v286-an25a
%I PMLR
%P 111--131
%U https://proceedings.mlr.press/v286/an25a.html
%V 286
%X Vision Transformers (ViTs) achieve outstanding performance by effectively capturing long-range dependencies between image patches (tokens). However, the high computational cost and memory requirements of ViTs present challenges for model compression and deployment on edge devices. In this study, we introduce a new framework, Sparse Structure Exploration and Re-optimization (SERo), specifically designed to maximize pruning efficiency in ViTs. Our approach focuses on (1) hardware-friendly pruning that fully compresses pruned parameters instead of zeroing them out, (2) separating the exploration and re-optimization phases in order to find the optimal structure among various possible sparse structures, and (3) using a simple gradient magnitude-based criterion for pruning a pre-trained model. SERo iteratively refines pruning masks to identify optimal sparse structures and then re-optimizes the pruned structure, reducing computational costs while maintaining model performance. Experimental results indicate that SERo surpasses existing pruning methods across various ViT models in both performance and computational efficiency. For example, SERo achieves a 69% reduction in computational cost and a 2.4x increase in processing speed for the DeiT-Base model, with only a 1.55% drop in accuracy. Implementation code: https://github.com/Ahnho/SERo/
APA
An, S., Kim, J., Lee, K., Huh, J., Kwak, C., Lee, Y., Jin, M. & Kim, J. (2025). Sparse Structure Exploration and Re-optimization for Vision Transformer. Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 286:111-131. Available from https://proceedings.mlr.press/v286/an25a.html.