Post-Training Statistical Calibration for Higher Activation Sparsity

Vui Seng Chua, Yujie Pan, Nilesh Jain
Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, PMLR 262:206-221, 2024.

Abstract

We present Statistical Calibrated Activation Pruning (SCAP), a post-training activation pruning framework that (1) generalizes sparsification by input activations of Fully-Connected layers for generic and flexible application across Transformers, and (2) features a simple Mode-Centering technique to pre-calibrate activation distributions for maximizing post-training sparsity. Our results demonstrate robust Pareto efficiency compared to prior methods, translating to a 1.5× additional LLM decoding speedup against CATS [12] at iso model quality. SCAP's effectiveness is empirically verified across a wide range of models, including recent Transformer Decoders, MoE, Mamba2, Encoding Transformer, and pre-quantized models, highlighting its practicality and scalability. The code is available at https://github.com/IntelLabs/SCAP.
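
To make the two ingredients in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of mode-centered pruning of the input activations of a single Fully-Connected (nn.Linear) layer. It is not the authors' SCAP implementation (see the linked repository for that): the ModeCenteredSparseLinear wrapper, the histogram-based per-channel mode estimate, and the quantile-derived threshold are illustrative assumptions about how such a post-training calibration could look.

import torch
import torch.nn as nn


class ModeCenteredSparseLinear(nn.Module):
    """Wraps an nn.Linear; prunes mode-centered input activations at inference."""

    def __init__(self, linear: nn.Linear, sparsity: float = 0.5):
        super().__init__()
        self.linear = linear
        self.sparsity = sparsity
        # Per-input-channel mode and magnitude threshold, filled in by calibrate().
        self.register_buffer("mode", torch.zeros(linear.in_features))
        self.register_buffer("threshold", torch.zeros(linear.in_features))

    @torch.no_grad()
    def calibrate(self, calib_acts: torch.Tensor, bins: int = 2048):
        # calib_acts: [num_samples, in_features] input activations collected on
        # a small calibration set.
        for c in range(calib_acts.shape[1]):
            col = calib_acts[:, c].float()
            lo, hi = col.min().item(), col.max().item()
            hist = torch.histc(col, bins=bins, min=lo, max=hi)
            width = (hi - lo) / bins
            # Histogram-based mode estimate: center of the most populated bin.
            self.mode[c] = lo + (hist.argmax().item() + 0.5) * width
            # Threshold = quantile of |x - mode| that hits the target sparsity.
            self.threshold[c] = torch.quantile((col - self.mode[c]).abs(), self.sparsity)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        centered = x - self.mode
        # Zero the smallest-magnitude centered activations (post-training pruning).
        centered = torch.where(centered.abs() > self.threshold, centered,
                               torch.zeros_like(centered))
        # W @ (centered + mode) + b == W @ centered + (W @ mode + b): the mode term
        # is input-independent, so a sparse kernel only has to touch the nonzeros.
        return self.linear(centered) + self.linear.weight @ self.mode

A usage sketch: collect calib_acts by hooking the target Linear layer over a few calibration batches, call calibrate(), then swap the wrapper in for the original layer. In practice the constant W @ mode term would be folded into the layer bias ahead of time; it is kept explicit here only to make the algebra of Mode-Centering visible.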

Cite this Paper

BibTeX
@InProceedings{pmlr-v262-seng-chua24a,
  title     = {Post-Training Statistical Calibration for Higher Activation Sparsity},
  author    = {Seng Chua, Vui and Pan, Yujie and Jain, Nilesh},
  booktitle = {Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop},
  pages     = {206--221},
  year      = {2024},
  editor    = {Rezagholizadeh, Mehdi and Passban, Peyman and Samiee, Soheila and Partovi Nia, Vahid and Cheng, Yu and Deng, Yue and Liu, Qun and Chen, Boxing},
  volume    = {262},
  series    = {Proceedings of Machine Learning Research},
  month     = {14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v262/main/assets/seng-chua24a/seng-chua24a.pdf},
  url       = {https://proceedings.mlr.press/v262/seng-chua24a.html},
  abstract  = {We present Statistical Calibrated Activation Pruning (SCAP), a post-training activation pruning framework that (1) generalizes sparsification by input activations of Fully-Connected layers for generic and flexible application across Transformers, and (2) features a simple Mode-Centering technique to pre-calibrate activation distributions for maximizing post-training sparsity. Our results demonstrate robust Pareto efficiency compared to prior methods, translating to a 1.5× additional LLM decoding speedup against CATS [12] at iso model quality. SCAP's effectiveness is empirically verified across a wide range of models, including recent Transformer Decoders, MoE, Mamba2, Encoding Transformer, and pre-quantized models, highlighting its practicality and scalability. The code is available at https://github.com/IntelLabs/SCAP.}
}
Endnote
%0 Conference Paper
%T Post-Training Statistical Calibration for Higher Activation Sparsity
%A Vui Seng Chua
%A Yujie Pan
%A Nilesh Jain
%B Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop
%C Proceedings of Machine Learning Research
%D 2024
%E Mehdi Rezagholizadeh
%E Peyman Passban
%E Soheila Samiee
%E Vahid Partovi Nia
%E Yu Cheng
%E Yue Deng
%E Qun Liu
%E Boxing Chen
%F pmlr-v262-seng-chua24a
%I PMLR
%P 206--221
%U https://proceedings.mlr.press/v262/seng-chua24a.html
%V 262
%X We present Statistical Calibrated Activation Pruning (SCAP), a post-training activation pruning framework that (1) generalizes sparsification by input activations of Fully-Connected layers for generic and flexible application across Transformers, and (2) features a simple Mode-Centering technique to pre-calibrate activation distributions for maximizing post-training sparsity. Our results demonstrate robust Pareto efficiency compared to prior methods, translating to a 1.5× additional LLM decoding speedup against CATS [12] at iso model quality. SCAP's effectiveness is empirically verified across a wide range of models, including recent Transformer Decoders, MoE, Mamba2, Encoding Transformer, and pre-quantized models, highlighting its practicality and scalability. The code is available at https://github.com/IntelLabs/SCAP.
APA
Seng Chua, V., Pan, Y. & Jain, N.. (2024). Post-Training Statistical Calibration for Higher Activation Sparsity. Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, in Proceedings of Machine Learning Research 262:206-221. Available from https://proceedings.mlr.press/v262/seng-chua24a.html.
