Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:10411-10427, 2025.

Abstract

State Space Models (SSMs) are gaining attention as an efficient alternative to Transformers due to their constant memory complexity and comparable performance. Yet, deploying large-scale SSMs on cloud-based services or resource-constrained devices remains challenging. To address this, quantizing SSMs with low bit-width data types has been proposed to reduce model size and leverage hardware acceleration. Given that SSMs are sensitive to quantization errors, recent advances focus on quantizing a specific model or bit-width to improve efficiency while maintaining performance. However, different bit-width configurations are necessary for specific scenarios, such as W4A8 for cloud-service throughput and W4A16 for improving question answering on personal devices. To this end, we present Quamba2, compatible with W8A8, W4A8, and W4A16 for both Mamba and Mamba2, addressing the rising demand for SSM deployment across various platforms. We propose an offline approach to quantize the inputs of a linear recurrence in 8-bit: sorting and clustering for $x$, combined with per-state-group quantization for $B$ and $C$. To ensure compute-invariance in the SSM output, we rearrange the weights offline according to the clustering order. Experiments show that Quamba2-8B outperforms several state-of-the-art SSM quantization methods and delivers $1.3\times$ and $3\times$ speedups in the pre-filling and generation stages, respectively, along with a $4\times$ memory reduction, at the cost of only a $1.6$% average accuracy drop. The code and quantized models will be released at:
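To make the quantization recipe in the abstract concrete, the following is a minimal sketch (not the authors' released implementation) of the two offline steps it describes: sorting and clustering the channels of the SSM input $x$ so that each cluster shares one 8-bit scale, and per-state-group scales for $B$ and $C$. Function names, shapes, group counts, and the equal-size clustering are illustrative assumptions.

```python
# Illustrative sketch only; names, shapes, and the equal-size clustering are
# assumptions, not the authors' code.
import numpy as np

def symmetric_int8(t, scale):
    """Symmetric 8-bit quantization with a shared scale."""
    return np.clip(np.round(t / scale), -128, 127).astype(np.int8)

def sort_and_cluster_x(x_calib, n_clusters):
    """Offline sort-and-cluster for the SSM input x: order channels by their
    calibrated maxima, split the sorted order into contiguous clusters, and
    give each cluster one 8-bit scale."""
    channel_max = np.abs(x_calib).max(axis=0)        # per-channel calibration statistic
    order = np.argsort(channel_max)                  # sorting step
    clusters = np.array_split(order, n_clusters)     # clustering step (simplified: equal-size groups)
    scales = [channel_max[c].max() / 127.0 for c in clusters]
    return order, clusters, scales

def per_state_group_scales(m, n_groups):
    """Per-state-group quantization for B and C: one scale per group of state dimensions."""
    groups = np.array_split(m, n_groups, axis=-1)
    return [np.abs(g).max() / 127.0 for g in groups]

# Calibration-time usage with toy shapes. Because channels are reordered, the
# permutation `order` must also be applied offline to the weights that produce x,
# which is what keeps the SSM output compute-invariant.
x_calib = np.random.randn(1024, 256).astype(np.float32)   # (tokens, channels)
order, clusters, scales = sort_and_cluster_x(x_calib, n_clusters=8)
x_q = np.concatenate(
    [symmetric_int8(x_calib[:, c], s) for c, s in zip(clusters, scales)], axis=1
)
```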

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-chiang25a,
  title     = {Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models},
  author    = {Chiang, Hung-Yueh and Chang, Chi-Chih and Frumkin, Natalia and Wu, Kai-Chiang and Abdelfattah, Mohamed S. and Marculescu, Diana},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {10411--10427},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chiang25a/chiang25a.pdf},
  url       = {https://proceedings.mlr.press/v267/chiang25a.html}
}
Endnote
%0 Conference Paper
%T Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
%A Hung-Yueh Chiang
%A Chi-Chih Chang
%A Natalia Frumkin
%A Kai-Chiang Wu
%A Mohamed S. Abdelfattah
%A Diana Marculescu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chiang25a
%I PMLR
%P 10411--10427
%U https://proceedings.mlr.press/v267/chiang25a.html
%V 267
APA
Chiang, H., Chang, C., Frumkin, N., Wu, K., Abdelfattah, M.S. & Marculescu, D. (2025). Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:10411-10427. Available from https://proceedings.mlr.press/v267/chiang25a.html.