Matryoshka Quantization

Pranav Ajit Nair, Puranjay Datta, Jeff Dean, Prateek Jain, Aditya Kusupati
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:45484-45506, 2025.

Abstract

Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models, especially to low precisions such as int4 or int2, requires a trade-off in model quality; int2, in particular, is known to severely degrade quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels, or to serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types such as int8 inherently possess a nested (Matryoshka) structure: smaller bit-width integers, like int4 or int2, are nested within the most significant bits. Leveraging this insight, we propose Matryoshka Quantization (MatQuant), a multi-scale quantization technique that alleviates this challenge. MatQuant allows us to train and maintain a single quantized model yet serve it at the precision demanded by each deployment. Furthermore, leveraging MatQuant’s co-training and co-distillation, int2 models extracted by MatQuant outperform standard int2 quantization by up to 4% and 7% with OmniQuant and QAT as base algorithms, respectively. Finally, we demonstrate that by using an extra bit to represent outliers, a model with an effective precision of 2.05 bits improves by a further 6% with OmniQuant as the base algorithm.
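To make the nested-integer observation concrete, below is a minimal sketch, written for this page rather than taken from the paper, of how the int4 and int2 views of a weight tensor fall out of its int8 codes: keeping only the most significant bits yields the lower-precision representation. The simple asymmetric uint8 quantizer and the helper names slice_msb and dequantize are illustrative assumptions, not MatQuant's actual training or serving implementation.

import numpy as np

def slice_msb(codes_u8, bits):
    # Keep only the top `bits` most significant bits of the uint8 codes;
    # for bits=4 or bits=2 this is the nested int4/int2 representation.
    assert 1 <= bits <= 8
    return (codes_u8.astype(np.uint8) >> (8 - bits)).astype(np.uint8)

def dequantize(codes, bits, scale, zero):
    # Re-expand the b-bit codes onto the int8 grid so every precision
    # shares the same scale and zero-point, then map back to floats.
    return codes.astype(np.float32) * (2 ** (8 - bits)) * scale + zero

# Toy example: asymmetric uint8 quantization of a small weight vector,
# then readout of the same tensor at 8-, 4-, and 2-bit precision.
w = np.array([0.12, -0.07, 0.31, -0.25], dtype=np.float32)
zero = float(w.min())
scale = float(w.max() - w.min()) / 255.0
q8 = np.round((w - zero) / scale).astype(np.uint8)

for b in (8, 4, 2):
    w_hat = dequantize(slice_msb(q8, b), b, scale, zero)
    print(f"int{b}: max reconstruction error = {np.abs(w - w_hat).max():.3f}")

Because every precision is read out of the same uint8 tensor with the same quantization parameters, a serving stack could pick the bit-width per deployment without storing separate models; MatQuant's contribution, per the abstract, is co-training and co-distilling the model so that these sliced low-bit views remain accurate.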

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-nair25a,
  title     = {Matryoshka Quantization},
  author    = {Nair, Pranav Ajit and Datta, Puranjay and Dean, Jeff and Jain, Prateek and Kusupati, Aditya},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {45484--45506},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/nair25a/nair25a.pdf},
  url       = {https://proceedings.mlr.press/v267/nair25a.html}
}
APA
Nair, P.A., Datta, P., Dean, J., Jain, P. & Kusupati, A. (2025). Matryoshka Quantization. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:45484-45506. Available from https://proceedings.mlr.press/v267/nair25a.html.
