NestQuant: nested lattice quantization for matrix products and LLMs

Semyon Savkin; Eitan Porat; Or Ordentlich; Yury Polyanskiy

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:53042-53062, 2025.

Abstract

Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent works have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on the Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP, etc.). For example, NestQuant quantizes the weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving a perplexity of 6.6 on wikitext2. This represents a more than 55% reduction in the perplexity gap with respect to the unquantized model (perplexity of 6.14) compared to the state-of-the-art methods: Meta’s SpinQuant (perplexity 7.3), OstQuant (7.3), and QuaRot (8.2). Comparisons on bigger models (up to 70B) and on various LLM evaluation benchmarks confirm uniform superiority of NestQuant.
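The "more than 55% reduction in perplexity gap" figure can be checked from the numbers quoted in the abstract. The sketch below is an illustration only, not part of the paper: it assumes the gap is measured as the quantized perplexity minus the unquantized perplexity, and it uses the rounded values given above, so the results are approximate. The function name `gap_reduction` is hypothetical.

```python
# Perplexities on wikitext2 as quoted in the abstract (rounded values).
BASELINE = 6.14   # unquantized Llama-3-8B
NESTQUANT = 6.6   # NestQuant, 4-bit weights, KV-cache, and activations
OTHERS = {"SpinQuant": 7.3, "OstQuant": 7.3, "QuaRot": 8.2}


def gap_reduction(baseline: float, ours: float, theirs: float) -> float:
    """Fraction by which `ours` shrinks the perplexity gap to `baseline`,
    relative to the gap left by `theirs` (assumed definition of the gap)."""
    return 1.0 - (ours - baseline) / (theirs - baseline)


reductions = {
    name: gap_reduction(BASELINE, NESTQUANT, ppl) for name, ppl in OTHERS.items()
}

for name, reduction in reductions.items():
    print(f"NestQuant vs {name}: gap reduced by {reduction:.1%}")
```

Under these assumptions, the reduction relative to SpinQuant and OstQuant comes out at roughly 60%, and the reduction relative to QuaRot is larger, which is consistent with the "more than 55%" claim.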

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-savkin25a,
  title = 	 {{N}est{Q}uant: nested lattice quantization for matrix products and {LLM}s},
  author =       {Savkin, Semyon and Porat, Eitan and Ordentlich, Or and Polyanskiy, Yury},
  booktitle = 	 {Proceedings of the 42nd International Conference on Machine Learning},
  pages = 	 {53042--53062},
  year = 	 {2025},
  editor = 	 {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = 	 {267},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--19 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v267/main/assets/savkin25a/savkin25a.pdf},
  url = 	 {https://proceedings.mlr.press/v267/savkin25a.html},
  abstract = 	 {Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent works have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP etc). For example, NestQuant quantizes weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving perplexity of 6.6 on wikitext2. This represents more than 55% reduction in perplexity gap with respect to unquantized model (perplexity of 6.14) compared to state-of-the-art Meta’s SpinQuant (perplexity 7.3), OstQuant (7.3) and QuaRot (8.2). Comparisons on bigger models (up to 70B) and on various LLM evaluation benchmarks confirm uniform superiority of NestQuant.}
}

Endnote

%0 Conference Paper
%T NestQuant: nested lattice quantization for matrix products and LLMs
%A Semyon Savkin
%A Eitan Porat
%A Or Ordentlich
%A Yury Polyanskiy
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-savkin25a
%I PMLR
%P 53042--53062
%U https://proceedings.mlr.press/v267/savkin25a.html
%V 267
%X Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent works have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP etc). For example, NestQuant quantizes weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving perplexity of 6.6 on wikitext2. This represents more than 55% reduction in perplexity gap with respect to unquantized model (perplexity of 6.14) compared to state-of-the-art Meta’s SpinQuant (perplexity 7.3), OstQuant (7.3) and QuaRot (8.2). Comparisons on bigger models (up to 70B) and on various LLM evaluation benchmarks confirm uniform superiority of NestQuant.

APA

Savkin, S., Porat, E., Ordentlich, O. & Polyanskiy, Y. (2025). NestQuant: nested lattice quantization for matrix products and LLMs. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:53042-53062. Available from https://proceedings.mlr.press/v267/savkin25a.html.
