Quantization of Large Language Models with an Overdetermined Basis

Daniil Merkulov, Daria Cherniuk, Alexander Rudikov, Ivan Oseledets, Ekaterina Muravleva, Aleksandr Mikhalev, Boris Kashin
Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, PMLR 244:2527-2536, 2024.

Abstract

In this paper, we introduce an algorithm for data quantization based on the principles of Kashin representation. This approach hinges on decomposing any given vector, matrix, or tensor into two factors. The first factor maintains a small infinity norm, while the second exhibits a similarly constrained norm when multiplied by an orthogonal matrix. Surprisingly, the entries of factors after decomposition are well-concentrated around several peaks, which allows us to efficiently replace them with corresponding centroids for quantization purposes. We study the theoretical properties of the proposed approach and rigorously evaluate our compression algorithm in the context of next-word prediction tasks, employing models like OPT of varying sizes. Our findings demonstrate that Kashin Quantization achieves competitive quality in model performance while ensuring superior data compression, marking a significant advancement in the field of data quantization.
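To make the decomposition concrete, the sketch below shows one way the two factors could be computed by alternating clipped least-squares steps: since Q is orthogonal, each half-step is an exact minimization followed by a projection onto an infinity-norm ball. This is only an illustrative reading of the abstract, not the paper's implementation; the matrix Q, the clipping level, the iteration count, the centroid codebook, and the name kashin_decompose are all assumptions introduced here.

import numpy as np

def kashin_decompose(x, Q, level, n_iter=30):
    # Split x into a + Q @ b while keeping both factors inside an
    # infinity-norm ball of radius `level` (alternating minimization:
    # with Q orthogonal, each half-step reduces to a clip).
    a = np.zeros_like(x)
    b = np.zeros_like(x)
    for _ in range(n_iter):
        a = np.clip(x - Q @ b, -level, level)      # best a given b, then project
        b = np.clip(Q.T @ (x - a), -level, level)  # best b given a, then project
    return a, b

# Toy usage: a random orthogonal Q from a QR factorization of a Gaussian matrix.
rng = np.random.default_rng(0)
n = 512
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
x = rng.standard_normal(n)
level = np.linalg.norm(x) / np.sqrt(n)             # natural scale for the entries
a, b = kashin_decompose(x, Q, level)
rel_err = np.linalg.norm(x - a - Q @ b) / np.linalg.norm(x)

# Because the entries of a and b concentrate around a few values, each factor
# can be snapped to a small codebook of centroids; a hypothetical 3-level codebook:
centroids = np.array([-level, 0.0, level])
a_q = centroids[np.abs(a[:, None] - centroids[None, :]).argmin(axis=1)]

The paper evaluates the approach on OPT models of varying sizes and replaces factor entries with centroids matched to the observed peaks of their distribution; the three-level codebook above is only a stand-in for that step.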

Cite this Paper


BibTeX
@InProceedings{pmlr-v244-merkulov24a,
  title     = {Quantization of Large Language Models with an Overdetermined Basis},
  author    = {Merkulov, Daniil and Cherniuk, Daria and Rudikov, Alexander and Oseledets, Ivan and Muravleva, Ekaterina and Mikhalev, Aleksandr and Kashin, Boris},
  booktitle = {Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence},
  pages     = {2527--2536},
  year      = {2024},
  editor    = {Kiyavash, Negar and Mooij, Joris M.},
  volume    = {244},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v244/main/assets/merkulov24a/merkulov24a.pdf},
  url       = {https://proceedings.mlr.press/v244/merkulov24a.html},
  abstract  = {In this paper, we introduce an algorithm for data quantization based on the principles of Kashin representation. This approach hinges on decomposing any given vector, matrix, or tensor into two factors. The first factor maintains a small infinity norm, while the second exhibits a similarly constrained norm when multiplied by an orthogonal matrix. Surprisingly, the entries of factors after decomposition are well-concentrated around several peaks, which allows us to efficiently replace them with corresponding centroids for quantization purposes. We study the theoretical properties of the proposed approach and rigorously evaluate our compression algorithm in the context of next-word prediction tasks, employing models like OPT of varying sizes. Our findings demonstrate that Kashin Quantization achieves competitive quality in model performance while ensuring superior data compression, marking a significant advancement in the field of data quantization.}
}
Endnote
%0 Conference Paper
%T Quantization of Large Language Models with an Overdetermined Basis
%A Daniil Merkulov
%A Daria Cherniuk
%A Alexander Rudikov
%A Ivan Oseledets
%A Ekaterina Muravleva
%A Aleksandr Mikhalev
%A Boris Kashin
%B Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2024
%E Negar Kiyavash
%E Joris M. Mooij
%F pmlr-v244-merkulov24a
%I PMLR
%P 2527--2536
%U https://proceedings.mlr.press/v244/merkulov24a.html
%V 244
%X In this paper, we introduce an algorithm for data quantization based on the principles of Kashin representation. This approach hinges on decomposing any given vector, matrix, or tensor into two factors. The first factor maintains a small infinity norm, while the second exhibits a similarly constrained norm when multiplied by an orthogonal matrix. Surprisingly, the entries of factors after decomposition are well-concentrated around several peaks, which allows us to efficiently replace them with corresponding centroids for quantization purposes. We study the theoretical properties of the proposed approach and rigorously evaluate our compression algorithm in the context of next-word prediction tasks, employing models like OPT of varying sizes. Our findings demonstrate that Kashin Quantization achieves competitive quality in model performance while ensuring superior data compression, marking a significant advancement in the field of data quantization.
APA
Merkulov, D., Cherniuk, D., Rudikov, A., Oseledets, I., Muravleva, E., Mikhalev, A. & Kashin, B. (2024). Quantization of Large Language Models with an Overdetermined Basis. Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 244:2527-2536. Available from https://proceedings.mlr.press/v244/merkulov24a.html.
