LQER: Low-Rank Quantization Error Reconstruction for LLMs

Cheng Zhang, Jianyi Cheng, George Anthony Constantinides, Yiren Zhao
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:58763-58779, 2024.

Abstract

Post-training quantization of Large Language Models (LLMs) is challenging. In this work, we introduce Low-rank Quantization Error Reconstruction (LQER), which combines quantization and low-rank approximation to recover the model capability. LQER leverages an activation-induced scale matrix to drive the singular value distribution of quantization error towards a desirable distribution, which enables nearly-lossless W4A8 quantization on various LLMs and downstream tasks without the need for knowledge distillation, grid search, or gradient-based iterative optimization. Unlike existing methods, the computation pattern of LQER eliminates the need for specialized Scatter and Gather processes to collect high-precision weights from irregular memory locations. Our W4A8 LLMs achieve near-lossless performance on six popular downstream tasks, while using $1.36 \times$ fewer hardware resources than the leading state-of-the-art method. We will open-source our framework at https://github.com/ChengZhang-98/lqer
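The abstract describes the approach at a high level: quantize the weight matrix, then reconstruct the resulting quantization error with a low-rank term whose SVD is shaped by an activation-induced scale. The NumPy sketch below illustrates that idea under stated assumptions; the group size, the per-channel scale built from mean absolute calibration activations, the chosen rank, and all function names are illustrative choices, not the paper's exact recipe.

# Minimal NumPy sketch of the idea in the abstract: quantize W, then approximate
# the quantization error W - W_q with a low-rank term whose SVD is weighted by an
# activation-induced scale. Group size, scale construction, and rank are assumptions.
import numpy as np

def fake_int4_quantize(w, group_size=128):
    # Symmetric 4-bit quantize-dequantize over contiguous groups of weights (illustrative).
    w_g = w.reshape(-1, group_size)
    scale = np.abs(w_g).max(axis=1, keepdims=True) / 7.0 + 1e-8
    w_q = np.clip(np.round(w_g / scale), -8, 7) * scale
    return w_q.reshape(w.shape)

def lqer_factors(w, calib_x, rank=32):
    # Return (w_q, a, b) such that w ~ w_q + a @ b, where a @ b is a rank-`rank`
    # reconstruction of the quantization error under an activation-induced scale.
    w_q = fake_int4_quantize(w)
    err = w - w_q                                  # quantization error, shape (d_in, d_out)
    s = np.abs(calib_x).mean(axis=0) + 1e-6        # per-input-channel scale from calibration activations
    u, sigma, vt = np.linalg.svd(s[:, None] * err, full_matrices=False)
    a = (u[:, :rank] * sigma[:rank]) / s[:, None]  # undo the scale on the left factor
    b = vt[:rank]
    return w_q, a, b

# Toy usage: a 512x512 layer and 64 calibration tokens.
rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512)).astype(np.float32)
x = rng.normal(size=(64, 512)).astype(np.float32)
w_q, a, b = lqer_factors(w, x, rank=32)
y = x @ w_q + (x @ a) @ b                          # low-precision matmul plus low-rank correction
print("relative error:", np.linalg.norm(x @ w - y) / np.linalg.norm(x @ w))

Note that the correction term keeps a regular dense-matmul pattern (x @ a followed by @ b), which is the property the abstract contrasts with Scatter/Gather-based mixed-precision schemes.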

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-zhang24j,
  title     = {{LQER}: Low-Rank Quantization Error Reconstruction for {LLM}s},
  author    = {Zhang, Cheng and Cheng, Jianyi and Constantinides, George Anthony and Zhao, Yiren},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {58763--58779},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/zhang24j/zhang24j.pdf},
  url       = {https://proceedings.mlr.press/v235/zhang24j.html}
}
Endnote
%0 Conference Paper
%T LQER: Low-Rank Quantization Error Reconstruction for LLMs
%A Cheng Zhang
%A Jianyi Cheng
%A George Anthony Constantinides
%A Yiren Zhao
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-zhang24j
%I PMLR
%P 58763--58779
%U https://proceedings.mlr.press/v235/zhang24j.html
%V 235
APA
Zhang, C., Cheng, J., Constantinides, G.A. & Zhao, Y. (2024). LQER: Low-Rank Quantization Error Reconstruction for LLMs. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:58763-58779. Available from https://proceedings.mlr.press/v235/zhang24j.html.