BiE: Bi-Exponent Block Floating-Point for Large Language Models Quantization

Lancheng Zou, Wenqian Zhao, Shuo Yin, Chen Bai, Qi Sun, Bei Yu
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:62978-62992, 2024.

Abstract

Large Language Models (LLMs) now commonly have billions of parameters, placing significant demands on hardware platforms. Although quantization is an efficient way to reduce the computation and memory overhead of inference, mainstream low-bit quantization approaches still suffer either from outliers in the data distribution or from a lack of hardware efficiency. We also find that low-bit data formats have further, untapped expressiveness for covering the atypical distributions of language-model data. In this paper, we propose a novel numerical representation, Bi-Exponent Block Floating Point (BiE), together with a new quantization flow. BiE quantization shows superior accuracy and hardware friendliness on various models and benchmarks.
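For background, the sketch below illustrates conventional block floating-point (BFP) quantization, the single-shared-exponent baseline that BiE's name suggests it extends; it is a minimal, assumed example (the function name, block size, and mantissa width are illustrative choices, not the paper's), and it does not implement the bi-exponent mechanism, whose details are not given in this abstract. It does show the outlier issue the abstract points to: one large value in a block forces a large shared exponent and crushes the block's small values.

import numpy as np

def bfp_quantize(x, block_size=16, mantissa_bits=4):
    # Conventional block floating-point: each block of `block_size` values
    # shares one exponent derived from the block's largest magnitude, and
    # mantissas are rounded to a signed grid of `mantissa_bits` bits.
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    # Shared exponent per block, taken from the largest magnitude element.
    max_mag = np.max(np.abs(blocks), axis=1, keepdims=True)
    shared_exp = np.floor(np.log2(np.maximum(max_mag, 1e-38)))
    scale = 2.0 ** shared_exp

    # Normalize by the shared scale, round to the mantissa grid, rescale.
    q_levels = 2 ** (mantissa_bits - 1)
    mantissa = np.clip(np.round(blocks / scale * q_levels), -q_levels, q_levels - 1)
    deq = mantissa / q_levels * scale

    return deq.reshape(-1)[:len(x)]

# One outlier (50.0) in a block of small values: after quantization the
# small values collapse to zero because the shared exponent is set by the
# outlier -- the accuracy problem that motivates richer formats like BiE.
vals = np.array([0.01, -0.02, 0.03, 50.0] + [0.01] * 12)
print(bfp_quantize(vals, block_size=16, mantissa_bits=4))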

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-zou24d,
  title     = {{B}i{E}: Bi-Exponent Block Floating-Point for Large Language Models Quantization},
  author    = {Zou, Lancheng and Zhao, Wenqian and Yin, Shuo and Bai, Chen and Sun, Qi and Yu, Bei},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {62978--62992},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/zou24d/zou24d.pdf},
  url       = {https://proceedings.mlr.press/v235/zou24d.html},
  abstract  = {Nowadays, Large Language Models (LLMs) mostly possess billions of parameters, bringing significant challenges to hardware platforms. Although quantization is an efficient approach to reduce computation and memory overhead for inference optimization, we stress the challenge that mainstream low-bit quantization approaches still suffer from either various data distribution outliers or a lack of hardware efficiency. We also find that low-bit data format has further potential expressiveness to cover the atypical language data distribution. In this paper, we propose a novel numerical representation, Bi-Exponent Block Floating Point (BiE), and a new quantization flow. BiE quantization shows accuracy superiority and hardware friendliness on various models and benchmarks.}
}
Endnote
%0 Conference Paper
%T BiE: Bi-Exponent Block Floating-Point for Large Language Models Quantization
%A Lancheng Zou
%A Wenqian Zhao
%A Shuo Yin
%A Chen Bai
%A Qi Sun
%A Bei Yu
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-zou24d
%I PMLR
%P 62978--62992
%U https://proceedings.mlr.press/v235/zou24d.html
%V 235
%X Nowadays, Large Language Models (LLMs) mostly possess billions of parameters, bringing significant challenges to hardware platforms. Although quantization is an efficient approach to reduce computation and memory overhead for inference optimization, we stress the challenge that mainstream low-bit quantization approaches still suffer from either various data distribution outliers or a lack of hardware efficiency. We also find that low-bit data format has further potential expressiveness to cover the atypical language data distribution. In this paper, we propose a novel numerical representation, Bi-Exponent Block Floating Point (BiE), and a new quantization flow. BiE quantization shows accuracy superiority and hardware friendliness on various models and benchmarks.
APA
Zou, L., Zhao, W., Yin, S., Bai, C., Sun, Q. & Yu, B. (2024). BiE: Bi-Exponent Block Floating-Point for Large Language Models Quantization. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:62978-62992. Available from https://proceedings.mlr.press/v235/zou24d.html.