GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding

Cunxiao Du, Jing Jiang, Xu Yuanchen, Jiawei Wu, Sicheng Yu, Yongqi Li, Shenggui Li, Kai Xu, Liqiang Nie, Zhaopeng Tu, Yang You
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:11704-11720, 2024.

Abstract

Speculative decoding is a relatively new decoding framework that leverages small and efficient draft models to reduce the latency of LLMs. In this study, we introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding to further improve the decoding speed of a frozen LLM. Specifically, GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM, while CaPE is a proposal expansion method that uses the draft model’s confidence scores to help select additional candidate tokens for verification. Extensive experiments on different benchmarks demonstrate that our proposed GliDe draft model significantly reduces the expected decoding latency. Additional evaluation using walltime reveals that GliDe can accelerate Vicuna models up to 2.17x and further extend the improvement to 2.61x with CaPE. We will release our code, data, and the trained draft models.
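To make the CaPE idea above more concrete, here is a minimal, hypothetical Python sketch of confidence-guided proposal expansion inside a speculative-decoding step. This is not the authors' implementation: the function name expand_proposals, the budget parameter, and the "spend extra candidates on the least-confident positions first" heuristic are assumptions for illustration only; the paper defines the actual expansion rule, and the GliDe draft architecture (which reuses the target LLM's KV cache) is not modeled here.

# Sketch (assumed, not the paper's code): confidence-guided proposal expansion.
# Each drafted position keeps its top-1 token; positions where the draft model
# is least confident receive additional candidate tokens, up to a fixed
# verification budget, before the target model verifies the proposals.

def expand_proposals(draft_probs, budget=8, base_k=1):
    """draft_probs: per-position probability vectors from the draft model.
    Returns, for each position, a list of candidate token ids to verify."""
    n = len(draft_probs)
    # Allocate the leftover budget to the least-confident positions first.
    order = sorted(range(n), key=lambda i: max(draft_probs[i]))
    extra = {i: 0 for i in range(n)}
    remaining = budget - base_k * n
    for i in order:
        if remaining <= 0:
            break
        extra[i] += 1
        remaining -= 1
    proposals = []
    for i, probs in enumerate(draft_probs):
        k = base_k + extra[i]
        top_k = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)[:k]
        proposals.append(top_k)
    return proposals

# Toy usage: three drafted positions over a 5-token vocabulary.
draft_probs = [
    [0.70, 0.10, 0.10, 0.05, 0.05],  # confident position -> single candidate
    [0.30, 0.28, 0.22, 0.10, 0.10],  # uncertain position -> extra candidate
    [0.55, 0.20, 0.15, 0.05, 0.05],
]
print(expand_proposals(draft_probs, budget=5))  # [[0], [0, 1], [0, 1]]

The design intuition, as described in the abstract, is that extra verification effort is most useful exactly where the draft model's confidence is low, since those are the positions most likely to be rejected by the target LLM.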

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-du24c,
  title     = {{G}li{D}e with a {C}a{PE}: A Low-Hassle Method to Accelerate Speculative Decoding},
  author    = {Du, Cunxiao and Jiang, Jing and Yuanchen, Xu and Wu, Jiawei and Yu, Sicheng and Li, Yongqi and Li, Shenggui and Xu, Kai and Nie, Liqiang and Tu, Zhaopeng and You, Yang},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {11704--11720},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/du24c/du24c.pdf},
  url       = {https://proceedings.mlr.press/v235/du24c.html},
  abstract  = {Speculative decoding is a relatively new decoding framework that leverages small and efficient draft models to reduce the latency of LLMs. In this study, we introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding to further improve the decoding speed of a frozen LLM. Specifically, GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM, while CaPE is a proposal expansion method that uses the draft model’s confidence scores to help select additional candidate tokens for verification. Extensive experiments on different benchmarks demonstrate that our proposed GliDe draft model significantly reduces the expected decoding latency. Additional evaluation using walltime reveals that GliDe can accelerate Vicuna models up to 2.17x and further extend the improvement to 2.61x with CaPE. We will release our code, data, and the trained draft models.}
}
Endnote
%0 Conference Paper
%T GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding
%A Cunxiao Du
%A Jing Jiang
%A Xu Yuanchen
%A Jiawei Wu
%A Sicheng Yu
%A Yongqi Li
%A Shenggui Li
%A Kai Xu
%A Liqiang Nie
%A Zhaopeng Tu
%A Yang You
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-du24c
%I PMLR
%P 11704--11720
%U https://proceedings.mlr.press/v235/du24c.html
%V 235
%X Speculative decoding is a relatively new decoding framework that leverages small and efficient draft models to reduce the latency of LLMs. In this study, we introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding to further improve the decoding speed of a frozen LLM. Specifically, GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM, while CaPE is a proposal expansion method that uses the draft model’s confidence scores to help select additional candidate tokens for verification. Extensive experiments on different benchmarks demonstrate that our proposed GliDe draft model significantly reduces the expected decoding latency. Additional evaluation using walltime reveals that GliDe can accelerate Vicuna models up to 2.17x and further extend the improvement to 2.61x with CaPE. We will release our code, data, and the trained draft models.
APA
Du, C., Jiang, J., Yuanchen, X., Wu, J., Yu, S., Li, Y., Li, S., Xu, K., Nie, L., Tu, Z. & You, Y. (2024). GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:11704-11720. Available from https://proceedings.mlr.press/v235/du24c.html.
