Accelerated Speculative Sampling Based on Tree Monte Carlo

Zhengmian Hu, Heng Huang
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:19216-19251, 2024.

Abstract

Speculative Sampling (SpS) has been introduced to speed up inference of large language models (LLMs) by generating multiple tokens in a single forward pass under the guidance of a reference model, while preserving the original distribution. We observe that SpS can be derived through maximum coupling on the token distribution. However, we find that this approach is not optimal as it applies maximum coupling incrementally for each new token, rather than seeking a global maximum coupling that yields a faster algorithm, given the tree-space nature of LLM generative distributions. In this paper, we shift our focus from distributions on a token space to those on a tree space. We propose a novel class of Tree Monte Carlo (TMC) methods, demonstrating their unbiasedness and convergence. As a particular instance of TMC, our new algorithm, Accelerated Speculative Sampling (ASpS), outperforms traditional SpS by generating more tokens per step on average, achieving faster inference, while maintaining the original distribution.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-hu24f, title = {Accelerated Speculative Sampling Based on Tree {M}onte {C}arlo}, author = {Hu, Zhengmian and Huang, Heng}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {19216--19251}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = {21--27 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/hu24f/hu24f.pdf}, url = {https://proceedings.mlr.press/v235/hu24f.html}, abstract = {Speculative Sampling (SpS) has been introduced to speed up inference of large language models (LLMs) by generating multiple tokens in a single forward pass under the guidance of a reference model, while preserving the original distribution. We observe that SpS can be derived through maximum coupling on the token distribution. However, we find that this approach is not optimal as it applies maximum coupling incrementally for each new token, rather than seeking a global maximum coupling that yields a faster algorithm, given the tree-space nature of LLM generative distributions. In this paper, we shift our focus from distributions on a token space to those on a tree space. We propose a novel class of Tree Monte Carlo (TMC) methods, demonstrating their unbiasedness and convergence. As a particular instance of TMC, our new algorithm, Accelerated Speculative Sampling (ASpS), outperforms traditional SpS by generating more tokens per step on average, achieving faster inference, while maintaining the original distribution.} }
Endnote
%0 Conference Paper %T Accelerated Speculative Sampling Based on Tree Monte Carlo %A Zhengmian Hu %A Heng Huang %B Proceedings of the 41st International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2024 %E Ruslan Salakhutdinov %E Zico Kolter %E Katherine Heller %E Adrian Weller %E Nuria Oliver %E Jonathan Scarlett %E Felix Berkenkamp %F pmlr-v235-hu24f %I PMLR %P 19216--19251 %U https://proceedings.mlr.press/v235/hu24f.html %V 235 %X Speculative Sampling (SpS) has been introduced to speed up inference of large language models (LLMs) by generating multiple tokens in a single forward pass under the guidance of a reference model, while preserving the original distribution. We observe that SpS can be derived through maximum coupling on the token distribution. However, we find that this approach is not optimal as it applies maximum coupling incrementally for each new token, rather than seeking a global maximum coupling that yields a faster algorithm, given the tree-space nature of LLM generative distributions. In this paper, we shift our focus from distributions on a token space to those on a tree space. We propose a novel class of Tree Monte Carlo (TMC) methods, demonstrating their unbiasedness and convergence. As a particular instance of TMC, our new algorithm, Accelerated Speculative Sampling (ASpS), outperforms traditional SpS by generating more tokens per step on average, achieving faster inference, while maintaining the original distribution.
APA
Hu, Z. & Huang, H.. (2024). Accelerated Speculative Sampling Based on Tree Monte Carlo. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:19216-19251 Available from https://proceedings.mlr.press/v235/hu24f.html.

Related Material