Snakes and Ladders: Accelerating SSM Inference with Speculative Decoding

Yangchao Wu, Yonatan Dukler, Matthew Trager, Alessandro Achille, Wei Xia, Stefano Soatto
Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, PMLR 262:292-304, 2024.

Abstract

Speculative decoding is a method for accelerating inference in large language models (LLMs) by predicting multiple tokens using a smaller ‘draft model’ and validating them against the larger ‘base model.’ If a draft token is inconsistent with what the base model would have generated, speculative decoding ‘backtracks’ to the last consistent token before resuming generation. This is straightforward in autoregressive Transformer architectures since their state is a sliding window of past tokens. However, their baseline inference complexity is quadratic in the number of input tokens. State Space Models (SSMs) have linear inference complexity, but they maintain a separate Markov state that makes backtracking non-trivial. We propose two methods to perform speculative decoding in SSMs: “Joint Attainment and Advancement” and “Activation Replay.” Both methods utilize idle computational resources to speculate and verify multiple tokens, allowing us to produce 6 tokens for 1.47$\times$ the cost of one, corresponding to an average 1.82$\times$ wall-clock speed-up on three different benchmarks using a simple $n$-gram for drafting. Furthermore, as model size increases, relative overhead of speculation and verification decreases: Scaling from 1.3B parameters to 13B reduces relative overhead from 1.98$\times$ to 1.22$\times$. Unlike Transformers, speculative decoding in SSMs can be easily applied to batches of sequences, allowing dynamic allocation of resources to fill gaps in compute utilization and thereby improving efficiency and throughput with variable inference traffic.
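
To make the mechanics described in the abstract concrete, below is a minimal, self-contained Python/NumPy sketch of greedy speculative decoding over a toy diagonal SSM. It is an illustration of a generic checkpoint-and-replay scheme, not the paper's Joint Attainment and Advancement or Activation Replay algorithms: the ToySSM class, the draft_tokens and speculative_step functions, the repeat-last-token drafter, and all sizes are hypothetical stand-ins. The sketch checkpoints the recurrent state before consuming draft tokens, verifies them against the base model's greedy predictions, and, on a mismatch, restores the checkpoint and replays only the accepted prefix.

import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 16, 8  # hypothetical vocabulary size and state dimension


class ToySSM:
    # Toy diagonal linear SSM: h_t = a * h_{t-1} + B x_t, logits_t = W h_t.
    # The recurrent state h is the Markov state that makes backtracking
    # non-trivial, unlike a Transformer whose state is its window of past tokens.
    def __init__(self):
        self.a = rng.uniform(0.5, 0.9, size=D)       # per-channel state decay
        self.B = rng.normal(size=(D, VOCAB)) * 0.1   # input projection (one-hot tokens)
        self.W = rng.normal(size=(VOCAB, D))         # output head
        self.h = np.zeros(D)                         # recurrent (Markov) state

    def step(self, token):
        x = np.zeros(VOCAB)
        x[token] = 1.0
        self.h = self.a * self.h + self.B @ x
        return self.W @ self.h                       # logits for the next token


def draft_tokens(context, k):
    # Hypothetical drafter standing in for an n-gram draft: repeat the last token.
    return [context[-1]] * k


def speculative_step(model, context, k=5):
    # One round of greedy speculative decoding with state checkpointing and replay.
    draft = draft_tokens(context, k)
    h_ckpt = model.h.copy()                          # checkpoint the state before speculating
    inputs = [context[-1]] + draft[:-1]              # tokens fed during the verification pass
    preds = [int(np.argmax(model.step(tok))) for tok in inputs]

    accepted = 0                                     # longest draft prefix the base model agrees with
    while accepted < k and preds[accepted] == draft[accepted]:
        accepted += 1

    if accepted == k:                                # every draft token verified; state already correct
        return draft

    # Mismatch: commit the accepted prefix plus the base model's own token, then
    # rebuild the recurrent state by restoring the checkpoint and replaying only
    # the inputs that precede the committed correction.
    committed = draft[:accepted] + [preds[accepted]]
    model.h = h_ckpt
    for tok in inputs[:accepted + 1]:
        model.step(tok)
    return committed


if __name__ == "__main__":
    model = ToySSM()
    context = [1]
    for _ in range(4):
        context += speculative_step(model, context)
    print(context)

In a Transformer, backtracking would instead amount to truncating the cache of past tokens; here the single recurrent state must be restored or recomputed, which is exactly the difficulty the paper's two proposed methods address.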

Cite this Paper


BibTeX
@InProceedings{pmlr-v262-wu24a,
  title     = {Snakes and Ladders: Accelerating {SSM} Inference with Speculative Decoding},
  author    = {Wu, Yangchao and Dukler, Yonatan and Trager, Matthew and Achille, Alessandro and Xia, Wei and Soatto, Stefano},
  booktitle = {Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop},
  pages     = {292--304},
  year      = {2024},
  editor    = {Rezagholizadeh, Mehdi and Passban, Peyman and Samiee, Soheila and Partovi Nia, Vahid and Cheng, Yu and Deng, Yue and Liu, Qun and Chen, Boxing},
  volume    = {262},
  series    = {Proceedings of Machine Learning Research},
  month     = {14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v262/main/assets/wu24a/wu24a.pdf},
  url       = {https://proceedings.mlr.press/v262/wu24a.html}
}
Endnote
%0 Conference Paper
%T Snakes and Ladders: Accelerating SSM Inference with Speculative Decoding
%A Yangchao Wu
%A Yonatan Dukler
%A Matthew Trager
%A Alessandro Achille
%A Wei Xia
%A Stefano Soatto
%B Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop
%C Proceedings of Machine Learning Research
%D 2024
%E Mehdi Rezagholizadeh
%E Peyman Passban
%E Soheila Samiee
%E Vahid Partovi Nia
%E Yu Cheng
%E Yue Deng
%E Qun Liu
%E Boxing Chen
%F pmlr-v262-wu24a
%I PMLR
%P 292--304
%U https://proceedings.mlr.press/v262/wu24a.html
%V 262
APA
Wu, Y., Dukler, Y., Trager, M., Achille, A., Xia, W. & Soatto, S. (2024). Snakes and Ladders: Accelerating SSM Inference with Speculative Decoding. Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, in Proceedings of Machine Learning Research 262:292-304. Available from https://proceedings.mlr.press/v262/wu24a.html.
