Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies

Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Gaurav Jain, Oren Pereg, Moshe Wasserblat, David Harel
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:59598-59620, 2025.

Abstract

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass. However, existing SD approaches require the drafter and target models to share the same vocabulary, thus limiting the pool of possible drafters, often necessitating the training of a drafter from scratch. We present three new SD methods that remove this shared-vocabulary constraint. All three methods preserve the target distribution (i.e., they are lossless) and work with off-the-shelf models without requiring additional training or modifications. Empirically, on summarization, programming, and long-context tasks, our algorithms demonstrate significant speedups of up to 2.8x over standard autoregressive decoding. By enabling any off-the-shelf model to serve as a drafter and requiring no retraining, this work substantially broadens the applicability of the SD framework in practice.
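As background for the shared-vocabulary constraint the paper removes, below is a minimal NumPy sketch of the standard lossless verification rule of speculative decoding (Leviathan et al., 2023), not of the paper's three new algorithms: each draft token x is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is resampled from the normalized residual max(p - q, 0), which provably preserves the target distribution. All names and signatures here are illustrative, not taken from the paper.

```python
import numpy as np

def speculative_verify(p, q, draft_tokens, rng):
    """One lossless verification step of standard speculative decoding.

    p : (k+1, V) array -- target-model probabilities at each of the k
        draft positions, plus one extra row for a "bonus" token.
    q : (k, V) array   -- drafter probabilities at the same positions.
    draft_tokens : length-k sequence of drafted token ids.

    Each draft token x is accepted with probability min(1, p(x)/q(x));
    on the first rejection, a replacement is drawn from the normalized
    residual max(p - q, 0) and verification stops. The returned tokens
    are distributed exactly as samples from the target p (losslessness).
    """
    out = []
    for i, x in enumerate(draft_tokens):
        if rng.random() < min(1.0, p[i, x] / max(q[i, x], 1e-12)):
            out.append(int(x))                       # draft token accepted
        else:
            residual = np.maximum(p[i] - q[i], 0.0)  # rejection: resample from
            residual /= residual.sum()               # the residual distribution
            out.append(int(rng.choice(residual.size, p=residual)))
            return out                               # stop at first rejection
    # All k drafts accepted: take a free bonus token from the target.
    out.append(int(rng.choice(p.shape[1], p=p[-1])))
    return out
```

Note that both the ratio p(x)/q(x) and the residual p - q are computed elementwise over a single vocabulary of size V, so this rule requires the drafter and target to share one tokenizer. The paper's contribution is three algorithms that retain this losslessness even when the two models use different vocabularies.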

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-timor25a,
  title     = {Accelerating {LLM} Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies},
  author    = {Timor, Nadav and Mamou, Jonathan and Korat, Daniel and Berchansky, Moshe and Jain, Gaurav and Pereg, Oren and Wasserblat, Moshe and Harel, David},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {59598--59620},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/timor25a/timor25a.pdf},
  url       = {https://proceedings.mlr.press/v267/timor25a.html},
  abstract  = {Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass. However, existing SD approaches require the drafter and target models to share the same vocabulary, thus limiting the pool of possible drafters, often necessitating the training of a drafter from scratch. We present three new SD methods that remove this shared-vocabulary constraint. All three methods preserve the target distribution (i.e., they are lossless) and work with off-the-shelf models without requiring additional training or modifications. Empirically, on summarization, programming, and long-context tasks, our algorithms demonstrate significant speedups of up to 2.8x over standard autoregressive decoding. By enabling any off-the-shelf model to serve as a drafter and requiring no retraining, this work substantially broadens the applicability of the SD framework in practice.}
}
Endnote
%0 Conference Paper
%T Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
%A Nadav Timor
%A Jonathan Mamou
%A Daniel Korat
%A Moshe Berchansky
%A Gaurav Jain
%A Oren Pereg
%A Moshe Wasserblat
%A David Harel
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-timor25a
%I PMLR
%P 59598--59620
%U https://proceedings.mlr.press/v267/timor25a.html
%V 267
%X Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass. However, existing SD approaches require the drafter and target models to share the same vocabulary, thus limiting the pool of possible drafters, often necessitating the training of a drafter from scratch. We present three new SD methods that remove this shared-vocabulary constraint. All three methods preserve the target distribution (i.e., they are lossless) and work with off-the-shelf models without requiring additional training or modifications. Empirically, on summarization, programming, and long-context tasks, our algorithms demonstrate significant speedups of up to 2.8x over standard autoregressive decoding. By enabling any off-the-shelf model to serve as a drafter and requiring no retraining, this work substantially broadens the applicability of the SD framework in practice.
APA
Timor, N., Mamou, J., Korat, D., Berchansky, M., Jain, G., Pereg, O., Wasserblat, M. & Harel, D. (2025). Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:59598-59620. Available from https://proceedings.mlr.press/v267/timor25a.html.
