Improving Multi-candidate Speculative Decoding

XiaoFan Lu; Yixiao Zeng; Marco Levorato; FeiYang Ma; ZiXu Yu

Improving Multi-candidate Speculative Decoding

XiaoFan Lu, Yixiao Zeng, Marco Levorato, FeiYang Ma, ZiXu Yu

Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, PMLR 262:382-394, 2024.

Abstract

Speculative Decoding (SD) is a technique to accelerate the inference of Large Language Models (LLMs) by using a lower complexity draft model to propose candidate tokens verified by a larger target model. To further improve efficiency, Multi-Candidate Speculative Decoding (MCSD) improves upon this by sampling multiple candidate tokens from the draft model at each step and verifying them in parallel, thus increasing the chances of accepting a token and reducing generation time. Existing MCSD methods rely on the draft model to initialize the multi-candidate sequences and use static length and tree attention structure for draft generation. However, such an approach suffers from the draft and target model’s output distribution differences, especially in a dynamic generation context. In this work, we introduce a new version of MCSD that includes a target model initialized multi-candidate generation, a dynamic sliced topology-aware causal mask for dynamic length adjustment, and decision models to optimize early stopping. We experimented with our method on Llama 2-7B and its variants and observed a maximum 27.5% speedup compared to our MCSD baseline across three benchmarks with Llama 2-7B as the target model and JackFram 68M as the draft model. Additionally, we evaluate the effects of using the target model initialized multi-candidate process with different draft models on output quality.

Cite this Paper

BibTeX


@InProceedings{pmlr-v262-lu24a,
  title = 	 {Improving Multi-candidate Speculative Decoding},
  author =       {Lu, XiaoFan and Zeng, Yixiao and Levorato, Marco and Ma, FeiYang and Yu, ZiXu},
  booktitle = 	 {Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop},
  pages = 	 {382--394},
  year = 	 {2024},
  editor = 	 {Rezagholizadeh, Mehdi and Passban, Peyman and Samiee, Soheila and Partovi Nia, Vahid and Cheng, Yu and Deng, Yue and Liu, Qun and Chen, Boxing},
  volume = 	 {262},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {14 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v262/main/assets/lu24a/lu24a.pdf},
  url = 	 {https://proceedings.mlr.press/v262/lu24a.html},
  abstract = 	 {Speculative Decoding (SD) is a technique to accelerate the inference of Large Language Models (LLMs) by using a lower complexity draft model to propose candidate tokens verified by a larger target model. To further improve efficiency, Multi-Candidate Speculative Decoding (MCSD) improves upon this by sampling multiple candidate tokens from the draft model at each step and verifying them in parallel, thus increasing the chances of accepting a token and reducing generation time. Existing MCSD methods rely on the draft model to initialize the multi-candidate sequences and use static length and tree attention structure for draft generation. However, such an approach suffers from the draft and target model’s output distribution differences, especially in a dynamic generation context. In this work, we introduce a new version of MCSD that includes a target model initialized multi-candidate generation, a dynamic sliced topology-aware causal mask for dynamic length adjustment, and decision models to optimize early stopping. We experimented with our method on Llama 2-7B and its variants and observed a maximum 27.5% speedup compared to our MCSD baseline across three benchmarks with Llama 2-7B as the target model and JackFram 68M as the draft model. Additionally, we evaluate the effects of using the target model initialized multi-candidate process with different draft models on output quality.}
}

Endnote

%0 Conference Paper
%T Improving Multi-candidate Speculative Decoding
%A XiaoFan Lu
%A Yixiao Zeng
%A Marco Levorato
%A FeiYang Ma
%A ZiXu Yu
%B Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop
%C Proceedings of Machine Learning Research
%D 2024
%E Mehdi Rezagholizadeh
%E Peyman Passban
%E Soheila Samiee
%E Vahid Partovi Nia
%E Yu Cheng
%E Yue Deng
%E Qun Liu
%E Boxing Chen	
%F pmlr-v262-lu24a
%I PMLR
%P 382--394
%U https://proceedings.mlr.press/v262/lu24a.html
%V 262
%X Speculative Decoding (SD) is a technique to accelerate the inference of Large Language Models (LLMs) by using a lower complexity draft model to propose candidate tokens verified by a larger target model. To further improve efficiency, Multi-Candidate Speculative Decoding (MCSD) improves upon this by sampling multiple candidate tokens from the draft model at each step and verifying them in parallel, thus increasing the chances of accepting a token and reducing generation time. Existing MCSD methods rely on the draft model to initialize the multi-candidate sequences and use static length and tree attention structure for draft generation. However, such an approach suffers from the draft and target model’s output distribution differences, especially in a dynamic generation context. In this work, we introduce a new version of MCSD that includes a target model initialized multi-candidate generation, a dynamic sliced topology-aware causal mask for dynamic length adjustment, and decision models to optimize early stopping. We experimented with our method on Llama 2-7B and its variants and observed a maximum 27.5% speedup compared to our MCSD baseline across three benchmarks with Llama 2-7B as the target model and JackFram 68M as the draft model. Additionally, we evaluate the effects of using the target model initialized multi-candidate process with different draft models on output quality.

APA


Lu, X., Zeng, Y., Levorato, M., Ma, F. & Yu, Z.. (2024). Improving Multi-candidate Speculative Decoding. Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, in Proceedings of Machine Learning Research 262:382-394 Available from https://proceedings.mlr.press/v262/lu24a.html.

Related Material

Download PDF