Attention-Level Speculation

Jack Cai, Ammar Vora, Randolph Zhang, Mark O’Connor, Mark C. Jeffrey
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:6291-6318, 2025.

Abstract

As Large Language Models (LLMs) grow in size and context length, efficient inference strategies are essential to maintain low-latency token generation. Unfortunately, conventional tensor and data parallelism face diminishing returns when scaling across multiple devices. We propose a novel form—attention-level speculative parallelism (ALSpec)—that predicts self-attention outputs to execute subsequent operations early on separate devices. Our approach overlaps attention and non-attention computations, reducing the attention latency overhead at 128K context length by up to 5x and improving end-to-end decode latency by up to 1.65x, all without sacrificing quality. We establish the fundamental pillars for speculative execution and provide an execution paradigm that simplifies implementation. We show that existing attention-approximation methods perform well on simple information retrieval tasks, but they fail in advanced reasoning and math. Combined with speculative execution, we can approximate up to 90% of self-attention without harming model correctness. Demonstrated on Tenstorrent’s NPU devices, we scale up LLM inference beyond current techniques, paving the way for faster inference in transformer models.
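
A minimal sketch (not taken from the paper) of the idea described in the abstract: a cheap predictor guesses the self-attention output so downstream feed-forward work can start early, and the speculative result is kept only if the exact attention output confirms the guess. The window-based predictor, the cosine-similarity acceptance test, and all names (approx_attention, exact_attention, alspec_step) are illustrative assumptions; the paper overlaps the two paths across devices, whereas this sketch runs them sequentially for clarity.

# Illustrative sketch of attention-level speculation, not the paper's implementation.
# Assumptions: the predictor is exact attention over only the most recent cached tokens,
# and a speculative result is accepted when it is close (cosine similarity) to the exact output.
import torch
import torch.nn.functional as F

def exact_attention(q, k_cache, v_cache):
    # q: (d,), k_cache/v_cache: (T, d) -- single-head decode step over the full context
    scores = k_cache @ q / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v_cache

def approx_attention(q, k_cache, v_cache, window=256):
    # Cheap predictor: attend only to the last `window` cached tokens.
    return exact_attention(q, k_cache[-window:], v_cache[-window:])

def mlp(x, w1, w2):
    # Stand-in for the non-attention work that is executed speculatively.
    return F.gelu(x @ w1) @ w2

def alspec_step(q, k_cache, v_cache, w1, w2, tol=0.99):
    # 1) Predict the attention output cheaply and start downstream work early.
    predicted = approx_attention(q, k_cache, v_cache)
    speculative_out = mlp(predicted, w1, w2)       # would run on a separate device
    # 2) In parallel (sequential here for clarity), compute the exact attention.
    exact = exact_attention(q, k_cache, v_cache)
    # 3) Verify: keep the speculative result if the prediction was close enough.
    if F.cosine_similarity(predicted, exact, dim=0) >= tol:
        return speculative_out                      # speculation accepted
    return mlp(exact, w1, w2)                       # mis-speculation: redo downstream work

d, T = 64, 4096
q = torch.randn(d)
k_cache, v_cache = torch.randn(T, d), torch.randn(T, d)
w1, w2 = torch.randn(d, 4 * d), torch.randn(4 * d, d)
print(alspec_step(q, k_cache, v_cache, w1, w2).shape)   # torch.Size([64])
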

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-cai25g,
  title     = {Attention-Level Speculation},
  author    = {Cai, Jack and Vora, Ammar and Zhang, Randolph and O'Connor, Mark and Jeffrey, Mark C.},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {6291--6318},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/cai25g/cai25g.pdf},
  url       = {https://proceedings.mlr.press/v267/cai25g.html},
  abstract  = {As Large Language Models (LLMs) grow in size and context length, efficient inference strategies are essential to maintain low-latency token generation. Unfortunately, conventional tensor and data parallelism face diminishing returns when scaling across multiple devices. We propose a novel form—attention-level speculative parallelism (ALSpec)—that predicts self-attention outputs to execute subsequent operations early on separate devices. Our approach overlaps attention and non-attention computations, reducing the attention latency overhead at 128K context length by up to 5x and improving end-to-end decode latency by up to 1.65x, all without sacrificing quality. We establish the fundamental pillars for speculative execution and provide an execution paradigm that simplifies implementation. We show that existing attention-approximation methods perform well on simple information retrieval tasks, but they fail in advanced reasoning and math. Combined with speculative execution, we can approximate up to 90% of self-attention without harming model correctness. Demonstrated on Tenstorrent’s NPU devices, we scale up LLM inference beyond current techniques, paving the way for faster inference in transformer models.}
}
Endnote
%0 Conference Paper
%T Attention-Level Speculation
%A Jack Cai
%A Ammar Vora
%A Randolph Zhang
%A Mark O’Connor
%A Mark C. Jeffrey
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-cai25g
%I PMLR
%P 6291--6318
%U https://proceedings.mlr.press/v267/cai25g.html
%V 267
%X As Large Language Models (LLMs) grow in size and context length, efficient inference strategies are essential to maintain low-latency token generation. Unfortunately, conventional tensor and data parallelism face diminishing returns when scaling across multiple devices. We propose a novel form—attention-level speculative parallelism (ALSpec)—that predicts self-attention outputs to execute subsequent operations early on separate devices. Our approach overlaps attention and non-attention computations, reducing the attention latency overhead at 128K context length by up to 5x and improving end-to-end decode latency by up to 1.65x, all without sacrificing quality. We establish the fundamental pillars for speculative execution and provide an execution paradigm that simplifies implementation. We show that existing attention-approximation methods perform well on simple information retrieval tasks, but they fail in advanced reasoning and math. Combined with speculative execution, we can approximate up to 90% of self-attention without harming model correctness. Demonstrated on Tenstorrent’s NPU devices, we scale up LLM inference beyond current techniques, paving the way for faster inference in transformer models.
APA
Cai, J., Vora, A., Zhang, R., O’Connor, M., & Jeffrey, M. C. (2025). Attention-Level Speculation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:6291-6318. Available from https://proceedings.mlr.press/v267/cai25g.html.

Related Material

Download PDF: https://raw.githubusercontent.com/mlresearch/v267/main/assets/cai25g/cai25g.pdf