SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models

Han-Byul Kim, Duc N.M Hoang, Arnav Kundu, Mohammad Samragh, Minsik Cho
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:30086-30098, 2025.

Abstract

With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieving scalability and low latency. We therefore introduce a novel optimization technique, Sync-Point Drop (SPD), which reduces communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. Specifically, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on their sensitivity to model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD reduced overall inference latency by about 20% with <1% accuracy regression for LLaMA2-70B inference over 8 GPUs.
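To make the idea concrete, the sketch below is a minimal illustration (not taken from the paper; the class name TPBlockSketch and the drop_sync flag are assumptions for illustration) of a Megatron-style tensor-parallel attention block in PyTorch in which the usual all-reduce on the attention output, the sync point that SPD targets, can be skipped.

# Minimal sketch, assuming a Megatron-style column/row-parallel attention block;
# names such as TPBlockSketch and drop_sync are illustrative, not the authors' code.
import torch
import torch.nn as nn
import torch.distributed as dist

class TPBlockSketch(nn.Module):
    def __init__(self, hidden: int, heads: int, tp_size: int, drop_sync: bool = False):
        super().__init__()
        assert heads % tp_size == 0 and hidden % tp_size == 0
        self.drop_sync = drop_sync            # True => skip the attention-output all-reduce
        shard = hidden // tp_size
        # Column-parallel QKV projection and row-parallel output projection per rank.
        self.qkv = nn.Linear(hidden, 3 * shard, bias=False)
        self.out = nn.Linear(shard, hidden, bias=False)
        self.heads_per_rank = heads // tp_size
        self.head_dim = hidden // heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z):  # (b, t, shard) -> (b, heads_per_rank, t, head_dim)
            return z.view(b, t, self.heads_per_rank, self.head_dim).transpose(1, 2)

        attn = nn.functional.scaled_dot_product_attention(split(q), split(k), split(v))
        attn = attn.transpose(1, 2).reshape(b, t, -1)
        y = self.out(attn)                    # each rank holds a partial sum of the output
        if not self.drop_sync and dist.is_available() and dist.is_initialized():
            dist.all_reduce(y)                # the usual tensor-parallel sync point
        # With drop_sync=True, each rank continues with its local partial output;
        # the paper's block design and sensitivity-based selection limit the accuracy impact.
        return x + y

In the paper, which blocks drop their sync point is decided from each block's measured sensitivity to model accuracy; this sketch leaves that selection to the caller via drop_sync.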

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-kim25g,
  title     = {{SPD}: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models},
  author    = {Kim, Han-Byul and Hoang, Duc N.M and Kundu, Arnav and Samragh, Mohammad and Cho, Minsik},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {30086--30098},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/kim25g/kim25g.pdf},
  url       = {https://proceedings.mlr.press/v267/kim25g.html},
  abstract  = {With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieve scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on their sensitivity to the model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD offered about 20% overall inference latency reduction with $<$ 1% accuracy regression for LLaMA2-70B inference over 8 GPUs.}
}
Endnote
%0 Conference Paper
%T SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models
%A Han-Byul Kim
%A Duc N.M Hoang
%A Arnav Kundu
%A Mohammad Samragh
%A Minsik Cho
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-kim25g
%I PMLR
%P 30086--30098
%U https://proceedings.mlr.press/v267/kim25g.html
%V 267
%X With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieve scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on their sensitivity to the model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD offered about 20% overall inference latency reduction with <1% accuracy regression for LLaMA2-70B inference over 8 GPUs.
APA
Kim, H., Hoang, D.N., Kundu, A., Samragh, M. & Cho, M. (2025). SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:30086-30098. Available from https://proceedings.mlr.press/v267/kim25g.html.