Practical Performance Guarantees for Pipelined DNN Inference

Aaron Archer, Matthew Fahrbach, Kuikui Liu, Prakash Prabhu
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:1655-1671, 2024.

Abstract

We optimize pipeline parallelism for deep neural network (DNN) inference by partitioning model graphs into $k$ stages and minimizing the running time of the bottleneck stage, including communication. We give practical and effective algorithms for this NP-hard problem, but our emphasis is on tackling the practitioner’s dilemma of deciding when a solution is good enough. To this end, we design novel mixed integer programming (MIP) relaxations for proving lower bounds. Applying these methods to a diverse testbed of 369 production models, for $k \in \{2, 4, 8, 16, 32, 64\}$, we empirically show that these lower bounds are strong enough to be useful in practice. Our lower bounds are substantially stronger than standard combinatorial bounds. For example, evaluated via geometric means across a production testbed with $k = 16$ pipeline stages, our MIP formulations raise the lower bound from 0.4598 to 0.9452, expressed as a fraction of the best partition found. In other words, our improved lower bounds close the optimality gap by a factor of 9.855x.
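To make the objective concrete, the sketch below (illustrative only, not the paper's algorithm) scores a candidate partition of a linear chain of ops into $k$ contiguous stages: each stage's time is the sum of its op costs plus the cost of shipping its output activations across the cut to the next stage, and the objective is the maximum (bottleneck) stage time. The op costs, communication costs, and the chain assumption are all hypothetical; the paper partitions general model graphs. The final lines reproduce the gap-closure arithmetic behind the 9.855x figure, up to rounding of the reported bounds.

from typing import Sequence

def bottleneck_time(op_cost: Sequence[float],
                    comm_cost: Sequence[float],
                    cuts: Sequence[int]) -> float:
    """Bottleneck (max) stage time for a chain of ops split after the given cut indices."""
    boundaries = [-1] + sorted(cuts) + [len(op_cost) - 1]
    stage_times = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        compute = sum(op_cost[lo + 1:hi + 1])                    # ops assigned to this stage
        send = comm_cost[hi] if hi < len(op_cost) - 1 else 0.0   # activations sent to the next stage
        stage_times.append(compute + send)
    return max(stage_times)

# Toy example: 6 ops split into k = 3 stages by cutting after ops 1 and 3.
print(bottleneck_time([4, 2, 3, 3, 1, 5], [1, 1, 2, 1, 1, 0], [1, 3]))  # -> 7.0

# Gap-closure arithmetic from the abstract (k = 16, geometric means over the testbed):
# lower bounds are expressed as a fraction of the best partition found.
combinatorial_lb, mip_lb = 0.4598, 0.9452
print((1 - combinatorial_lb) / (1 - mip_lb))  # ~9.86x smaller optimality gap (abstract: 9.855x)

A partitioning algorithm would search over choices of cuts to minimize this bottleneck time, while the paper's MIP relaxations certify how far any found partition can be from optimal.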

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-archer24a,
  title     = {Practical Performance Guarantees for Pipelined {DNN} Inference},
  author    = {Archer, Aaron and Fahrbach, Matthew and Liu, Kuikui and Prabhu, Prakash},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {1655--1671},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/archer24a/archer24a.pdf},
  url       = {https://proceedings.mlr.press/v235/archer24a.html},
  abstract  = {We optimize pipeline parallelism for deep neural network (DNN) inference by partitioning model graphs into $k$ stages and minimizing the running time of the bottleneck stage, including communication. We give practical and effective algorithms for this NP-hard problem, but our emphasis is on tackling the practitioner’s dilemma of deciding when a solution is good enough. To this end, we design novel mixed integer programming (MIP) relaxations for proving lower bounds. Applying these methods to a diverse testbed of 369 production models, for $k \in \{2, 4, 8, 16, 32, 64\}$, we empirically show that these lower bounds are strong enough to be useful in practice. Our lower bounds are substantially stronger than standard combinatorial bounds. For example, evaluated via geometric means across a production testbed with $k = 16$ pipeline stages, our MIP formulations raise the lower bound from 0.4598 to 0.9452, expressed as a fraction of the best partition found. In other words, our improved lower bounds close the optimality gap by a factor of 9.855x.}
}
APA
Archer, A., Fahrbach, M., Liu, K., & Prabhu, P. (2024). Practical Performance Guarantees for Pipelined DNN Inference. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:1655-1671. Available from https://proceedings.mlr.press/v235/archer24a.html.
