Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD

Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, Priya Nagpurkar
Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, PMLR 84:803-812, 2018.

Abstract

Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers from delays in waiting for the slowest learners (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness that can adversely affect convergence. In this work we present the first theoretical characterization of the speed-up offered by asynchronous methods by analyzing the trade-off between the error in the trained model and the actual training runtime (wallclock time). The novelty in our work is that our runtime analysis considers random straggler delays, which helps us design and compare distributed SGD algorithms that strike a balance between stragglers and staleness. We also present a new convergence analysis of asynchronous SGD variants without bounded or exponential delay assumptions, and a novel learning rate schedule to compensate for gradient staleness.
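To make the trade-off concrete, the following is a minimal, self-contained Python sketch (not code from the paper) that contrasts synchronous SGD, where every step waits for the slowest of P learners, with asynchronous SGD, where updates are applied as learners finish and the gradients they return may be stale. The toy quadratic objective, the exponential straggler-delay model, and the 1/(1 + staleness) learning-rate damping used here to stand in for the paper's staleness-compensating schedule are all illustrative assumptions.

# Minimal sketch (not the paper's algorithm): toy event-driven comparison of
# synchronous vs. asynchronous SGD under random straggler delays.
# Delay model, objective, constants, and staleness damping are illustrative.
import heapq
import numpy as np

rng = np.random.default_rng(0)
d, P, T = 10, 4, 2000             # dimension, number of learners, gradient budget
w_star = rng.normal(size=d)       # optimum of the quadratic objective
eta, noise = 0.05, 0.5

def stoch_grad(w):
    """Stochastic gradient of f(w) = 0.5 * ||w - w_star||^2 plus Gaussian noise."""
    return (w - w_star) + noise * rng.normal(size=d)

def sync_sgd():
    """Synchronous SGD: each step waits for the slowest of P learners."""
    w, t = np.zeros(d), 0.0
    for _ in range(T // P):
        t += rng.exponential(1.0, size=P).max()   # wallclock cost: max of P delays
        g = np.mean([stoch_grad(w) for _ in range(P)], axis=0)
        w -= eta * g
    return np.linalg.norm(w - w_star), t

def async_sgd():
    """Asynchronous SGD: updates applied as learners finish, using stale gradients."""
    w, t = np.zeros(d), 0.0
    # each learner starts on the initial model; heap is keyed by finish time
    jobs = [(rng.exponential(1.0), p, w.copy(), 0) for p in range(P)]
    heapq.heapify(jobs)
    step = 0
    while step < T:
        finish, p, w_read, read_step = heapq.heappop(jobs)
        t = finish
        staleness = step - read_step
        # assumed staleness-damped learning rate (illustrative, not the paper's schedule)
        w -= (eta / (1.0 + staleness)) * stoch_grad(w_read)
        step += 1
        heapq.heappush(jobs, (t + rng.exponential(1.0), p, w.copy(), step))
    return np.linalg.norm(w - w_star), t

for name, run in [("sync", sync_sgd), ("async", async_sgd)]:
    err, wallclock = run()
    print(f"{name:5s}  error={err:.3f}  wallclock={wallclock:.1f}")

Under these toy assumptions, the asynchronous run completes the same gradient budget in noticeably less wallclock time, at the cost of applying stale gradients; this error-versus-runtime tension is what the paper analyzes.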

Cite this Paper


BibTeX
@InProceedings{pmlr-v84-dutta18a,
  title     = {Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD},
  author    = {Dutta, Sanghamitra and Joshi, Gauri and Ghosh, Soumyadip and Dube, Parijat and Nagpurkar, Priya},
  booktitle = {Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics},
  pages     = {803--812},
  year      = {2018},
  editor    = {Storkey, Amos and Perez-Cruz, Fernando},
  volume    = {84},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--11 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v84/dutta18a/dutta18a.pdf},
  url       = {https://proceedings.mlr.press/v84/dutta18a.html},
  abstract  = {Distributed Stochastic Gradient Descent (SGD) when run in a synchronous manner, suffers from delays in waiting for the slowest learners (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness that can adversely affect convergence. In this work we present the first theoretical characterization of the speed-up offered by asynchronous methods by analyzing the trade-off between the error in the trained model and the actual training runtime (wallclock time). The novelty in our work is that our runtime analysis considers random straggler delays, which helps us design and compare distributed SGD algorithms that strike a balance between stragglers and staleness. We also present a new convergence analysis of asynchronous SGD variants without bounded or exponential delay assumptions, and a novel learning rate schedule to compensate for gradient staleness.}
}
Endnote
%0 Conference Paper
%T Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD
%A Sanghamitra Dutta
%A Gauri Joshi
%A Soumyadip Ghosh
%A Parijat Dube
%A Priya Nagpurkar
%B Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2018
%E Amos Storkey
%E Fernando Perez-Cruz
%F pmlr-v84-dutta18a
%I PMLR
%P 803--812
%U https://proceedings.mlr.press/v84/dutta18a.html
%V 84
%X Distributed Stochastic Gradient Descent (SGD) when run in a synchronous manner, suffers from delays in waiting for the slowest learners (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness that can adversely affect convergence. In this work we present the first theoretical characterization of the speed-up offered by asynchronous methods by analyzing the trade-off between the error in the trained model and the actual training runtime (wallclock time). The novelty in our work is that our runtime analysis considers random straggler delays, which helps us design and compare distributed SGD algorithms that strike a balance between stragglers and staleness. We also present a new convergence analysis of asynchronous SGD variants without bounded or exponential delay assumptions, and a novel learning rate schedule to compensate for gradient staleness.
APA
Dutta, S., Joshi, G., Ghosh, S., Dube, P. & Nagpurkar, P. (2018). Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD. Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 84:803-812. Available from https://proceedings.mlr.press/v84/dutta18a.html.