Accelerating Gossip SGD with Periodic Global Averaging

Yiming Chen, Kun Yuan, Yingya Zhang, Pan Pan, Yinghui Xu, Wotao Yin
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:1791-1802, 2021.

Abstract

Communication overhead hinders the scalability of large-scale distributed training. Gossip SGD, where each node averages only with its neighbors, is more communication-efficient than the prevalent parallel SGD. However, its convergence rate is inversely proportional to the quantity $1-\beta$, which measures the network connectivity. On large and sparse networks where $1-\beta \to 0$, Gossip SGD requires more iterations to converge, which offsets its communication benefit. This paper introduces Gossip-PGA, which adds Periodic Global Averaging to accelerate Gossip SGD. Its transient stage, i.e., the number of iterations required to reach the asymptotic linear-speedup stage, improves from $\Omega(\beta^4 n^3/(1-\beta)^4)$ to $\Omega(\beta^4 n^3 H^4)$ for non-convex problems. The influence of network topology on Gossip-PGA can be controlled by the averaging period $H$. Its transient-stage complexity is also superior to that of local SGD, which is of order $\Omega(n^3 H^4)$. Empirical results of large-scale training on image classification (ResNet50) and language modeling (BERT) validate our theoretical findings.
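The sketch below illustrates the update rule the abstract describes: each node takes a local stochastic-gradient step, gossip-averages with its neighbors through a doubly stochastic mixing matrix, and every H-th iteration performs exact global averaging instead. This is a minimal illustrative sketch, not the authors' implementation; the names gossip_pga, grad, W, lr, and num_iters are assumptions chosen for the example.

import numpy as np

def gossip_pga(x0, grad, W, H, lr, num_iters):
    # x0: (n, d) array of initial models, one row per node.
    # grad(i, x): stochastic-gradient oracle for node i at point x (assumed interface).
    # W: (n, n) doubly stochastic gossip matrix, W[i, j] > 0 only for neighboring nodes.
    # H: global-averaging period; every H-th iteration replaces gossip with exact averaging.
    x = np.copy(x0)
    n = x.shape[0]
    for k in range(1, num_iters + 1):
        # Local stochastic-gradient step on every node.
        g = np.stack([grad(i, x[i]) for i in range(n)])
        x = x - lr * g
        if k % H == 0:
            # Periodic global averaging: all nodes synchronize to the exact network mean.
            x = np.tile(x.mean(axis=0), (n, 1))
        else:
            # Ordinary gossip step: each node averages only with its neighbors.
            x = W @ x
    return x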

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-chen21y,
  title     = {Accelerating Gossip SGD with Periodic Global Averaging},
  author    = {Chen, Yiming and Yuan, Kun and Zhang, Yingya and Pan, Pan and Xu, Yinghui and Yin, Wotao},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {1791--1802},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/chen21y/chen21y.pdf},
  url       = {https://proceedings.mlr.press/v139/chen21y.html},
  abstract  = {Communication overhead hinders the scalability of large-scale distributed training. Gossip SGD, where each node averages only with its neighbors, is more communication-efficient than the prevalent parallel SGD. However, its convergence rate is reversely proportional to quantity $1-\beta$ which measures the network connectivity. On large and sparse networks where $1-\beta \to 0$, Gossip SGD requires more iterations to converge, which offsets against its communication benefit. This paper introduces Gossip-PGA, which adds Periodic Global Averaging to accelerate Gossip SGD. Its transient stage, i.e., the iterations required to reach asymptotic linear speedup stage, improves from $\Omega(\beta^4 n^3/(1-\beta)^4)$ to $\Omega(\beta^4 n^3 H^4)$ for non-convex problems. The influence of network topology in Gossip-PGA can be controlled by the averaging period $H$. Its transient-stage complexity is also superior to local SGD which has order $\Omega(n^3 H^4)$. Empirical results of large-scale training on image classification (ResNet50) and language modeling (BERT) validate our theoretical findings.}
}
Endnote
%0 Conference Paper
%T Accelerating Gossip SGD with Periodic Global Averaging
%A Yiming Chen
%A Kun Yuan
%A Yingya Zhang
%A Pan Pan
%A Yinghui Xu
%A Wotao Yin
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-chen21y
%I PMLR
%P 1791--1802
%U https://proceedings.mlr.press/v139/chen21y.html
%V 139
%X Communication overhead hinders the scalability of large-scale distributed training. Gossip SGD, where each node averages only with its neighbors, is more communication-efficient than the prevalent parallel SGD. However, its convergence rate is reversely proportional to quantity $1-\beta$ which measures the network connectivity. On large and sparse networks where $1-\beta \to 0$, Gossip SGD requires more iterations to converge, which offsets against its communication benefit. This paper introduces Gossip-PGA, which adds Periodic Global Averaging to accelerate Gossip SGD. Its transient stage, i.e., the iterations required to reach asymptotic linear speedup stage, improves from $\Omega(\beta^4 n^3/(1-\beta)^4)$ to $\Omega(\beta^4 n^3 H^4)$ for non-convex problems. The influence of network topology in Gossip-PGA can be controlled by the averaging period $H$. Its transient-stage complexity is also superior to local SGD which has order $\Omega(n^3 H^4)$. Empirical results of large-scale training on image classification (ResNet50) and language modeling (BERT) validate our theoretical findings.
APA
Chen, Y., Yuan, K., Zhang, Y., Pan, P., Xu, Y. & Yin, W. (2021). Accelerating Gossip SGD with Periodic Global Averaging. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:1791-1802. Available from https://proceedings.mlr.press/v139/chen21y.html.