Accelerating Gossip SGD with Periodic Global Averaging
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:1791-1802, 2021.
Abstract
Communication overhead hinders the scalability of large-scale distributed training. Gossip SGD, in which each node averages only with its neighbors, is more communication-efficient than the prevalent parallel SGD. However, its convergence rate is inversely proportional to the quantity 1−β, which measures the network connectivity. On large and sparse networks, where 1−β→0, Gossip SGD requires more iterations to converge, which offsets its communication benefit. This paper introduces Gossip-PGA, which adds Periodic Global Averaging to accelerate Gossip SGD. Its transient stage, i.e., the number of iterations required to reach the asymptotic linear-speedup stage, improves from Ω(β^4 n^3 / (1−β)^4) to Ω(β^4 n^3 H^4) for non-convex problems. The influence of network topology on Gossip-PGA can be controlled by the averaging period H. Its transient-stage complexity is also superior to that of local SGD, which has order Ω(n^3 H^4). Empirical results of large-scale training on image classification (ResNet50) and language modeling (BERT) validate our theoretical findings.
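
For intuition, the following is a minimal sketch of the update pattern the abstract describes: each node takes a local SGD step, then averages with its neighbors through a mixing matrix, and every H iterations all nodes perform a global average. The ring topology, toy quadratic losses, and hyperparameters below are illustrative assumptions, not the paper's experimental configuration.

```python
# Minimal NumPy sketch of Gossip-PGA on a toy distributed least-squares problem.
# The ring mixing matrix, step size, and data are illustrative assumptions only.
import numpy as np

def ring_mixing_matrix(n):
    """Symmetric, doubly-stochastic mixing matrix for a ring topology."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1 / 3
        W[i, (i - 1) % n] = 1 / 3
        W[i, (i + 1) % n] = 1 / 3
    return W

def gossip_pga(n=8, d=5, H=4, T=200, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Each node i holds its own local loss f_i(x) = 0.5 * ||A_i x - b_i||^2.
    A = rng.normal(size=(n, 10, d))
    b = rng.normal(size=(n, 10))
    X = np.zeros((n, d))                 # one model copy per node
    W = ring_mixing_matrix(n)

    for t in range(1, T + 1):
        # Local gradient step on each node.
        grads = np.stack([A[i].T @ (A[i] @ X[i] - b[i]) / len(b[i]) for i in range(n)])
        X = X - lr * grads
        if t % H == 0:
            # Periodic global averaging: every H iterations, all nodes
            # synchronize to the network-wide average.
            X = np.tile(X.mean(axis=0), (n, 1))
        else:
            # Gossip step: each node averages only with its ring neighbors.
            X = W @ X
    return X

if __name__ == "__main__":
    X = gossip_pga()
    # Node models should end up close to one another after training.
    print("max deviation from network average:", np.abs(X - X.mean(axis=0)).max())
```

In this sketch, the averaging period H controls how often the global synchronization occurs, which is the knob that limits the influence of the network topology on convergence.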