Asynchronous Decentralized Parallel Stochastic Gradient Descent
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:3043-3052, 2018.
Abstract
Most commonly used distributed machine learning systems are either synchronous or centralized asynchronous. Synchronous algorithms like AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous algorithms using a parameter server suffer from 1) a communication bottleneck at the parameter servers when there are many workers, and 2) significantly worse convergence when the traffic to the parameter server is congested. Can we design an algorithm that is robust in a heterogeneous environment, while being communication efficient and maintaining the best-possible convergence rate? In this paper, we propose an asynchronous decentralized stochastic gradient descent algorithm (AD-PSGD) that satisfies all of the above expectations. Our theoretical analysis shows that AD-PSGD converges at the same optimal $O(1/\sqrt{K})$ rate as SGD and has linear speedup w.r.t. the number of workers. Empirically, AD-PSGD outperforms the best of decentralized parallel SGD (D-PSGD), asynchronous parallel SGD (A-PSGD), and standard data parallel SGD (AllReduce-SGD), often by orders of magnitude in a heterogeneous environment. When training ResNet-50 on ImageNet with up to 128 GPUs, AD-PSGD converges (w.r.t. epochs) similarly to AllReduce-SGD, but each epoch can be up to 4-8x faster than its synchronous counterparts in a network-sharing HPC environment. To the best of our knowledge, AD-PSGD is the first asynchronous algorithm that achieves a similar epoch-wise convergence rate as AllReduce-SGD, at an over 100-GPU scale.
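The abstract does not spell out the update rule, so the following is only a rough illustration of what an asynchronous decentralized scheme can look like, not the authors' implementation. It assumes a gossip-style update: a worker computes a noisy gradient on its own model copy, averages that copy with one randomly chosen ring neighbor, and then applies the gradient. The toy quadratic objective, the ring topology, the function name `simulate_async_decentralized_sgd`, and all parameter values are hypothetical choices made for this sketch. Asynchrony is simulated in a single thread by activating one random worker per step.

```python
import random


def simulate_async_decentralized_sgd(num_workers=8, steps=20000, lr=0.01, seed=0):
    """Toy single-threaded simulation of gossip-style asynchronous decentralized SGD.

    Assumption (not from the abstract): worker i holds the local loss
    0.5 * (x - targets[i])**2, so the minimizer of the average loss is the
    mean of the targets.
    """
    rng = random.Random(seed)
    targets = [float(i) for i in range(num_workers)]
    models = [0.0] * num_workers  # one local model copy per worker

    for _ in range(steps):
        # Simulated asynchrony: a single randomly chosen worker acts at each step.
        i = rng.randrange(num_workers)

        # Noisy gradient evaluated at worker i's current local model.
        grad = (models[i] - targets[i]) + rng.gauss(0.0, 0.1)

        # Pairwise averaging with a random ring neighbor (no central server).
        j = (i + rng.choice((-1, 1))) % num_workers
        avg = 0.5 * (models[i] + models[j])

        # Worker i applies its gradient on top of the averaged model;
        # the neighbor just keeps the average.
        models[i] = avg - lr * grad
        models[j] = avg

    optimum = sum(targets) / num_workers
    return models, optimum


if __name__ == "__main__":
    final_models, optimum = simulate_async_decentralized_sgd()
    print("optimum of the average loss:", optimum)
    print("local models:", [round(x, 3) for x in final_models])
```

In this sketch no worker waits for a global barrier and no node aggregates all gradients; each step touches only two model copies. The local models are still expected to drift toward a shared neighborhood of the global minimizer, which is the intuition behind the robustness and communication claims above.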