On the Computation and Communication Complexity of Parallel SGD with Dynamic Batch Sizes for Stochastic Non-Convex Optimization
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:7174-7183, 2019.
Abstract
For SGD-based distributed stochastic optimization, computation complexity, measured by the convergence rate in terms of the number of stochastic gradient calls, and communication complexity, measured by the number of inter-node communication rounds, are the two most important performance metrics. The classical data-parallel implementation of SGD over N workers can achieve linear speedup of its convergence rate but incurs an inter-node communication round at each batch. We study the benefit of using dynamically increasing batch sizes in parallel SGD for stochastic non-convex optimization by characterizing the attained convergence rate and the required number of communication rounds. We show that for stochastic non-convex optimization under the PL condition, the classical data-parallel SGD with exponentially increasing batch sizes can achieve the fastest known $O(1/(NT))$ convergence with linear speedup using only $\log(T)$ communication rounds. For general stochastic non-convex optimization, we propose a Catalyst-like algorithm to achieve the fastest known $O(1/\sqrt{NT})$ convergence with only $O(\sqrt{NT}\log(\frac{T}{N}))$ communication rounds.
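To make the batch-size schedule concrete, the following is a minimal sketch (not the paper's exact algorithm) of data-parallel SGD on a toy quadratic objective, where each worker's batch size doubles after every communication round; the objective, step size, and noise model here are illustrative assumptions. Because batch sizes grow geometrically, consuming T total samples requires only O(log T) averaging rounds.

```python
import numpy as np

# Illustrative sketch: data-parallel SGD on f(x) = 0.5 * ||x||^2 with
# Gaussian gradient noise. Each worker averages its noise over its local
# batch; one inter-node communication round per batch averages the N
# worker gradients. The per-worker batch size doubles each round.

def noisy_grad(x, batch_size, rng):
    # Stochastic gradient of f: true gradient x plus batch-averaged noise.
    noise = rng.normal(0.0, 1.0, size=(batch_size,) + x.shape).mean(axis=0)
    return x + noise

def parallel_sgd_dynamic_batches(dim=5, n_workers=4, rounds=10, lr=0.3, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)
    batch = 1          # initial per-worker batch size (assumed)
    comm_rounds = 0
    for _ in range(rounds):
        # Each worker computes a gradient on its own local batch...
        grads = [noisy_grad(x, batch, rng) for _ in range(n_workers)]
        # ...then one communication round averages them across workers.
        x = x - lr * np.mean(grads, axis=0)
        comm_rounds += 1
        batch *= 2     # exponentially increasing batch size
    return x, comm_rounds

x, comm_rounds = parallel_sgd_dynamic_batches()
print(comm_rounds, float(np.linalg.norm(x)))
```

After 10 rounds the schedule has consumed 2^10 - 1 samples per worker while communicating only 10 times; a fixed-batch implementation consuming the same number of samples would need roughly a hundred times more rounds.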