CADA: Communication-Adaptive Distributed Adam

Tianyi Chen, Ziye Guo, Yuejiao Sun, Wotao Yin
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:613-621, 2021.

Abstract

Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning. It is often used with its adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient descent method for distributed machine learning, which can be viewed as the communication-adaptive counterpart of the celebrated Adam method — justifying its name CADA. The key components of CADA are a set of new rules tailored for adaptive stochastic gradients that can be implemented to save communication upload. The new algorithms adaptively reuse the stale Adam gradients, thus saving communication, and still have convergence rates comparable to original Adam. In numerical experiments, CADA achieves impressive empirical performance in terms of total communication round reduction.
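As a rough illustration of the communication-skipping idea described in the abstract, the sketch below has each worker upload a fresh stochastic gradient only when it has drifted sufficiently from its last uploaded copy, while the server runs a standard Adam step on the aggregate. The skipping threshold, the quadratic toy problem, and all constants here are assumptions for illustration only; they are not the paper's actual CADA rules.

```python
# Hypothetical sketch of communication-adaptive distributed Adam:
# workers upload a fresh gradient only when it differs enough from the
# last uploaded (stale) one; otherwise the server reuses the stale copy.
import numpy as np

np.random.seed(0)
dim, num_workers = 10, 4
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
threshold = 1e-4  # illustrative reuse threshold, not the paper's rule

# Synthetic least-squares data split across workers.
A = [np.random.randn(20, dim) for _ in range(num_workers)]
b = [np.random.randn(20) for _ in range(num_workers)]

def local_grad(m, x):
    """Stochastic-gradient stand-in: full local gradient of worker m."""
    return A[m].T @ (A[m] @ x - b[m]) / len(b[m])

x = np.zeros(dim)
m_t, v_t = np.zeros(dim), np.zeros(dim)
stale = [local_grad(m, x) for m in range(num_workers)]  # last uploaded grads
uploads = 0

for t in range(1, 201):
    agg = np.zeros(dim)
    for m in range(num_workers):
        g = local_grad(m, x)
        # Upload only if the gradient moved enough since the last upload.
        if np.linalg.norm(g - stale[m]) ** 2 > threshold:
            stale[m] = g      # worker communicates the fresh gradient
            uploads += 1
        agg += stale[m]       # server uses fresh or reused (stale) gradient
    agg /= num_workers

    # Standard Adam step on the aggregated gradient.
    m_t = beta1 * m_t + (1 - beta1) * agg
    v_t = beta2 * v_t + (1 - beta2) * agg ** 2
    m_hat = m_t / (1 - beta1 ** t)
    v_hat = v_t / (1 - beta2 ** t)
    x -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(f"uploads used: {uploads} of {200 * num_workers} possible")
```

Running the sketch shows the server taking every Adam step while only a fraction of the possible worker uploads are actually communicated, which is the kind of communication-round saving the paper targets.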

Cite this Paper


BibTeX
@InProceedings{pmlr-v130-chen21a,
  title     = {CADA: Communication-Adaptive Distributed Adam},
  author    = {Chen, Tianyi and Guo, Ziye and Sun, Yuejiao and Yin, Wotao},
  booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
  pages     = {613--621},
  year      = {2021},
  editor    = {Banerjee, Arindam and Fukumizu, Kenji},
  volume    = {130},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--15 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v130/chen21a/chen21a.pdf},
  url       = {https://proceedings.mlr.press/v130/chen21a.html},
  abstract  = {Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning. It is often used with its adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient descent method for distributed machine learning, which can be viewed as the communication-adaptive counterpart of the celebrated Adam method — justifying its name CADA. The key components of CADA are a set of new rules tailored for adaptive stochastic gradients that can be implemented to save communication upload. The new algorithms adaptively reuse the stale Adam gradients, thus saving communication, and still have convergence rates comparable to original Adam. In numerical experiments, CADA achieves impressive empirical performance in terms of total communication round reduction.}
}
Endnote
%0 Conference Paper
%T CADA: Communication-Adaptive Distributed Adam
%A Tianyi Chen
%A Ziye Guo
%A Yuejiao Sun
%A Wotao Yin
%B Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2021
%E Arindam Banerjee
%E Kenji Fukumizu
%F pmlr-v130-chen21a
%I PMLR
%P 613--621
%U https://proceedings.mlr.press/v130/chen21a.html
%V 130
%X Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning. It is often used with its adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient descent method for distributed machine learning, which can be viewed as the communication-adaptive counterpart of the celebrated Adam method — justifying its name CADA. The key components of CADA are a set of new rules tailored for adaptive stochastic gradients that can be implemented to save communication upload. The new algorithms adaptively reuse the stale Adam gradients, thus saving communication, and still have convergence rates comparable to original Adam. In numerical experiments, CADA achieves impressive empirical performance in terms of total communication round reduction.
APA
Chen, T., Guo, Z., Sun, Y. & Yin, W. (2021). CADA: Communication-Adaptive Distributed Adam. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:613-621. Available from https://proceedings.mlr.press/v130/chen21a.html.
