Accumulated Gradient Normalization

Joeri R. Hermans, Gerasimos Spanakis, Rico Möckel
Proceedings of the Ninth Asian Conference on Machine Learning, PMLR 77:439-454, 2017.

Abstract

This work addresses the instability in asynchronous data parallel optimization. It does so by introducing a novel distributed optimizer which is able to efficiently optimize a centralized model under communication constraints. The optimizer achieves this by pushing a normalized sequence of first-order gradients to a parameter server. As a result, the magnitude of a worker delta is smaller than that of an unnormalized accumulated gradient, while its direction points towards a minimum more reliably than a single first-order gradient does. This in turn keeps the implicit momentum fluctuations across workers more aligned, under the assumption that all workers contribute towards a single minimum. Consequently, our approach mitigates the parameter staleness problem more effectively, since staleness in asynchrony induces (implicit) momentum, and it achieves a better convergence rate than other optimizers such as asynchronous EASGD and DynSGD, which we show empirically.
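To make the mechanism concrete, the following minimal sketch (Python with NumPy; the toy quadratic objective and helper names such as agn_worker_commit are illustrative assumptions, not code from the paper) shows a worker accumulating a sequence of first-order gradients locally and pushing the normalized result to a parameter server.

import numpy as np

def agn_worker_commit(theta, batches, grad_fn, lambda_steps, lr):
    # Take lambda_steps local first-order steps, accumulate the gradients,
    # and return the normalized accumulated gradient (scaled by the
    # learning rate) as the delta to push to the parameter server.
    local_theta = theta.copy()
    accumulated = np.zeros_like(theta)
    for t in range(lambda_steps):
        g = grad_fn(local_theta, batches[t % len(batches)])
        accumulated += g
        local_theta = local_theta - lr * g  # exploratory local step
    return -lr * accumulated / lambda_steps

def parameter_server_update(central_theta, delta):
    # Asynchronously apply a (possibly stale) worker delta to the central model.
    return central_theta + delta

# Toy usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself.
if __name__ == "__main__":
    grad_fn = lambda theta, batch: theta  # the batch is ignored in this toy
    theta = np.array([4.0, -2.0])
    for _ in range(50):
        delta = agn_worker_commit(theta, batches=[None], grad_fn=grad_fn,
                                  lambda_steps=8, lr=0.1)
        theta = parameter_server_update(theta, delta)
    print(theta)  # approaches the minimum at the origin

Dividing the accumulated gradient by the number of local steps keeps each commit's magnitude comparable to a single gradient step, while its direction summarizes the whole local trajectory; this is the property the abstract attributes to the normalized sequence of first-order gradients.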

Cite this Paper


BibTeX
@InProceedings{pmlr-v77-hermans17a,
  title     = {Accumulated Gradient Normalization},
  author    = {Hermans, Joeri R. and Spanakis, Gerasimos and Möckel, Rico},
  booktitle = {Proceedings of the Ninth Asian Conference on Machine Learning},
  pages     = {439--454},
  year      = {2017},
  editor    = {Zhang, Min-Ling and Noh, Yung-Kyun},
  volume    = {77},
  series    = {Proceedings of Machine Learning Research},
  address   = {Yonsei University, Seoul, Republic of Korea},
  month     = {15--17 Nov},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v77/hermans17a/hermans17a.pdf},
  url       = {https://proceedings.mlr.press/v77/hermans17a.html},
  abstract  = {This work addresses the instability in asynchronous data parallel optimization. It does so by introducing a novel distributed optimizer which is able to efficiently optimize a centralized model under communication constraints. The optimizer achieves this by pushing a normalized sequence of first-order gradients to a parameter server. This implies that the magnitude of a worker delta is smaller compared to an accumulated gradient, and provides a better direction towards a minimum compared to first-order gradients, which in turn also forces possible implicit momentum fluctuations to be more aligned since we make the assumption that all workers contribute towards a single minima. As a result, our approach mitigates the parameter staleness problem more effectively since staleness in asynchrony induces (implicit) momentum, and achieves a better convergence rate compared to other optimizers such as asynchronous \textsc{easgd} and \textsc{dynsgd}, which we show empirically.}
}
Endnote
%0 Conference Paper
%T Accumulated Gradient Normalization
%A Joeri R. Hermans
%A Gerasimos Spanakis
%A Rico Möckel
%B Proceedings of the Ninth Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Min-Ling Zhang
%E Yung-Kyun Noh
%F pmlr-v77-hermans17a
%I PMLR
%P 439--454
%U https://proceedings.mlr.press/v77/hermans17a.html
%V 77
%X This work addresses the instability in asynchronous data parallel optimization. It does so by introducing a novel distributed optimizer which is able to efficiently optimize a centralized model under communication constraints. The optimizer achieves this by pushing a normalized sequence of first-order gradients to a parameter server. This implies that the magnitude of a worker delta is smaller compared to an accumulated gradient, and provides a better direction towards a minimum compared to first-order gradients, which in turn also forces possible implicit momentum fluctuations to be more aligned since we make the assumption that all workers contribute towards a single minima. As a result, our approach mitigates the parameter staleness problem more effectively since staleness in asynchrony induces (implicit) momentum, and achieves a better convergence rate compared to other optimizers such as asynchronous EASGD and DynSGD, which we show empirically.
APA
Hermans, J.R., Spanakis, G. & Möckel, R. (2017). Accumulated Gradient Normalization. Proceedings of the Ninth Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 77:439-454. Available from https://proceedings.mlr.press/v77/hermans17a.html.