Accumulated Gradient Normalization

Joeri R. Hermans; Gerasimos Spanakis; Rico Möckel

Proceedings of the Ninth Asian Conference on Machine Learning, PMLR 77:439-454, 2017.

Abstract

This work addresses the instability in asynchronous data parallel optimization. It does so by introducing a novel distributed optimizer which is able to efficiently optimize a centralized model under communication constraints. The optimizer achieves this by pushing a normalized sequence of first-order gradients to a parameter server. This implies that the magnitude of a worker delta is smaller compared to an accumulated gradient, and provides a better direction towards a minimum compared to first-order gradients, which in turn also forces possible implicit momentum fluctuations to be more aligned since we make the assumption that all workers contribute towards a single minimum. As a result, our approach mitigates the parameter staleness problem more effectively since staleness in asynchrony induces (implicit) momentum, and achieves a better convergence rate compared to other optimizers such as asynchronous EASGD and DynSGD, which we show empirically.
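
The worker-side update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not spell out the normalization, so the sketch assumes that "normalized" means dividing the accumulated sequence of local first-order updates by the number of local steps. The names `agn_worker_delta`, `server_apply`, and `quadratic_grad`, the toy quadratic loss, and all parameter values are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

Vector = List[float]


def quadratic_grad(target: Vector) -> Callable[[Vector], Vector]:
    """Gradient of the toy loss 0.5 * ||theta - target||^2 (illustrative only)."""
    def grad(theta: Vector) -> Vector:
        return [t - c for t, c in zip(theta, target)]
    return grad


def agn_worker_delta(theta: Vector,
                     grad_fn: Callable[[Vector], Vector],
                     lr: float,
                     num_local_steps: int) -> Vector:
    """Run several local SGD steps and return the normalized accumulated update.

    Assumption: normalization divides the accumulated update by the number
    of local steps, so the worker delta is smaller in magnitude than the
    raw accumulated gradient while still following its direction.
    """
    local = list(theta)               # worker-local copy of the parameters
    accumulated = [0.0] * len(theta)  # sum of all local updates
    for _ in range(num_local_steps):
        g = grad_fn(local)
        for i, g_i in enumerate(g):
            step = -lr * g_i
            local[i] += step          # explore locally
            accumulated[i] += step    # remember the total displacement
    return [a / num_local_steps for a in accumulated]


def server_apply(center: Vector, delta: Vector) -> Vector:
    """Parameter server: add a worker delta to the central model."""
    return [c + d for c, d in zip(center, delta)]


if __name__ == "__main__":
    grad_fn = quadratic_grad([1.0, -2.0])
    center = [0.0, 0.0]
    delta = agn_worker_delta(center, grad_fn, lr=0.1, num_local_steps=4)
    print(server_apply(center, delta))
```

With a single local step the delta reduces to an ordinary SGD step; with more local steps the worker communicates less often, and each pushed delta stays bounded in magnitude, which is the property the abstract relies on to limit the effect of stale updates.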

Cite this Paper

BibTeX


@InProceedings{pmlr-v77-hermans17a,
  title = 	 {Accumulated Gradient Normalization},
  author = 	 {Hermans, Joeri R. and Spanakis, Gerasimos and Möckel, Rico},
  booktitle = 	 {Proceedings of the Ninth Asian Conference on Machine Learning},
  pages = 	 {439--454},
  year = 	 {2017},
  editor = 	 {Zhang, Min-Ling and Noh, Yung-Kyun},
  volume = 	 {77},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Yonsei University, Seoul, Republic of Korea},
  month = 	 {15--17 Nov},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v77/hermans17a/hermans17a.pdf},
  url = 	 {https://proceedings.mlr.press/v77/hermans17a.html},
  abstract = 	 {This work addresses the instability in asynchronous data parallel optimization. It does so by introducing a novel distributed optimizer which is able to efficiently optimize a centralized model under communication constraints. The optimizer achieves this by pushing a normalized sequence of first-order gradients to a parameter server. This implies that the magnitude of a worker delta is smaller compared to an accumulated gradient, and provides a better direction towards a minimum compared to first-order gradients, which in turn also forces possible implicit momentum fluctuations to be more aligned since we make the assumption that all workers contribute towards a single minimum. As a result, our approach mitigates the parameter staleness problem more effectively since staleness in asynchrony induces (implicit) momentum, and achieves a better convergence rate compared to other optimizers such as asynchronous \textsc{easgd} and \textsc{dynsgd}, which we show empirically.}
}

Endnote

%0 Conference Paper
%T Accumulated Gradient Normalization
%A Joeri R. Hermans
%A Gerasimos Spanakis
%A Rico Möckel
%B Proceedings of the Ninth Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Min-Ling Zhang
%E Yung-Kyun Noh
%F pmlr-v77-hermans17a
%I PMLR
%P 439--454
%U https://proceedings.mlr.press/v77/hermans17a.html
%V 77
%X This work addresses the instability in asynchronous data parallel optimization. It does so by introducing a novel distributed optimizer which is able to efficiently optimize a centralized model under communication constraints. The optimizer achieves this by pushing a normalized sequence of first-order gradients to a parameter server. This implies that the magnitude of a worker delta is smaller compared to an accumulated gradient, and provides a better direction towards a minimum compared to first-order gradients, which in turn also forces possible implicit momentum fluctuations to be more aligned since we make the assumption that all workers contribute towards a single minimum. As a result, our approach mitigates the parameter staleness problem more effectively since staleness in asynchrony induces (implicit) momentum, and achieves a better convergence rate compared to other optimizers such as asynchronous EASGD and DynSGD, which we show empirically.

APA


Hermans, J.R., Spanakis, G. & Möckel, R. (2017). Accumulated Gradient Normalization. Proceedings of the Ninth Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 77:439-454. Available from https://proceedings.mlr.press/v77/hermans17a.html.
