On the distributional properties of adaptive gradients

Zhiyi Zhang, Ziyin Liu
Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, PMLR 161:419-429, 2021.

Abstract

Adaptive gradient methods have achieved remarkable success in training deep neural networks on a wide variety of tasks. However, not much is known about the mathematical and statistical properties of this family of methods. This work aims to provide a series of theoretical analyses of their statistical properties, supported by experiments. In particular, we show that when the underlying gradient obeys a normal distribution, the variance of the magnitude of the update is an increasing and bounded function of time and does not diverge. This work suggests that the divergence of variance is not the cause of the need for warm-up of the Adam optimizer, contrary to what is believed in the current literature.
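
To make the central claim concrete, below is a minimal Monte Carlo sketch (not code from the paper): feeding i.i.d. standard normal gradients into Adam's moment estimates, the empirical variance of the update magnitude |m_t / sqrt(v_t)| increases with the step count while remaining bounded. The hyperparameters (beta1=0.9, beta2=0.999, eps=1e-8), the number of runs, and the helper name adam_update_magnitudes are illustrative assumptions.

# Sketch: empirical variance of the Adam update magnitude under N(0, 1) gradients.
# Assumed setup; not the authors' experimental code.
import numpy as np

def adam_update_magnitudes(steps=200, runs=20000, beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    m = np.zeros(runs)                          # first-moment estimates, one per Monte Carlo run
    v = np.zeros(runs)                          # second-moment estimates
    var_by_step = []
    for t in range(1, steps + 1):
        g = rng.standard_normal(runs)           # gradient sample ~ N(0, 1)
        m = beta1 * m + (1 - beta1) * g         # Adam first-moment update
        v = beta2 * v + (1 - beta2) * g ** 2    # Adam second-moment update
        m_hat = m / (1 - beta1 ** t)            # bias correction
        v_hat = v / (1 - beta2 ** t)
        update = np.abs(m_hat / (np.sqrt(v_hat) + eps))
        var_by_step.append(update.var())        # variance of the update magnitude at step t
    return np.array(var_by_step)

variances = adam_update_magnitudes()
print(variances[[0, 9, 49, 199]])  # variance at steps 1, 10, 50, 200: increasing but bounded

Running this shows the variance rising from near zero at the first step toward a finite plateau, consistent with the abstract's statement that the variance is increasing and bounded rather than divergent.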

Cite this Paper


BibTeX
@InProceedings{pmlr-v161-zhang21a,
  title     = {On the distributional properties of adaptive gradients},
  author    = {Zhang, Zhiyi and Liu, Ziyin},
  booktitle = {Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence},
  pages     = {419--429},
  year      = {2021},
  editor    = {de Campos, Cassio and Maathuis, Marloes H.},
  volume    = {161},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v161/zhang21a/zhang21a.pdf},
  url       = {https://proceedings.mlr.press/v161/zhang21a.html},
  abstract  = {Adaptive gradient methods have achieved remarkable success in training deep neural networks on a wide variety of tasks. However, not much is known about the mathematical and statistical properties of this family of methods. This work aims at providing a series of theoretical analyses of its statistical properties justified by experiments. In particular, we show that when the underlying gradient obeys a normal distribution, the variance of the magnitude of the update is an increasing and bounded function of time and does not diverge. This work suggests that the divergence of variance is not the cause of the need for warm-up of the Adam optimizer, contrary to what is believed in the current literature.}
}
Endnote
%0 Conference Paper
%T On the distributional properties of adaptive gradients
%A Zhiyi Zhang
%A Ziyin Liu
%B Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2021
%E Cassio de Campos
%E Marloes H. Maathuis
%F pmlr-v161-zhang21a
%I PMLR
%P 419--429
%U https://proceedings.mlr.press/v161/zhang21a.html
%V 161
%X Adaptive gradient methods have achieved remarkable success in training deep neural networks on a wide variety of tasks. However, not much is known about the mathematical and statistical properties of this family of methods. This work aims at providing a series of theoretical analyses of its statistical properties justified by experiments. In particular, we show that when the underlying gradient obeys a normal distribution, the variance of the magnitude of the update is an increasing and bounded function of time and does not diverge. This work suggests that the divergence of variance is not the cause of the need for warm-up of the Adam optimizer, contrary to what is believed in the current literature.
APA
Zhang, Z. & Liu, Z. (2021). On the distributional properties of adaptive gradients. Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 161:419-429. Available from https://proceedings.mlr.press/v161/zhang21a.html.