A Qualitative Study of the Dynamic Behavior for Adaptive Gradient Algorithms
Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference, PMLR 145:671-692, 2022.
Abstract
The dynamic behavior of the RMSprop and Adam algorithms is studied through a combination of careful numerical experiments and theoretical explanations. Three types of qualitative features are observed in the training loss curve: fast initial convergence, oscillations, and large spikes in the late phase. The sign gradient descent (signGD) flow, which is the limit of Adam when the learning rate is taken to 0 while the momentum parameters are kept fixed, is used to explain the fast initial convergence. For the late phase of Adam, three different types of qualitative patterns are observed depending on the choice of the hyper-parameters: oscillations, spikes, and divergence. In particular, Adam converges much more smoothly, and even faster, when the values of the two momentum factors are close to each other. This observation is particularly important for scientific computing tasks, for which the training process usually proceeds into the high-precision regime.
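As a brief sketch of the limit mentioned above (the notation $f$, $\theta$, $\beta_1$, $\beta_2$ is introduced here for illustration, not taken from the abstract): for a loss $f$ with parameters $\theta$, the signGD flow is the ordinary differential equation
\[
\dot{\theta}(t) = -\,\mathrm{sign}\big(\nabla f(\theta(t))\big),
\]
which is understood as the limiting dynamics of Adam as the learning rate tends to $0$ with the momentum parameters $\beta_1,\beta_2$ held fixed.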