Understanding Gradient Clipping In Incremental Gradient Methods

Jiang Qian, Yuren Wu, Bojin Zhuang, Shaojun Wang, Jing Xiao
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:1504-1512, 2021.

Abstract

We provide a theoretical analysis of how gradient clipping affects the convergence of incremental gradient methods for minimizing an objective function that is the sum of a large number of component functions. We show that clipping the gradients of the component functions biases the descent direction, and that this bias depends on the clipping threshold, the norms of the component gradients, and the angles between the component gradients and the full gradient. We then propose sufficient conditions under which incremental gradient methods with gradient clipping converge under the more general relaxed smoothness assumption. We also empirically observe that the angles between the component gradients and the full gradient generally decrease as the batch size increases, which may help explain why larger batch sizes generally lead to faster convergence when training deep neural networks with gradient clipping.
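To make the clipping scheme analyzed here concrete, below is a minimal sketch of an incremental gradient loop that clips each component gradient before applying it. The quadratic components, threshold c, step size, and function names are illustrative assumptions, not the paper's algorithm or experimental setup.

```python
import numpy as np

def clip(g, c):
    # Rescale a component gradient so its norm is at most the threshold c.
    norm = np.linalg.norm(g)
    return g if norm <= c else (c / norm) * g

def incremental_gradient_clipped(grads, x0, lr=0.1, c=1.0, epochs=100):
    # Cycle through the component gradients, clipping each one before the
    # update; `grads` is a list of callables g_i(x) for the components f_i.
    x = x0
    for _ in range(epochs):
        for g_i in grads:
            x = x - lr * clip(g_i(x), c)
    return x

# Toy objective f(x) = sum_i 0.5 * (x - a_i)^2 with component gradients x - a_i.
# The full-gradient minimizer is the mean of a (4.0); with a small threshold the
# outlying component gradient (x - 10) is clipped far more often than the others,
# biasing the descent direction relative to the full gradient.
a = [0.0, 2.0, 10.0]
grads = [lambda x, ai=ai: x - ai for ai in a]
print(incremental_gradient_clipped(grads, x0=np.array(5.0)))
```

In this toy run the iterates settle well below the true minimizer, illustrating how the bias grows with the mismatch between the clipping threshold and the component gradient norms.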

Cite this Paper


BibTeX
@InProceedings{pmlr-v130-qian21a,
  title     = {Understanding Gradient Clipping In Incremental Gradient Methods},
  author    = {Qian, Jiang and Wu, Yuren and Zhuang, Bojin and Wang, Shaojun and Xiao, Jing},
  booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
  pages     = {1504--1512},
  year      = {2021},
  editor    = {Banerjee, Arindam and Fukumizu, Kenji},
  volume    = {130},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--15 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v130/qian21a/qian21a.pdf},
  url       = {https://proceedings.mlr.press/v130/qian21a.html}
}
APA
Qian, J., Wu, Y., Zhuang, B., Wang, S. & Xiao, J. (2021). Understanding Gradient Clipping In Incremental Gradient Methods. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:1504-1512. Available from https://proceedings.mlr.press/v130/qian21a.html.
