Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias

Ryo Karakida, Tomoumi Takase, Tomohiro Hayase, Kazuki Osawa
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:15809-15827, 2023.

Abstract

Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training. While some studies have reported that GR can improve generalization performance, little attention has been paid to it from the algorithmic perspective, that is, to algorithms for GR that improve performance efficiently. In this study, we first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost of GR. Next, we show that this finite-difference computation also works better in terms of generalization performance. We theoretically analyze a solvable model, a diagonal linear network, and clarify that GR has a desirable implicit bias toward the so-called rich regime and that the finite-difference computation strengthens this bias. Furthermore, finite-difference GR is closely related to other algorithms that explore flat minima through iterative ascent and descent steps. In particular, we reveal that the flooding method can perform finite-difference GR in an implicit way. Thus, this work broadens our understanding of GR in both practice and theory.
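
To make the algorithmic idea concrete, the following is a minimal sketch (not the authors' implementation) of a finite-difference GR update on a toy NumPy loss. It assumes the regularized objective L(θ) + (γ/2)‖∇L(θ)‖² and approximates the Hessian-vector product appearing in its gradient with a single gradient-ascent perturbation of size ε followed by an ordinary gradient evaluation; the toy loss, the function name `fd_gr_update`, and all hyperparameter values are illustrative.

```python
# Minimal sketch of finite-difference gradient regularization (GR) on a toy loss.
# Assumed objective: L(theta) + (gamma/2) * ||grad L(theta)||^2.
# Its gradient contains the Hessian-vector product H(theta) @ grad L(theta),
# which is approximated below by a finite difference built from one
# gradient-ascent step of size eps and a second gradient evaluation.

import numpy as np

def loss(theta):
    # Toy nonconvex loss standing in for a training loss.
    return 0.5 * np.sum(theta ** 2) + 0.1 * np.sum(np.sin(3.0 * theta))

def grad(theta):
    # Analytic gradient of the toy loss.
    return theta + 0.3 * np.cos(3.0 * theta)

def fd_gr_update(theta, lr=0.1, gamma=0.01, eps=1e-2):
    """One finite-difference GR step (illustrative)."""
    g = grad(theta)                            # gradient at the current point
    g_ascent = grad(theta + eps * g)           # gradient after a small ascent step
    gr_term = gamma * (g_ascent - g) / eps     # finite-difference approximation of gamma * H @ g
    return theta - lr * (g + gr_term)          # descent step on the regularized objective

theta = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    theta = fd_gr_update(theta)
print("final loss:", loss(theta), "grad norm:", np.linalg.norm(grad(theta)))
```

In this sketch the regularizer drives the iterate toward regions where the gradient norm, and hence the finite-difference term, is small; the specific ascent-step scaling and hyperparameters used in the paper may differ from those chosen here.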

Cite this Paper

BibTeX
@InProceedings{pmlr-v202-karakida23a,
  title     = {Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias},
  author    = {Karakida, Ryo and Takase, Tomoumi and Hayase, Tomohiro and Osawa, Kazuki},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {15809--15827},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/karakida23a/karakida23a.pdf},
  url       = {https://proceedings.mlr.press/v202/karakida23a.html}
}
Endnote
%0 Conference Paper
%T Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias
%A Ryo Karakida
%A Tomoumi Takase
%A Tomohiro Hayase
%A Kazuki Osawa
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-karakida23a
%I PMLR
%P 15809--15827
%U https://proceedings.mlr.press/v202/karakida23a.html
%V 202
APA
Karakida, R., Takase, T., Hayase, T. & Osawa, K. (2023). Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:15809-15827. Available from https://proceedings.mlr.press/v202/karakida23a.html.
