On the Maximum Hessian Eigenvalue and Generalization

Simran Kaur, Jeremy Cohen, Zachary Chase Lipton
Proceedings on "I Can't Believe It's Not Better! - Understanding Deep Learning Through Empirical Falsification" at NeurIPS 2022 Workshops, PMLR 187:51-65, 2023.

Abstract

The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a mystery. Prior works have speculated that "flatter" solutions generalize better to unseen data than "sharper" solutions, motivating several metrics for measuring flatness (particularly $\lambda_{\rm max}$, the largest eigenvalue of the Hessian of the loss) and algorithms, such as Sharpness-Aware Minimization (SAM), that directly optimize for flatness. Other works question the link between $\lambda_{\rm max}$ and generalization. In this paper, we present findings that call $\lambda_{\rm max}$'s influence on generalization further into question. We show that: (1) while larger learning rates reduce $\lambda_{\rm max}$ for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change $\lambda_{\rm max}$ without affecting generalization; (3) while SAM produces smaller $\lambda_{\rm max}$ for all batch sizes, generalization benefits (also) vanish with larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization, even as they promote smaller $\lambda_{\rm max}$; and (5) while batch normalization does not consistently produce smaller $\lambda_{\rm max}$, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the GD-SGD discrepancy demonstrates limits to $\lambda_{\rm max}$'s ability to explain generalization in neural networks.
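As context for the quantity studied throughout the abstract, $\lambda_{\rm max}$ is the largest eigenvalue of the Hessian of the training loss with respect to the network parameters, and it is typically estimated matrix-free, e.g., by power iteration on Hessian-vector products. The sketch below illustrates that standard estimation technique; it is not the authors' code, and `model`, `loss_fn`, `inputs`, `targets`, and the helper name `estimate_lambda_max` are placeholders chosen here for illustration.

```python
import torch
import torch.nn as nn

def estimate_lambda_max(model, loss_fn, inputs, targets, n_iters=20, tol=1e-4):
    """Estimate lambda_max, the largest Hessian eigenvalue of the loss at the
    current parameters, via power iteration on Hessian-vector products."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(inputs), targets)
    # Keep the graph so the gradient can be differentiated again (for HVPs).
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit starting vector, stored as one tensor per parameter.
    v = [torch.randn_like(p) for p in params]
    v_norm = torch.sqrt(sum((x ** 2).sum() for x in v))
    v = [x / v_norm for x in v]

    eig = 0.0
    for _ in range(n_iters):
        # Hessian-vector product: differentiate <grads, v> w.r.t. the parameters.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        # Rayleigh quotient v^T H v (v has unit norm) is the current eigenvalue estimate.
        new_eig = sum((h * x).sum() for h, x in zip(hv, v)).item()
        hv_norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / hv_norm for h in hv]
        if abs(new_eig - eig) < tol * max(abs(eig), 1e-6):
            break
        eig = new_eig
    # Power iteration converges to the eigenvalue of largest magnitude, which for
    # trained networks is typically the largest positive eigenvalue.
    return eig

# Hypothetical usage on a small model and a single batch:
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
print(estimate_lambda_max(model, nn.CrossEntropyLoss(), x, y))
```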

Cite this Paper


BibTeX
@InProceedings{pmlr-v187-kaur23a,
  title     = {On the Maximum Hessian Eigenvalue and Generalization},
  author    = {Kaur, Simran and Cohen, Jeremy and Lipton, Zachary Chase},
  booktitle = {Proceedings on "I Can't Believe It's Not Better! - Understanding Deep Learning Through Empirical Falsification" at NeurIPS 2022 Workshops},
  pages     = {51--65},
  year      = {2023},
  editor    = {Antorán, Javier and Blaas, Arno and Feng, Fan and Ghalebikesabi, Sahra and Mason, Ian and Pradier, Melanie F. and Rohde, David and Ruiz, Francisco J. R. and Schein, Aaron},
  volume    = {187},
  series    = {Proceedings of Machine Learning Research},
  month     = {03 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v187/kaur23a/kaur23a.pdf},
  url       = {https://proceedings.mlr.press/v187/kaur23a.html}
}
Endnote
%0 Conference Paper
%T On the Maximum Hessian Eigenvalue and Generalization
%A Simran Kaur
%A Jeremy Cohen
%A Zachary Chase Lipton
%B Proceedings on "I Can't Believe It's Not Better! - Understanding Deep Learning Through Empirical Falsification" at NeurIPS 2022 Workshops
%C Proceedings of Machine Learning Research
%D 2023
%E Javier Antorán
%E Arno Blaas
%E Fan Feng
%E Sahra Ghalebikesabi
%E Ian Mason
%E Melanie F. Pradier
%E David Rohde
%E Francisco J. R. Ruiz
%E Aaron Schein
%F pmlr-v187-kaur23a
%I PMLR
%P 51--65
%U https://proceedings.mlr.press/v187/kaur23a.html
%V 187
APA
Kaur, S., Cohen, J. & Lipton, Z.C. (2023). On the Maximum Hessian Eigenvalue and Generalization. Proceedings on "I Can't Believe It's Not Better! - Understanding Deep Learning Through Empirical Falsification" at NeurIPS 2022 Workshops, in Proceedings of Machine Learning Research 187:51-65. Available from https://proceedings.mlr.press/v187/kaur23a.html.