Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States

Ziqiao Wang; Yongyi Mao

Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States

Ziqiao Wang, Yongyi Mao

Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, PMLR 244:3514-3539, 2024.

Abstract

Stochastic differential equations (SDEs) have been shown recently to characterize well the dynamics of training machine learning models with SGD. When the generalization error of the SDE approximation closely aligns with that of SGD in expectation, it provides two opportunities for understanding better the generalization behaviour of SGD through its SDE approximation. Firstly, viewing SGD as full-batch gradient descent with Gaussian gradient noise allows us to obtain trajectory-based generalization bound using the information-theoretic bound from Xu and Raginsky [2017]. Secondly, assuming mild conditions, we estimate the steady-state weight distribution of SDE and use information-theoretic bounds from Xu and Raginsky [2017] and Negrea et al. [2019] to establish terminal-state-based generalization bounds. Our proposed bounds have some advantages, notably the trajectory-based bound outperforms results in Wang and Mao [2022], and the terminal-state-based bound exhibits a fast decay rate comparable to stability-based bounds.

Cite this Paper

BibTeX


@InProceedings{pmlr-v244-wang24b,
  title = 	 {Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States},
  author =       {Wang, Ziqiao and Mao, Yongyi},
  booktitle = 	 {Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence},
  pages = 	 {3514--3539},
  year = 	 {2024},
  editor = 	 {Kiyavash, Negar and Mooij, Joris M.},
  volume = 	 {244},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {15--19 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v244/main/assets/wang24b/wang24b.pdf},
  url = 	 {https://proceedings.mlr.press/v244/wang24b.html},
  abstract = 	 {Stochastic differential equations (SDEs) have been shown recently to characterize well the dynamics of training machine learning models with SGD. When the generalization error of the SDE approximation closely aligns with that of SGD in expectation, it provides two opportunities for understanding better the generalization behaviour of SGD through its SDE approximation. Firstly, viewing SGD as full-batch gradient descent with Gaussian gradient noise allows us to obtain trajectory-based generalization bound using the information-theoretic bound from Xu and Raginsky [2017]. Secondly, assuming mild conditions, we estimate the steady-state weight distribution of SDE and use information-theoretic bounds from Xu and Raginsky [2017] and Negrea et al. [2019] to establish terminal-state-based generalization bounds. Our proposed bounds have some advantages, notably the trajectory-based bound outperforms results in Wang and Mao [2022], and the terminal-state-based bound exhibits a fast decay rate comparable to stability-based bounds.}
}

Endnote

%0 Conference Paper
%T Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States
%A Ziqiao Wang
%A Yongyi Mao
%B Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2024
%E Negar Kiyavash
%E Joris M. Mooij	
%F pmlr-v244-wang24b
%I PMLR
%P 3514--3539
%U https://proceedings.mlr.press/v244/wang24b.html
%V 244
%X Stochastic differential equations (SDEs) have been shown recently to characterize well the dynamics of training machine learning models with SGD. When the generalization error of the SDE approximation closely aligns with that of SGD in expectation, it provides two opportunities for understanding better the generalization behaviour of SGD through its SDE approximation. Firstly, viewing SGD as full-batch gradient descent with Gaussian gradient noise allows us to obtain trajectory-based generalization bound using the information-theoretic bound from Xu and Raginsky [2017]. Secondly, assuming mild conditions, we estimate the steady-state weight distribution of SDE and use information-theoretic bounds from Xu and Raginsky [2017] and Negrea et al. [2019] to establish terminal-state-based generalization bounds. Our proposed bounds have some advantages, notably the trajectory-based bound outperforms results in Wang and Mao [2022], and the terminal-state-based bound exhibits a fast decay rate comparable to stability-based bounds.

APA


Wang, Z. & Mao, Y.. (2024). Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States. Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 244:3514-3539 Available from https://proceedings.mlr.press/v244/wang24b.html.

Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States

Abstract

Cite this Paper

Related Material