Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$ Parametrization

Zixiang Chen, Greg Yang, Qingyue Zhao, Quanquan Gu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:9614-9641, 2025.

Abstract

Despite deep neural networks’ powerful representation learning capabilities, a theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches such as the neural tangent kernel (NTK) are limited because, under the NTK parametrization, features stay close to their initialization, leaving open the question of how features behave when they evolve substantially. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, these networks learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.
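For readers who want a concrete picture of the parametrization in question, below is a minimal NumPy sketch of one standard representative of $\mu$P from the tensor-program line of work, written in abc form: trainable weights $w^l$ are initialized with entry variance $n^{-1}$ ($b_l = 1/2$), the forward pass applies layer multipliers $n^{-a_l}$ with $a_1 = -1/2$, $a_l = 0$ for hidden layers, and $a_{L+1} = 1/2$, and SGD uses a width-independent learning rate ($c = 0$). The tanh activation, squared loss, single-example updates, and the identification of "$L$-layer" with $L$ hidden layers are illustrative assumptions, not details taken from this paper.

```python
# Minimal NumPy sketch of the Maximal Update parametrization (muP) in abc form,
# for an MLP with L hidden layers trained by SGD.  The exponents below
# (a_1 = -1/2, a_l = 0 for hidden layers, a_{L+1} = 1/2, b_l = 1/2, c = 0) are
# one standard representative of muP; the tanh activation, squared loss, and
# single-example updates are illustrative assumptions, not this paper's setup.
import numpy as np

rng = np.random.default_rng(0)
d_in, width, L, lr = 10, 1024, 3, 0.1        # input dim, width n, hidden layers, O(1) LR

# Trainable weights w^l, entries ~ N(0, n^{-2 b_l}) with b_l = 1/2.
Ws = [rng.normal(0.0, width ** -0.5, size=(width, d_in))]                     # w^1
Ws += [rng.normal(0.0, width ** -0.5, size=(width, width)) for _ in range(L - 1)]
Ws += [rng.normal(0.0, width ** -0.5, size=(1, width))]                       # w^{L+1}

# Layer multipliers n^{-a_l}: effective weights are n^{1/2} w^1, w^l, n^{-1/2} w^{L+1}.
mults = [width ** 0.5] + [1.0] * (L - 1) + [width ** -0.5]

def forward(x):
    """Forward pass; cache pre- and post-activations for backprop."""
    hs, xs = [], [x]
    for l, (W, m) in enumerate(zip(Ws, mults)):
        h = m * W @ xs[-1]
        hs.append(h)
        xs.append(h if l == L else np.tanh(h))   # no activation on the output layer
    return hs, xs

def sgd_step(x, y):
    """One SGD step on the squared loss for a single example (x, y)."""
    hs, xs = forward(x)
    grad = hs[-1] - y                            # dLoss / d(output)
    for l in reversed(range(L + 1)):
        gW = mults[l] * np.outer(grad, xs[l])    # gradient w.r.t. trainable w^l
        grad = mults[l] * Ws[l].T @ grad         # backprop to the layer input
        if l > 0:
            grad *= 1.0 - np.tanh(hs[l - 1]) ** 2
        Ws[l] -= lr * gW                         # width-independent LR (c = 0)

x, y = rng.normal(size=d_in), np.array([1.0])
for _ in range(20):
    sgd_step(x, y)
```

The defining property of $\mu$P, and the regime the paper's results concern, is that as the width $n \to \infty$ every layer's features move by $\Theta(1)$ during training, instead of staying frozen near initialization as in the NTK regime.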

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-chen25cd,
  title     = {Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$ Parametrization},
  author    = {Chen, Zixiang and Yang, Greg and Zhao, Qingyue and Gu, Quanquan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {9614--9641},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chen25cd/chen25cd.pdf},
  url       = {https://proceedings.mlr.press/v267/chen25cd.html},
  abstract  = {Despite deep neural networks’ powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.}
}
Endnote
%0 Conference Paper
%T Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$ Parametrization
%A Zixiang Chen
%A Greg Yang
%A Qingyue Zhao
%A Quanquan Gu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chen25cd
%I PMLR
%P 9614--9641
%U https://proceedings.mlr.press/v267/chen25cd.html
%V 267
%X Despite deep neural networks’ powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.
APA
Chen, Z., Yang, G., Zhao, Q. & Gu, Q. (2025). Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$ Parametrization. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:9614-9641. Available from https://proceedings.mlr.press/v267/chen25cd.html.

Related Material