How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?

Kuo-Wei Lai, Guanghui Wang, Molei Tao, Vidya Muthukumar
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:4401-4477, 2026.

Abstract

Overparameterized ML models, including neural networks, typically induce underdetermined training objectives with multiple global minima. The implicit bias refers to the limiting global minimum that is attained by a common optimization algorithm, such as gradient descent (GD). In this paper, we characterize the implicit bias of GD for training a shallow ReLU model with the squared loss on high-dimensional random features. Prior work (Vardi and Shamir, 2021) showed that the implicit bias does not exist in the worst-case, or corresponds exactly to the minimum-$\ell_2$-norm interpolating solution under exactly orthogonal data (Boursier et al., 2022). Our work interpolates between these two extremes and shows that, for sufficiently high-dimensional random data, the implicit bias approximates the minimum-$\ell_2$-norm solution with high probability with a gap on the order $\Theta(\sqrt{n/||\lambda||_1})$, where $n$ is the number of training examples and $\lambda$ denotes the spectrum of the data covariance matrix. Our results are obtained through a novel primal-dual analysis that carefully tracks the evolution of predictions, data-span coefficients, as well as their interactions, and show that the ReLU activation pattern quickly stabilizes with high probability over random data.

Cite this Paper


BibTeX
@InProceedings{pmlr-v336-lai26a, title = {How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?}, author = {Lai, Kuo-Wei and Wang, Guanghui and Tao, Molei and Muthukumar, Vidya}, booktitle = {Proceedings of Thirty Ninth Conference on Learning Theory}, pages = {4401--4477}, year = {2026}, editor = {Hanneke, Steve and Lattimore, Tor}, volume = {336}, series = {Proceedings of Machine Learning Research}, month = {29 Jun--03 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v336/main/assets/lai26a/lai26a.pdf}, url = {https://proceedings.mlr.press/v336/lai26a.html}, abstract = {Overparameterized ML models, including neural networks, typically induce underdetermined training objectives with multiple global minima. The implicit bias refers to the limiting global minimum that is attained by a common optimization algorithm, such as gradient descent (GD). In this paper, we characterize the implicit bias of GD for training a shallow ReLU model with the squared loss on high-dimensional random features. Prior work (Vardi and Shamir, 2021) showed that the implicit bias does not exist in the worst-case, or corresponds exactly to the minimum-$\ell_2$-norm interpolating solution under exactly orthogonal data (Boursier et al., 2022). Our work interpolates between these two extremes and shows that, for sufficiently high-dimensional random data, the implicit bias approximates the minimum-$\ell_2$-norm solution with high probability with a gap on the order $\Theta(\sqrt{n/||\lambda||_1})$, where $n$ is the number of training examples and $\lambda$ denotes the spectrum of the data covariance matrix. Our results are obtained through a novel primal-dual analysis that carefully tracks the evolution of predictions, data-span coefficients, as well as their interactions, and show that the ReLU activation pattern quickly stabilizes with high probability over random data.} }
Endnote
%0 Conference Paper %T How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression? %A Kuo-Wei Lai %A Guanghui Wang %A Molei Tao %A Vidya Muthukumar %B Proceedings of Thirty Ninth Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2026 %E Steve Hanneke %E Tor Lattimore %F pmlr-v336-lai26a %I PMLR %P 4401--4477 %U https://proceedings.mlr.press/v336/lai26a.html %V 336 %X Overparameterized ML models, including neural networks, typically induce underdetermined training objectives with multiple global minima. The implicit bias refers to the limiting global minimum that is attained by a common optimization algorithm, such as gradient descent (GD). In this paper, we characterize the implicit bias of GD for training a shallow ReLU model with the squared loss on high-dimensional random features. Prior work (Vardi and Shamir, 2021) showed that the implicit bias does not exist in the worst-case, or corresponds exactly to the minimum-$\ell_2$-norm interpolating solution under exactly orthogonal data (Boursier et al., 2022). Our work interpolates between these two extremes and shows that, for sufficiently high-dimensional random data, the implicit bias approximates the minimum-$\ell_2$-norm solution with high probability with a gap on the order $\Theta(\sqrt{n/||\lambda||_1})$, where $n$ is the number of training examples and $\lambda$ denotes the spectrum of the data covariance matrix. Our results are obtained through a novel primal-dual analysis that carefully tracks the evolution of predictions, data-span coefficients, as well as their interactions, and show that the ReLU activation pattern quickly stabilizes with high probability over random data.
APA
Lai, K., Wang, G., Tao, M. & Muthukumar, V.. (2026). How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:4401-4477 Available from https://proceedings.mlr.press/v336/lai26a.html.

Related Material