How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?

Kuo-Wei Lai; Guanghui Wang; Molei Tao; Vidya Muthukumar

Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:4401-4477, 2026.

Abstract

Overparameterized ML models, including neural networks, typically induce underdetermined training objectives with multiple global minima. The implicit bias refers to the limiting global minimum that is attained by a common optimization algorithm, such as gradient descent (GD). In this paper, we characterize the implicit bias of GD for training a shallow ReLU model with the squared loss on high-dimensional random features. Prior work showed that the implicit bias does not exist in the worst case (Vardi and Shamir, 2021), or that it corresponds exactly to the minimum-$\ell_2$-norm interpolating solution under exactly orthogonal data (Boursier et al., 2022). Our work interpolates between these two extremes and shows that, for sufficiently high-dimensional random data, the implicit bias approximates the minimum-$\ell_2$-norm solution with high probability, with a gap of order $\Theta(\sqrt{n/\|\lambda\|_1})$, where $n$ is the number of training examples and $\lambda$ denotes the spectrum of the data covariance matrix. Our results are obtained through a novel primal-dual analysis that carefully tracks the evolution of predictions, data-span coefficients, and their interactions, and they show that the ReLU activation pattern quickly stabilizes with high probability over random data.
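
To make the two reference objects in the abstract concrete, the following is a minimal NumPy sketch, not the authors' experiment. It assumes Gaussian data with an isotropic covariance spectrum and a plain linear interpolation problem. The function names `min_l2_norm_interpolator` and `gap_scale`, and the sizes `n` and `d`, are hypothetical choices made for illustration. It computes the minimum-$\ell_2$-norm interpolating solution, evaluates the quoted rate $\sqrt{n/\|\lambda\|_1}$ with constants omitted, and measures how close the random data is to the exactly orthogonal case.

```python
import numpy as np


def min_l2_norm_interpolator(X, y):
    """Minimum-l2-norm solution w of X @ w = y, for X of shape (n, d) with n < d."""
    # The Moore-Penrose pseudoinverse selects the smallest-norm interpolator.
    return np.linalg.pinv(X) @ y


def gap_scale(n, spectrum):
    """The rate sqrt(n / ||lambda||_1) from the abstract, constants omitted."""
    return np.sqrt(n / np.sum(np.abs(spectrum)))


rng = np.random.default_rng(0)
n, d = 20, 5000                      # assumed sizes: few examples, many features
spectrum = np.ones(d)                # assumption: isotropic covariance, ||lambda||_1 = d
X = rng.standard_normal((n, d)) * np.sqrt(spectrum)
y = rng.standard_normal(n)

w = min_l2_norm_interpolator(X, y)
residual = np.max(np.abs(X @ w - y))  # near zero: w interpolates the training data

# Near-orthogonality check: off-diagonal Gram entries relative to the diagonal.
G = X @ X.T
off_diag_ratio = np.max(np.abs(G - np.diag(np.diag(G)))) / np.mean(np.diag(G))

print("interpolation residual:", residual)
print("relative off-diagonal Gram size:", off_diag_ratio)
print("sqrt(n / ||lambda||_1):", gap_scale(n, spectrum))
```

In this assumed isotropic setting $\|\lambda\|_1 = d$, so the rate reduces to $\sqrt{n/d}$, which shrinks as the dimension grows relative to the number of examples. The sketch does not train a ReLU model; it only illustrates the comparison point and the scale against which the paper measures the gap.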

Cite this Paper

BibTeX

@InProceedings{pmlr-v336-lai26a,
  title = {How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?},
  author = {Lai, Kuo-Wei and Wang, Guanghui and Tao, Molei and Muthukumar, Vidya},
  booktitle = {Proceedings of Thirty Ninth Conference on Learning Theory},
  pages = {4401--4477},
  year = {2026},
  editor = {Hanneke, Steve and Lattimore, Tor},
  volume = {336},
  series = {Proceedings of Machine Learning Research},
  month = {29 Jun--03 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v336/main/assets/lai26a/lai26a.pdf},
  url = {https://proceedings.mlr.press/v336/lai26a.html},
  abstract = {Overparameterized ML models, including neural networks, typically induce underdetermined training objectives with multiple global minima. The implicit bias refers to the limiting global minimum that is attained by a common optimization algorithm, such as gradient descent (GD). In this paper, we characterize the implicit bias of GD for training a shallow ReLU model with the squared loss on high-dimensional random features. Prior work (Vardi and Shamir, 2021) showed that the implicit bias does not exist in the worst-case, or corresponds exactly to the minimum-$\ell_2$-norm interpolating solution under exactly orthogonal data (Boursier et al., 2022). Our work interpolates between these two extremes and shows that, for sufficiently high-dimensional random data, the implicit bias approximates the minimum-$\ell_2$-norm solution with high probability with a gap on the order $\Theta(\sqrt{n/||\lambda||_1})$, where $n$ is the number of training examples and $\lambda$ denotes the spectrum of the data covariance matrix. Our results are obtained through a novel primal-dual analysis that carefully tracks the evolution of predictions, data-span coefficients, as well as their interactions, and show that the ReLU activation pattern quickly stabilizes with high probability over random data.}
}

Endnote

%0 Conference Paper
%T How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?
%A Kuo-Wei Lai
%A Guanghui Wang
%A Molei Tao
%A Vidya Muthukumar
%B Proceedings of Thirty Ninth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2026
%E Steve Hanneke
%E Tor Lattimore
%F pmlr-v336-lai26a
%I PMLR
%P 4401--4477
%U https://proceedings.mlr.press/v336/lai26a.html
%V 336
%X Overparameterized ML models, including neural networks, typically induce underdetermined training objectives with multiple global minima. The implicit bias refers to the limiting global minimum that is attained by a common optimization algorithm, such as gradient descent (GD). In this paper, we characterize the implicit bias of GD for training a shallow ReLU model with the squared loss on high-dimensional random features. Prior work (Vardi and Shamir, 2021) showed that the implicit bias does not exist in the worst-case, or corresponds exactly to the minimum-$\ell_2$-norm interpolating solution under exactly orthogonal data (Boursier et al., 2022). Our work interpolates between these two extremes and shows that, for sufficiently high-dimensional random data, the implicit bias approximates the minimum-$\ell_2$-norm solution with high probability with a gap on the order $\Theta(\sqrt{n/||\lambda||_1})$, where $n$ is the number of training examples and $\lambda$ denotes the spectrum of the data covariance matrix. Our results are obtained through a novel primal-dual analysis that carefully tracks the evolution of predictions, data-span coefficients, as well as their interactions, and show that the ReLU activation pattern quickly stabilizes with high probability over random data.

APA

Lai, K., Wang, G., Tao, M. & Muthukumar, V. (2026). How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression? Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:4401-4477. Available from https://proceedings.mlr.press/v336/lai26a.html.
