Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Feature Learning by Learnable Channel Attention

Yingzhen Yang
Proceedings of The 37th International Conference on Algorithmic Learning Theory, PMLR 313:1-48, 2026.

Abstract

We study the problem of learning a low-degree spherical polynomial of degree $\ell_0 = \Theta(1) \ge 1$, defined on the unit sphere in ${\mathbb R}^d$, by training an over-parameterized two-layer neural network (NN) with channel attention. Our main result is a significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\epsilon \in (0,1)$, a carefully designed two-layer NN with channel attention and finite width, trained by vanilla gradient descent (GD), requires a sample complexity of only $n = \Theta(d^{\ell_0}/\epsilon)$ with high probability, in contrast with the representative sample complexity $\Theta\big(d^{\ell_0} \max\{\epsilon^{-2},\log d\}\big)$, where $n$ is the training data size. Moreover, this sample complexity is not improvable, since the trained network achieves a sharp nonparametric regression risk of order $\Theta(d^{\ell_0}/n)$ with high probability. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank $\Theta(d^{\ell_0})$ is $\Theta(d^{\ell_0}/n)$, so the regression risk of the network trained by GD is minimax optimal. The training of the two-layer NN with channel attention is a two-stage process. In stage one, a novel and provable learnable channel selection algorithm, acting as a learnable harmonic-degree selection process, selects the ground-truth channel number $\ell_0$ of the target function from among the initial $L \ge \ell_0$ channels of the first-layer activation function with high probability. This learnable channel selection is performed by an efficient one-step GD update on both layers of the NN, which achieves feature learning for low-degree polynomials. In stage two, the second layer of the network is trained by standard GD using the activation function with the selected channels. To the best of our knowledge, this is the first time a minimax optimal risk bound has been obtained by training an over-parameterized but finite-width neural network with feature learning capability to learn low-degree spherical polynomials.
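
As a rough illustration of the two-stage procedure described in the abstract, the following PyTorch sketch trains a toy two-layer network whose first-layer activation is a learnable attention mixture over $L$ polynomial channels. Everything here is an assumption for illustration, not the paper's construction: the monomial channels $t^\ell$ stand in for degree-$\ell$ harmonic channels, the attention-magnitude threshold stands in for the paper's provable selection rule, and all names and hyperparameters are hypothetical.

# A minimal, hypothetical sketch of the two-stage procedure, NOT the paper's
# exact construction. Assumptions (not from the source): the first-layer
# activation is a learnable channel-attention mixture
#     sigma(t) = sum_{l=1}^{L} a_l * t**l,
# with monomial channels standing in for degree-l harmonic channels; stage-one
# selection keeps channels by attention magnitude after one GD step.
import torch

torch.manual_seed(0)
d, m, L, n = 50, 256, 6, 4096      # input dim, width, channels, sample size
ell0 = 2                           # degree of the (unknown) target polynomial

# Synthetic data on the unit sphere; toy degree-ell0 spherical target.
X = torch.randn(n, d)
X = X / X.norm(dim=1, keepdim=True)
u = torch.randn(d); u = u / u.norm()
y = (X @ u) ** ell0

W = torch.randn(m, d)              # first-layer directions, unit norm
W = W / W.norm(dim=1, keepdim=True)
v = torch.randn(m, requires_grad=True)             # second layer
a = torch.full((L,), 1.0 / L, requires_grad=True)  # channel attention

def channels(t):
    # Monomial channels t^l, l = 1..L; t lies in [-1, 1] since the rows
    # of X and W are unit vectors.
    return torch.stack([t ** l for l in range(1, L + 1)], dim=-1)

def net(X, a, v):
    t = X @ W.T                    # pre-activations, shape (n, m)
    return (channels(t) @ a) @ v / m

# Stage one: one GD step on both layers, then keep the channels whose
# attention weights are largest (a hypothetical stand-in for the paper's
# provable selection rule).
loss = ((net(X, a, v) - y) ** 2).mean()
ga, gv = torch.autograd.grad(loss, (a, v))
with torch.no_grad():
    a -= 1.0 * ga
    v -= 1.0 * gv
    keep = a.abs() >= 0.5 * a.abs().max()  # hypothetical threshold
    a *= keep                              # zero out unselected channels
a = a.detach()                             # activation is now fixed

# Stage two: standard GD on the second layer only.
opt = torch.optim.SGD([v], lr=0.1)
for _ in range(500):
    opt.zero_grad()
    loss = ((net(X, a, v) - y) ** 2).mean()
    loss.backward()
    opt.step()
print(f"final training risk: {loss.item():.4f}")

In this toy setting, the gradient of the loss with respect to $a_\ell$ correlates with the degree-$\ell$ component of the target, which is why a single GD step can carry the channel-selection signal before the second-layer regression begins; the paper's actual guarantee is, of course, proved for its own construction rather than this sketch.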

Cite this Paper


BibTeX
@InProceedings{pmlr-v313-yang26a,
  title     = {Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Feature Learning by Learnable Channel Attention},
  author    = {Yang, Yingzhen},
  booktitle = {Proceedings of The 37th International Conference on Algorithmic Learning Theory},
  pages     = {1--48},
  year      = {2026},
  editor    = {Telgarsky, Matus and Ullman, Jonathan},
  volume    = {313},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--26 Feb},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v313/main/assets/yang26a/yang26a.pdf},
  url       = {https://proceedings.mlr.press/v313/yang26a.html}
}
Endnote
%0 Conference Paper
%T Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Feature Learning by Learnable Channel Attention
%A Yingzhen Yang
%B Proceedings of The 37th International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2026
%E Matus Telgarsky
%E Jonathan Ullman
%F pmlr-v313-yang26a
%I PMLR
%P 1--48
%U https://proceedings.mlr.press/v313/yang26a.html
%V 313
APA
Yang, Y. (2026). Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Feature Learning by Learnable Channel Attention. Proceedings of The 37th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 313:1-48. Available from https://proceedings.mlr.press/v313/yang26a.html.
