Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations
Proceedings of Thirty Seventh Conference on Learning Theory, PMLR 247:4009-4081, 2024.
Abstract
We study the statistical and computational complexity of learning a target function $f_*:\R^d\to\R$ with \textit{additive structure}, that is, $f_*(x) = \frac{1}{\sqrt{M}}\sum_{m=1}^{M} f_m(\langle x, v_m\rangle)$, where $f_1, f_2, \dots, f_M:\R\to\R$ are nonlinear link functions of single-index models (ridge functions) with diverse and near-orthogonal index features $\{v_m\}_{m=1}^{M}$, and the number of additive tasks $M$ grows with the dimensionality, $M \asymp d^\gamma$ for $\gamma \ge 0$. This problem setting is motivated by the classical additive model literature, the recent representation learning theory of two-layer neural networks, and large-scale pretraining, where the model simultaneously acquires a large number of “skills” that are often \textit{localized} in distinct parts of the trained network. We prove that a large subset of polynomial $f_*$ can be efficiently learned by gradient descent training of a two-layer neural network, with polynomial statistical and computational complexity that depends on the number of tasks $M$ and the \textit{information exponent} of $f_m$, despite the link functions being unknown and $M$ growing with the dimensionality. We complement this learnability guarantee with a computational hardness result, establishing statistical query (SQ) lower bounds for both correlational SQ and full SQ algorithms.
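To make the target class concrete, the following is a minimal sketch (not from the paper) of the data-generating process described above: near-orthogonal index features $\{v_m\}$ obtained by orthonormalizing random directions, illustrative Hermite-polynomial link functions $f_m$ (with information exponents 2 and 3), and isotropic Gaussian inputs. The dimensions, sample size, and link choices are assumptions for illustration only, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, n = 128, 8, 1024          # dimension, number of additive tasks, sample size (illustrative)

# Near-orthogonal index features: orthonormalize M random Gaussian directions (columns v_1, ..., v_M).
V, _ = np.linalg.qr(rng.standard_normal((d, M)))

def he2(z):
    # Example nonlinear link: probabilists' Hermite He_2, information exponent 2.
    return z**2 - 1.0

def he3(z):
    # Example nonlinear link: probabilists' Hermite He_3, information exponent 3.
    return z**3 - 3.0 * z

links = [he2 if m % 2 == 0 else he3 for m in range(M)]

def f_star(X):
    """Additive target: sum of single-index (ridge) components, scaled by 1/sqrt(M)."""
    Z = X @ V                                                   # projections <x, v_m>, shape (n, M)
    comps = np.stack([links[m](Z[:, m]) for m in range(M)], axis=1)
    return comps.sum(axis=1) / np.sqrt(M)

X = rng.standard_normal((n, d))                                 # isotropic Gaussian inputs
y = f_star(X)                                                   # labels from the additive target
print(y.shape, float(y.mean()), float(y.std()))
```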