Nonlinear transformers can perform inference-time feature learning
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:46554-46585, 2025.
Abstract
Pretrained transformers have demonstrated the ability to implement various algorithms at inference time without parameter updates. While theoretical works have established this capability through constructions and approximation guarantees, the optimization and statistical efficiency aspects remain understudied. In this work, we investigate how transformers learn features in-context, a key mechanism underlying their inference-time adaptivity. We focus on the in-context learning of single-index models $y=\sigma_*(\langle\boldsymbol{x},\boldsymbol{\beta}\rangle)$, which are low-dimensional nonlinear functions parameterized by the feature vector $\boldsymbol{\beta}$. We prove that transformers pretrained by gradient-based optimization can perform inference-time feature learning, i.e., extract information about the target feature vector $\boldsymbol{\beta}$ solely from test prompts (despite $\boldsymbol{\beta}$ varying across different prompts), hence achieving an in-context statistical efficiency that surpasses any non-adaptive (fixed-basis) algorithm such as kernel methods. Moreover, we show that the inference-time sample complexity improves upon the Correlational Statistical Query (CSQ) lower bound, owing to nonlinear label transformations naturally induced by the Softmax self-attention mechanism.
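To make the problem setup concrete, the following is a minimal sketch of how in-context prompts for single-index tasks can be generated: each prompt draws its own hidden direction $\boldsymbol{\beta}$, so the learner must infer it from the prompt alone. The Gaussian covariates, the quadratic stand-in for the link function $\sigma_*$, and all function names here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sample_icl_prompt(d, n, sigma_star=lambda z: z**2, rng=None):
    """Sample one in-context prompt for a single-index task (illustrative).

    Each prompt has its own feature vector beta (uniform on the unit sphere),
    n labeled examples (x_i, y_i) with y_i = sigma_*(<x_i, beta>), and a
    query input whose label the model must predict in-context.
    """
    rng = np.random.default_rng() if rng is None else rng
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)          # hidden feature direction, varies per prompt
    X = rng.standard_normal((n, d))       # isotropic Gaussian covariates (assumed)
    y = sigma_star(X @ beta)              # nonlinear single-index labels
    x_query = rng.standard_normal(d)      # query input appended to the prompt
    y_query = sigma_star(x_query @ beta)  # held-out target, used only for evaluation
    return X, y, x_query, y_query

# A pretraining batch: every prompt hides a different beta,
# so a fixed-basis (non-adaptive) predictor cannot exploit any single direction.
prompts = [sample_icl_prompt(d=16, n=64) for _ in range(8)]
```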