Nonlinear transformers can perform inference-time feature learning

Naoki Nishikawa, Yujin Song, Kazusato Oko, Denny Wu, Taiji Suzuki
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:46554-46585, 2025.

Abstract

Pretrained transformers have demonstrated the ability to implement various algorithms at inference time without parameter updates. While theoretical works have established this capability through constructions and approximation guarantees, the optimization and statistical efficiency aspects remain understudied. In this work, we investigate how transformers learn features in-context, a key mechanism underlying their inference-time adaptivity. We focus on the in-context learning of single-index models $y=\sigma_*(\langle\boldsymbol{x},\boldsymbol{\beta}\rangle)$, which are low-dimensional nonlinear functions parameterized by a feature vector $\boldsymbol{\beta}$. We prove that transformers pretrained by gradient-based optimization can perform inference-time feature learning, i.e., extract information about the target features $\boldsymbol{\beta}$ solely from test prompts (despite $\boldsymbol{\beta}$ varying across different prompts), hence achieving an in-context statistical efficiency that surpasses that of any non-adaptive (fixed-basis) algorithm such as kernel methods. Moreover, we show that the inference-time sample complexity improves upon the Correlational Statistical Query (CSQ) lower bound, owing to nonlinear label transformations naturally induced by the Softmax self-attention mechanism.
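
To make the problem setting concrete, below is a minimal sketch of how a single in-context prompt for this task could be generated. The isotropic Gaussian inputs, the unit-sphere draw of $\boldsymbol{\beta}$, and the cubic Hermite link function are illustrative assumptions for this sketch, not the paper's exact data model.

import numpy as np

def sample_icl_prompt(d, n, sigma_star, rng):
    """Draw one in-context prompt for a single-index model y = sigma_*(<x, beta>).

    Each prompt uses its own feature vector beta, so any information about
    beta must be extracted from the prompt's labeled examples at inference time.
    """
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)          # unit-norm feature direction (assumed prior)
    X = rng.standard_normal((n, d))       # n isotropic Gaussian inputs (assumed distribution)
    y = sigma_star(X @ beta)              # nonlinear single-index labels
    x_query = rng.standard_normal(d)      # query input whose label is to be predicted
    return X, y, x_query, sigma_star(x_query @ beta)

# Illustrative link function: the degree-3 Hermite polynomial He_3(z) = z^3 - 3z.
rng = np.random.default_rng(0)
X, y, x_query, y_query = sample_icl_prompt(d=32, n=256, sigma_star=lambda z: z**3 - 3 * z, rng=rng)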

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-nishikawa25a,
  title     = {Nonlinear transformers can perform inference-time feature learning},
  author    = {Nishikawa, Naoki and Song, Yujin and Oko, Kazusato and Wu, Denny and Suzuki, Taiji},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {46554--46585},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/nishikawa25a/nishikawa25a.pdf},
  url       = {https://proceedings.mlr.press/v267/nishikawa25a.html},
  abstract  = {Pretrained transformers have demonstrated the ability to implement various algorithms at inference time without parameter updates. While theoretical works have established this capability through constructions and approximation guarantees, the optimization and statistical efficiency aspects remain understudied. In this work, we investigate how transformers learn features in-context, a key mechanism underlying their inference-time adaptivity. We focus on the in-context learning of single-index models $y=\sigma_*(\langle\boldsymbol{x},\boldsymbol{\beta}\rangle)$, which are low-dimensional nonlinear functions parameterized by a feature vector $\boldsymbol{\beta}$. We prove that transformers pretrained by gradient-based optimization can perform inference-time feature learning, i.e., extract information about the target features $\boldsymbol{\beta}$ solely from test prompts (despite $\boldsymbol{\beta}$ varying across different prompts), hence achieving an in-context statistical efficiency that surpasses that of any non-adaptive (fixed-basis) algorithm such as kernel methods. Moreover, we show that the inference-time sample complexity improves upon the Correlational Statistical Query (CSQ) lower bound, owing to nonlinear label transformations naturally induced by the Softmax self-attention mechanism.}
}
Endnote
%0 Conference Paper
%T Nonlinear transformers can perform inference-time feature learning
%A Naoki Nishikawa
%A Yujin Song
%A Kazusato Oko
%A Denny Wu
%A Taiji Suzuki
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-nishikawa25a
%I PMLR
%P 46554--46585
%U https://proceedings.mlr.press/v267/nishikawa25a.html
%V 267
%X Pretrained transformers have demonstrated the ability to implement various algorithms at inference time without parameter updates. While theoretical works have established this capability through constructions and approximation guarantees, the optimization and statistical efficiency aspects remain understudied. In this work, we investigate how transformers learn features in-context, a key mechanism underlying their inference-time adaptivity. We focus on the in-context learning of single-index models $y=\sigma_*(\langle\boldsymbol{x},\boldsymbol{\beta}\rangle)$, which are low-dimensional nonlinear functions parameterized by a feature vector $\boldsymbol{\beta}$. We prove that transformers pretrained by gradient-based optimization can perform inference-time feature learning, i.e., extract information about the target features $\boldsymbol{\beta}$ solely from test prompts (despite $\boldsymbol{\beta}$ varying across different prompts), hence achieving an in-context statistical efficiency that surpasses that of any non-adaptive (fixed-basis) algorithm such as kernel methods. Moreover, we show that the inference-time sample complexity improves upon the Correlational Statistical Query (CSQ) lower bound, owing to nonlinear label transformations naturally induced by the Softmax self-attention mechanism.
APA
Nishikawa, N., Song, Y., Oko, K., Wu, D. & Suzuki, T. (2025). Nonlinear transformers can perform inference-time feature learning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:46554-46585. Available from https://proceedings.mlr.press/v267/nishikawa25a.html.