On the Role of Transformer Feed-Forward Layers in Nonlinear In-Context Learning

Haoyuan Sun, Ali Jadbabaie, Navid Azizan
Proceedings of The 37th International Conference on Algorithmic Learning Theory, PMLR 313:1-3, 2026.

Abstract

Transformer-based models demonstrate a remarkable ability for *in-context learning* (ICL), where they can adapt to unseen tasks from a few prompt examples without parameter updates. Notably, recent research has provided insight into how the Transformer architecture can perform ICL, showing that the optimal *linear self-attention* (LSA) mechanism can implement one step of gradient descent for linear least-squares objectives when trained on random linear regression tasks. Building upon this understanding, we investigate ICL for *nonlinear* function classes. We first prove that LSA is inherently incapable of outperforming linear predictors on nonlinear tasks, thereby highlighting a hard expressivity barrier for attention-only models. To overcome this limitation, we analyze a Transformer block consisting of LSA and feed-forward layers inspired by *gated linear units* (GLUs), a standard component in modern Transformer architectures. We show that this block achieves nonlinear ICL by implementing one step of gradient descent on a polynomial kernel regression loss. Furthermore, our analysis reveals that the expressivity of a single such block is inherently limited by its dimensions. We then show that a deep Transformer can overcome this bottleneck by distributing the computation of richer kernel functions across multiple blocks, effectively performing block-coordinate descent in a high-dimensional feature space that a single block cannot represent. Our findings highlight that the feed-forward layers provide a crucial and scalable mechanism by which Transformers can express nonlinear representations for ICL.
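
As a point of reference for the linear case the abstract builds on, the following is a minimal numerical sketch (our illustration, not the paper's construction; names such as `eta`, `w_star`, and `x_q` are assumptions) of how one gradient-descent step on the in-context least-squares loss, started from zero, yields a prediction that can be written as a linear-attention-style weighted sum of the prompt labels:

```python
# Illustrative sketch (not the paper's construction): one gradient step on the
# in-context least-squares loss from w = 0 gives a predictor whose output on a
# query equals a linear-attention-style similarity-weighted sum of prompt labels.
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 5, 20, 0.1                 # input dimension, prompt length, step size

X = rng.standard_normal((n, d))        # prompt inputs x_1, ..., x_n
w_star = rng.standard_normal(d)        # task vector (unknown to the learner)
y = X @ w_star                         # prompt labels y_i = <w_star, x_i>
x_q = rng.standard_normal(d)           # query input

# One gradient step on L(w) = (1/2n) * sum_i (y_i - <w, x_i>)^2, starting at w_0 = 0.
grad_at_zero = -(X.T @ y) / n
w_1 = -eta * grad_at_zero              # w_1 = (eta / n) * sum_i y_i x_i

# Attention-style readout: weight each label y_i by the similarity <x_i, x_q>.
attn_pred = (eta / n) * np.sum(y * (X @ x_q))

assert np.allclose(w_1 @ x_q, attn_pred)
print(w_1 @ x_q, attn_pred)
```

Under the same illustrative assumptions, replacing the raw inputs with a gated feature map such as phi(x) = (W1 x) * (W2 x) (elementwise product) turns the inner product between prompt and query inputs into a restricted quadratic kernel; this is the flavor of nonlinearity the abstract attributes to the GLU-inspired feed-forward layers.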

Cite this Paper


BibTeX
@InProceedings{pmlr-v313-sun26a,
  title     = {On the Role of Transformer Feed-Forward Layers in Nonlinear In-Context Learning},
  author    = {Sun, Haoyuan and Jadbabaie, Ali and Azizan, Navid},
  booktitle = {Proceedings of The 37th International Conference on Algorithmic Learning Theory},
  pages     = {1--3},
  year      = {2026},
  editor    = {Telgarsky, Matus and Ullman, Jonathan},
  volume    = {313},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--26 Feb},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v313/main/assets/sun26a/sun26a.pdf},
  url       = {https://proceedings.mlr.press/v313/sun26a.html}
}
Endnote
%0 Conference Paper
%T On the Role of Transformer Feed-Forward Layers in Nonlinear In-Context Learning
%A Haoyuan Sun
%A Ali Jadbabaie
%A Navid Azizan
%B Proceedings of The 37th International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2026
%E Matus Telgarsky
%E Jonathan Ullman
%F pmlr-v313-sun26a
%I PMLR
%P 1--3
%U https://proceedings.mlr.press/v313/sun26a.html
%V 313
APA
Sun, H., Jadbabaie, A. & Azizan, N. (2026). On the Role of Transformer Feed-Forward Layers in Nonlinear In-Context Learning. Proceedings of The 37th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 313:1-3. Available from https://proceedings.mlr.press/v313/sun26a.html.
