In-Context Deep Learning via Transformer Models

Weimin Wu, Maojiang Su, Jerry Yao-Chieh Hu, Zhao Song, Han Liu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:67670-67718, 2025.

Abstract

We investigate the transformer’s capability for in-context learning (ICL) to simulate the training process of deep models. Our key contribution is a positive example of using a transformer to train a deep neural network by gradient descent in an implicit fashion via ICL. Specifically, we provide an explicit construction of a $(2N+4)L$-layer transformer capable of simulating $L$ gradient descent steps of an $N$-layer ReLU network through ICL. We also give theoretical guarantees that the approximation holds to within any given error and that the ICL gradient descent converges. Additionally, we extend our analysis to the more practical setting of Softmax-based transformers. We validate our findings on synthetic datasets for 3-layer, 4-layer, and 6-layer neural networks; the results show that ICL performance matches that of direct training.
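
For intuition about the depth formula, the minimal sketch below (written for this page, not the paper's construction) runs $L$ plain gradient descent steps on an $N$-layer ReLU network, i.e., the process the transformer is shown to simulate in context, and reports the $(2N+4)L$ depth of the simulating transformer. All function names and hyperparameters here are illustrative assumptions.

import numpy as np

# Hypothetical sketch (illustration only, not the paper's construction):
# run L gradient-descent steps on an N-layer ReLU network -- the process
# the transformer simulates in context -- and report the (2N+4)L depth
# of the simulating transformer.

def transformer_depth(N, L):
    """Depth of the simulating transformer: (2N+4)L."""
    return (2 * N + 4) * L

def gd_on_relu_network(X, y, dims, L, lr=1e-2, seed=0):
    """L gradient-descent steps on a ReLU network with layer sizes `dims`
    (squared loss); finite-difference gradients keep the sketch short."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(scale=0.1, size=(dims[i], dims[i + 1]))
          for i in range(len(dims) - 1)]

    def forward(Ws):
        h = X
        for W in Ws[:-1]:
            h = np.maximum(h @ W, 0.0)   # ReLU hidden layers
        return (h @ Ws[-1]).ravel()      # linear output layer

    def loss(Ws):
        return np.mean((forward(Ws) - y) ** 2)

    eps = 1e-5
    for _ in range(L):
        grads = []
        for W in Ws:
            g = np.zeros_like(W)
            for idx in np.ndindex(W.shape):      # central differences
                W[idx] += eps;  up = loss(Ws)
                W[idx] -= 2 * eps;  down = loss(Ws)
                W[idx] += eps
                g[idx] = (up - down) / (2 * eps)
            grads.append(g)
        Ws = [W - lr * g for W, g in zip(Ws, grads)]
    return loss(Ws)

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 4))
y = np.sin(X.sum(axis=1))
N, L = 3, 10                                 # 3-layer network, 10 GD steps
print("loss after GD:", gd_on_relu_network(X, y, dims=[4, 8, 8, 1], L=L))
print("simulating transformer depth (2N+4)L:", transformer_depth(N, L))  # 100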

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-wu25aa,
  title = {In-Context Deep Learning via Transformer Models},
  author = {Wu, Weimin and Su, Maojiang and Hu, Jerry Yao-Chieh and Song, Zhao and Liu, Han},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {67670--67718},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/wu25aa/wu25aa.pdf},
  url = {https://proceedings.mlr.press/v267/wu25aa.html},
  abstract = {We investigate the transformer’s capability for in-context learning (ICL) to simulate the training process of deep models. Our key contribution is providing a positive example of using a transformer to train a deep neural network by gradient descent in an implicit fashion via ICL. Specifically, we provide an explicit construction of a $(2N+4)L$-layer transformer capable of simulating $L$ gradient descent steps of an $N$-layer ReLU network through ICL. We also give the theoretical guarantees for the approximation within any given error and the convergence of the ICL gradient descent. Additionally, we extend our analysis to the more practical setting using Softmax-based transformers. We validate our findings on synthetic datasets for 3-layer, 4-layer, and 6-layer neural networks. The results show that ICL performance matches that of direct training.}
}
Endnote
%0 Conference Paper
%T In-Context Deep Learning via Transformer Models
%A Weimin Wu
%A Maojiang Su
%A Jerry Yao-Chieh Hu
%A Zhao Song
%A Han Liu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-wu25aa
%I PMLR
%P 67670--67718
%U https://proceedings.mlr.press/v267/wu25aa.html
%V 267
%X We investigate the transformer’s capability for in-context learning (ICL) to simulate the training process of deep models. Our key contribution is providing a positive example of using a transformer to train a deep neural network by gradient descent in an implicit fashion via ICL. Specifically, we provide an explicit construction of a $(2N+4)L$-layer transformer capable of simulating $L$ gradient descent steps of an $N$-layer ReLU network through ICL. We also give the theoretical guarantees for the approximation within any given error and the convergence of the ICL gradient descent. Additionally, we extend our analysis to the more practical setting using Softmax-based transformers. We validate our findings on synthetic datasets for 3-layer, 4-layer, and 6-layer neural networks. The results show that ICL performance matches that of direct training.
APA
Wu, W., Su, M., Hu, J.Y., Song, Z. & Liu, H. (2025). In-Context Deep Learning via Transformer Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:67670-67718. Available from https://proceedings.mlr.press/v267/wu25aa.html.
