Fundamental Limits of Visual Autoregressive Transformers: Universal Approximation Abilities

Yifang Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:8453-8479, 2025.

Abstract

We investigate the fundamental limits of transformer-based foundation models, extending our analysis to Visual Autoregressive (VAR) transformers. VAR generates images through a novel, scalable, coarse-to-fine “next-scale prediction” framework and achieves state-of-the-art image synthesis quality, outperforming previous methods including Diffusion Transformers. Our primary contribution establishes that single-head VAR transformers with a single self-attention layer and a single interpolation layer are universal: from a statistical perspective, we prove that such simple VAR transformers are universal approximators for arbitrary word-to-image Lipschitz functions. Furthermore, we show that flow-based autoregressive transformers inherit similar approximation capabilities. Our results provide design principles for effective and computationally efficient VAR transformers, and they extend naturally to more sophisticated VAR models in image generation and related areas.
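
As an informal illustration of the universality claim above (our paraphrase in generic notation, not the paper's exact theorem statement), the result can be read as follows, where $\mathcal{X}$ denotes a compact space of text-prompt representations, $\mathcal{Y}$ the target image (token-map) space, and $\mathcal{T}_{\mathrm{VAR}}$ the class of single-head VAR transformers with one self-attention layer and one interpolation layer; the compactness assumption, norms, and symbols are illustrative choices rather than the paper's notation:

\[
\forall\, \varepsilon > 0,\ \forall\, f : \mathcal{X} \to \mathcal{Y}\ \text{$L$-Lipschitz},\qquad
\exists\, T \in \mathcal{T}_{\mathrm{VAR}}\ \text{such that}\ \sup_{x \in \mathcal{X}} \big\lVert T(x) - f(x) \big\rVert \le \varepsilon.
\]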

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-chen25ak,
  title     = {Fundamental Limits of Visual Autoregressive Transformers: Universal Approximation Abilities},
  author    = {Chen, Yifang and Li, Xiaoyu and Liang, Yingyu and Shi, Zhenmei and Song, Zhao},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {8453--8479},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chen25ak/chen25ak.pdf},
  url       = {https://proceedings.mlr.press/v267/chen25ak.html},
  abstract  = {We investigate the fundamental limits of transformer-based foundation models, extending our analysis to include Visual Autoregressive (VAR) transformers. VAR represents a big step toward generating images using a novel, scalable, coarse-to-fine “next-scale prediction” framework. These models set a new quality bar, outperforming all previous methods, including Diffusion Transformers, while having state-of-the-art performance for image synthesis tasks. Our primary contributions establish that, for single-head VAR transformers with a single self-attention layer and single interpolation layer, the VAR Transformer is universal. From the statistical perspective, we prove that such simple VAR transformers are universal approximators for any word-to-image Lipschitz functions. Furthermore, we demonstrate that flow-based autoregressive transformers inherit similar approximation capabilities. Our results provide important design principles for effective and computationally efficient VAR Transformer strategies that can be used to extend their utility to more sophisticated VAR models in image generation and other related areas.}
}
Endnote
%0 Conference Paper
%T Fundamental Limits of Visual Autoregressive Transformers: Universal Approximation Abilities
%A Yifang Chen
%A Xiaoyu Li
%A Yingyu Liang
%A Zhenmei Shi
%A Zhao Song
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chen25ak
%I PMLR
%P 8453--8479
%U https://proceedings.mlr.press/v267/chen25ak.html
%V 267
%X We investigate the fundamental limits of transformer-based foundation models, extending our analysis to include Visual Autoregressive (VAR) transformers. VAR represents a big step toward generating images using a novel, scalable, coarse-to-fine “next-scale prediction” framework. These models set a new quality bar, outperforming all previous methods, including Diffusion Transformers, while having state-of-the-art performance for image synthesis tasks. Our primary contributions establish that, for single-head VAR transformers with a single self-attention layer and single interpolation layer, the VAR Transformer is universal. From the statistical perspective, we prove that such simple VAR transformers are universal approximators for any word-to-image Lipschitz functions. Furthermore, we demonstrate that flow-based autoregressive transformers inherit similar approximation capabilities. Our results provide important design principles for effective and computationally efficient VAR Transformer strategies that can be used to extend their utility to more sophisticated VAR models in image generation and other related areas.
APA
Chen, Y., Li, X., Liang, Y., Shi, Z. & Song, Z. (2025). Fundamental Limits of Visual Autoregressive Transformers: Universal Approximation Abilities. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:8453-8479. Available from https://proceedings.mlr.press/v267/chen25ak.html.