Scaling Vision Transformers to 22 Billion Parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme Ruiz, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd Van Steenkiste, Gamaleldin Fathy Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Collier, Alexey A. Gritsenko, Vighnesh Birodkar, Cristina Nader Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetic, Dustin Tran, Thomas Kipf, Mario Lucic, Xiaohua Zhai, Daniel Keysers, Jeremiah J. Harmsen, Neil Houlsby
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:7480-7512, 2023.

Abstract

The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
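As a rough illustration of the frozen-feature evaluation mentioned in the abstract (a lightweight linear model trained on top of a frozen backbone), the sketch below trains only a linear probe on precomputed features. It is a minimal sketch, not the paper's code: the 1024-dimensional features, the 10-class labels, the random stand-in data, and the scikit-learn LogisticRegression probe are all illustrative assumptions.

# Hedged sketch of linear probing on frozen features: the backbone is never
# updated; only a lightweight linear classifier is fit on its embeddings.
# Feature dimension, class count, and data are placeholders, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for frozen ViT embeddings of labelled train/test images.
train_feats = rng.normal(size=(1000, 1024))
train_labels = rng.integers(0, 10, size=1000)
test_feats = rng.normal(size=(200, 1024))
test_labels = rng.integers(0, 10, size=200)

# Only the linear head is trained; the (hypothetical) feature extractor stays frozen.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_feats, train_labels)
print("linear-probe accuracy:", probe.score(test_feats, test_labels))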

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-dehghani23a,
  title     = {Scaling Vision Transformers to 22 Billion Parameters},
  author    = {Dehghani, Mostafa and Djolonga, Josip and Mustafa, Basil and Padlewski, Piotr and Heek, Jonathan and Gilmer, Justin and Steiner, Andreas Peter and Caron, Mathilde and Geirhos, Robert and Alabdulmohsin, Ibrahim and Jenatton, Rodolphe and Beyer, Lucas and Tschannen, Michael and Arnab, Anurag and Wang, Xiao and Riquelme Ruiz, Carlos and Minderer, Matthias and Puigcerver, Joan and Evci, Utku and Kumar, Manoj and Steenkiste, Sjoerd Van and Elsayed, Gamaleldin Fathy and Mahendran, Aravindh and Yu, Fisher and Oliver, Avital and Huot, Fantine and Bastings, Jasmijn and Collier, Mark and Gritsenko, Alexey A. and Birodkar, Vighnesh and Vasconcelos, Cristina Nader and Tay, Yi and Mensink, Thomas and Kolesnikov, Alexander and Pavetic, Filip and Tran, Dustin and Kipf, Thomas and Lucic, Mario and Zhai, Xiaohua and Keysers, Daniel and Harmsen, Jeremiah J. and Houlsby, Neil},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {7480--7512},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/dehghani23a/dehghani23a.pdf},
  url       = {https://proceedings.mlr.press/v202/dehghani23a.html}
}
Endnote
%0 Conference Paper
%T Scaling Vision Transformers to 22 Billion Parameters
%A Mostafa Dehghani
%A Josip Djolonga
%A Basil Mustafa
%A Piotr Padlewski
%A Jonathan Heek
%A Justin Gilmer
%A Andreas Peter Steiner
%A Mathilde Caron
%A Robert Geirhos
%A Ibrahim Alabdulmohsin
%A Rodolphe Jenatton
%A Lucas Beyer
%A Michael Tschannen
%A Anurag Arnab
%A Xiao Wang
%A Carlos Riquelme Ruiz
%A Matthias Minderer
%A Joan Puigcerver
%A Utku Evci
%A Manoj Kumar
%A Sjoerd Van Steenkiste
%A Gamaleldin Fathy Elsayed
%A Aravindh Mahendran
%A Fisher Yu
%A Avital Oliver
%A Fantine Huot
%A Jasmijn Bastings
%A Mark Collier
%A Alexey A. Gritsenko
%A Vighnesh Birodkar
%A Cristina Nader Vasconcelos
%A Yi Tay
%A Thomas Mensink
%A Alexander Kolesnikov
%A Filip Pavetic
%A Dustin Tran
%A Thomas Kipf
%A Mario Lucic
%A Xiaohua Zhai
%A Daniel Keysers
%A Jeremiah J. Harmsen
%A Neil Houlsby
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-dehghani23a
%I PMLR
%P 7480--7512
%U https://proceedings.mlr.press/v202/dehghani23a.html
%V 202
APA
Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A.P., Caron, M., Geirhos, R., Alabdulmohsin, I., Jenatton, R., Beyer, L., Tschannen, M., Arnab, A., Wang, X., Riquelme Ruiz, C., Minderer, M., Puigcerver, J., Evci, U., Kumar, M., Steenkiste, S.V., Elsayed, G.F., Mahendran, A., Yu, F., Oliver, A., Huot, F., Bastings, J., Collier, M., Gritsenko, A.A., Birodkar, V., Vasconcelos, C.N., Tay, Y., Mensink, T., Kolesnikov, A., Pavetic, F., Tran, D., Kipf, T., Lucic, M., Zhai, X., Keysers, D., Harmsen, J.J. & Houlsby, N. (2023). Scaling Vision Transformers to 22 Billion Parameters. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:7480-7512. Available from https://proceedings.mlr.press/v202/dehghani23a.html.
