Is Scaling Learned Optimizers Worth It? Evaluating The Value of VeLO’s 4000 TPU Months

Fady Rezk, Antreas Antoniou, Henry Gouk, Timothy Hospedales
Proceedings on "I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops, PMLR 239:65-83, 2023.

Abstract

We analyze VeLO (Versatile Learned Optimizer), the largest-scale attempt to date to train a general-purpose “foundational” optimizer. VeLO was trained on thousands of machine learning tasks over 4000 TPU-months with the goal of producing an optimizer that generalizes to new problems, is hyper-parameter free, and outperforms industry standards such as Adam. We independently evaluate VeLO on the MLCommons optimizer benchmark suite and find that, contrary to the initial claims: (1) VeLO has a critical hyper-parameter that requires problem-specific tuning, (2) VeLO does not necessarily outperform competitors in the quality of the solution found, and (3) VeLO is not faster than competing optimizers at reducing the training loss. These observations call into question VeLO’s generality and the value of the investment in training it.
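
To make the evaluation concrete, below is a minimal sketch of the kind of comparison described above: optimizers are scored by the final training loss they reach and by how many steps they need to hit a target loss. The harness, workload, and hyper-parameter values are illustrative assumptions, not the paper's actual MLCommons benchmark code. Adam and SGD stand in for the baselines; a VeLO-style learned optimizer (e.g. from google/learned_optimization) would be plugged into the same loop, with its required total-step-count input swept like the hyper-parameter discussed in finding (1).

    # Minimal sketch of an optimizer comparison harness (illustrative only;
    # not the paper's MLCommons benchmark code). Assumes JAX + optax.
    import jax
    import jax.numpy as jnp
    import optax

    def loss_fn(params, x, y):
        # A simple least-squares regression stands in for a benchmark workload.
        pred = x @ params
        return jnp.mean((pred - y) ** 2)

    def run(optimizer, num_steps, key):
        # Synthetic data for the toy workload.
        x = jax.random.normal(key, (256, 32))
        true_w = jax.random.normal(jax.random.fold_in(key, 1), (32,))
        y = x @ true_w
        params = jnp.zeros(32)
        state = optimizer.init(params)

        @jax.jit
        def step(params, state):
            loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
            updates, state = optimizer.update(grads, state, params)
            return optax.apply_updates(params, updates), state, loss

        losses = []
        for _ in range(num_steps):
            params, state, loss = step(params, state)
            losses.append(float(loss))
        return losses

    key = jax.random.PRNGKey(0)
    # A VeLO-style learned optimizer would be added to this list; its
    # "total training steps" argument would be swept like a hyper-parameter.
    for name, opt in [("adam", optax.adam(1e-2)), ("sgd", optax.sgd(1e-2))]:
        losses = run(opt, num_steps=500, key=key)
        to_target = next((i for i, l in enumerate(losses) if l < 1e-3), None)
        print(f"{name}: final loss {losses[-1]:.2e}, steps to 1e-3: {to_target}")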

Cite this Paper


BibTeX
@InProceedings{pmlr-v239-rezk23a,
  title     = {Is Scaling Learned Optimizers Worth It? Evaluating The Value of VeLO’s 4000 TPU Months},
  author    = {Rezk, Fady and Antoniou, Antreas and Gouk, Henry and Hospedales, Timothy},
  booktitle = {Proceedings on "I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops},
  pages     = {65--83},
  year      = {2023},
  editor    = {Antorán, Javier and Blaas, Arno and Buchanan, Kelly and Feng, Fan and Fortuin, Vincent and Ghalebikesabi, Sahra and Kriegler, Andreas and Mason, Ian and Rohde, David and Ruiz, Francisco J. R. and Uelwer, Tobias and Xie, Yubin and Yang, Rui},
  volume    = {239},
  series    = {Proceedings of Machine Learning Research},
  month     = {16 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v239/rezk23a/rezk23a.pdf},
  url       = {https://proceedings.mlr.press/v239/rezk23a.html},
  abstract  = {We analyze VeLO (Versatile Learned Optimizer), the largest-scale attempt to date to train a general-purpose “foundational” optimizer. VeLO was trained on thousands of machine learning tasks over 4000 TPU-months with the goal of producing an optimizer that generalizes to new problems, is hyper-parameter free, and outperforms industry standards such as Adam. We independently evaluate VeLO on the MLCommons optimizer benchmark suite and find that, contrary to the initial claims: (1) VeLO has a critical hyper-parameter that requires problem-specific tuning, (2) VeLO does not necessarily outperform competitors in the quality of the solution found, and (3) VeLO is not faster than competing optimizers at reducing the training loss. These observations call into question VeLO’s generality and the value of the investment in training it.}
}
Endnote
%0 Conference Paper
%T Is Scaling Learned Optimizers Worth It? Evaluating The Value of VeLO’s 4000 TPU Months
%A Fady Rezk
%A Antreas Antoniou
%A Henry Gouk
%A Timothy Hospedales
%B Proceedings on "I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops
%C Proceedings of Machine Learning Research
%D 2023
%E Javier Antorán
%E Arno Blaas
%E Kelly Buchanan
%E Fan Feng
%E Vincent Fortuin
%E Sahra Ghalebikesabi
%E Andreas Kriegler
%E Ian Mason
%E David Rohde
%E Francisco J. R. Ruiz
%E Tobias Uelwer
%E Yubin Xie
%E Rui Yang
%F pmlr-v239-rezk23a
%I PMLR
%P 65--83
%U https://proceedings.mlr.press/v239/rezk23a.html
%V 239
%X We analyze VeLO (Versatile Learned Optimizer), the largest-scale attempt to date to train a general-purpose “foundational” optimizer. VeLO was trained on thousands of machine learning tasks over 4000 TPU-months with the goal of producing an optimizer that generalizes to new problems, is hyper-parameter free, and outperforms industry standards such as Adam. We independently evaluate VeLO on the MLCommons optimizer benchmark suite and find that, contrary to the initial claims: (1) VeLO has a critical hyper-parameter that requires problem-specific tuning, (2) VeLO does not necessarily outperform competitors in the quality of the solution found, and (3) VeLO is not faster than competing optimizers at reducing the training loss. These observations call into question VeLO’s generality and the value of the investment in training it.
APA
Rezk, F., Antoniou, A., Gouk, H. & Hospedales, T. (2023). Is Scaling Learned Optimizers Worth It? Evaluating The Value of VeLO’s 4000 TPU Months. Proceedings on "I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops, in Proceedings of Machine Learning Research 239:65-83. Available from https://proceedings.mlr.press/v239/rezk23a.html.
