Learned Optimizers that Scale and Generalize

Olga Wichrowska; Niru Maheswaranathan; Matthew W. Hoffman; Sergio Gómez Colmenarejo; Misha Denil; Nando de Freitas; Jascha Sohl-Dickstein

Proceedings of the 34th International Conference on Machine Learning, PMLR 70:3751-3760, 2017.

Abstract

Learning to learn has emerged as an important direction for achieving artificial intelligence. Two of the primary barriers to its adoption are an inability to scale to larger problems and a limited ability to generalize to new tasks. We introduce a learned gradient descent optimizer that generalizes well to new tasks, and which has significantly reduced memory and computation overhead. We achieve this by introducing a novel hierarchical RNN architecture, with minimal per-parameter overhead, augmented with additional architectural features that mirror the known structure of optimization tasks. We also develop a meta-training ensemble of small, diverse, optimization tasks capturing common properties of loss landscapes. The optimizer learns to outperform RMSProp/ADAM on problems in this corpus. More importantly, it performs comparably or better when applied to small convolutional neural networks, despite seeing no neural networks in its meta-training set. Finally, it generalizes to train Inception V3 and ResNet V2 architectures on the ImageNet dataset for thousands of steps, optimization problems that are of a vastly different scale than those it was trained on.

Cite this Paper

BibTeX


@InProceedings{pmlr-v70-wichrowska17a,
  title = 	 {Learned Optimizers that Scale and Generalize},
  author =       {Olga Wichrowska and Niru Maheswaranathan and Matthew W. Hoffman and Sergio G{\'o}mez Colmenarejo and Misha Denil and Nando de Freitas and Jascha Sohl-Dickstein},
  booktitle = 	 {Proceedings of the 34th International Conference on Machine Learning},
  pages = 	 {3751--3760},
  year = 	 {2017},
  editor = 	 {Precup, Doina and Teh, Yee Whye},
  volume = 	 {70},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {06--11 Aug},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v70/wichrowska17a/wichrowska17a.pdf},
  url = 	 {https://proceedings.mlr.press/v70/wichrowska17a.html},
  abstract = 	 {Learning to learn has emerged as an important direction for achieving artificial intelligence. Two of the primary barriers to its adoption are an inability to scale to larger problems and a limited ability to generalize to new tasks. We introduce a learned gradient descent optimizer that generalizes well to new tasks, and which has significantly reduced memory and computation overhead. We achieve this by introducing a novel hierarchical RNN architecture, with minimal per-parameter overhead, augmented with additional architectural features that mirror the known structure of optimization tasks. We also develop a meta-training ensemble of small, diverse, optimization tasks capturing common properties of loss landscapes. The optimizer learns to outperform RMSProp/ADAM on problems in this corpus. More importantly, it performs comparably or better when applied to small convolutional neural networks, despite seeing no neural networks in its meta-training set. Finally, it generalizes to train Inception V3 and ResNet V2 architectures on the ImageNet dataset for thousands of steps, optimization problems that are of a vastly different scale than those it was trained on.}
}

Endnote

%0 Conference Paper
%T Learned Optimizers that Scale and Generalize
%A Olga Wichrowska
%A Niru Maheswaranathan
%A Matthew W. Hoffman
%A Sergio Gómez Colmenarejo
%A Misha Denil
%A Nando de Freitas
%A Jascha Sohl-Dickstein
%B Proceedings of the 34th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Doina Precup
%E Yee Whye Teh
%F pmlr-v70-wichrowska17a
%I PMLR
%P 3751--3760
%U https://proceedings.mlr.press/v70/wichrowska17a.html
%V 70
%X Learning to learn has emerged as an important direction for achieving artificial intelligence. Two of the primary barriers to its adoption are an inability to scale to larger problems and a limited ability to generalize to new tasks. We introduce a learned gradient descent optimizer that generalizes well to new tasks, and which has significantly reduced memory and computation overhead. We achieve this by introducing a novel hierarchical RNN architecture, with minimal per-parameter overhead, augmented with additional architectural features that mirror the known structure of optimization tasks. We also develop a meta-training ensemble of small, diverse, optimization tasks capturing common properties of loss landscapes. The optimizer learns to outperform RMSProp/ADAM on problems in this corpus. More importantly, it performs comparably or better when applied to small convolutional neural networks, despite seeing no neural networks in its meta-training set. Finally, it generalizes to train Inception V3 and ResNet V2 architectures on the ImageNet dataset for thousands of steps, optimization problems that are of a vastly different scale than those it was trained on.

APA


Wichrowska, O., Maheswaranathan, N., Hoffman, M.W., Colmenarejo, S.G., Denil, M., de Freitas, N. & Sohl-Dickstein, J. (2017). Learned Optimizers that Scale and Generalize. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:3751-3760. Available from https://proceedings.mlr.press/v70/wichrowska17a.html.
