Learning by Turning: Neural Architecture Aware Optimisation

Yang Liu; Jeremy Bernstein; Markus Meister; Yisong Yue

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:6748-6758, 2021.

Abstract

Descent methods for deep networks are notoriously capricious: they require careful tuning of step size, momentum and weight decay, and which method will work best on a new benchmark is a priori unclear. To address this problem, this paper conducts a combined study of neural architecture and optimisation, leading to a new optimiser called Nero: the neuronal rotator. Nero trains reliably without momentum or weight decay, works in situations where Adam and SGD fail, and requires little to no learning rate tuning. Also, Nero’s memory footprint is square root that of Adam or LAMB. Nero combines two ideas: (1) projected gradient descent over the space of balanced networks; (2) neuron-specific updates, where the step size sets the angle through which each neuron’s hyperplane turns. The paper concludes by discussing how this geometric connection between architecture and optimisation may impact theories of generalisation in deep learning.
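The two ideas named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes that a "balanced" layer is one where each neuron's weight vector (one row of `W`) has zero mean and unit Euclidean norm, and it normalises each neuron's gradient by its instantaneous norm instead of any running statistic. The names `project_balanced` and `rotate_step` are hypothetical. Under these assumptions, a step size `lr < 1` turns each neuron's weight vector by at most `arcsin(lr)` radians.

```python
import numpy as np


def project_balanced(W, eps=1e-12):
    """Project each row (one neuron) to zero mean and unit Euclidean norm.

    Assumption: this is what "balanced" means; the abstract does not define it.
    """
    W = W - W.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.maximum(norms, eps)


def rotate_step(W, G, lr=0.01, eps=1e-12):
    """One per-neuron update: step along the normalised gradient, then project.

    Each row of G is rescaled to the norm of the matching row of W, so lr
    controls the relative (angular) change of every neuron independently.
    """
    g_norm = np.linalg.norm(G, axis=1, keepdims=True)
    w_norm = np.linalg.norm(W, axis=1, keepdims=True)
    W_new = W - lr * w_norm * G / np.maximum(g_norm, eps)
    return project_balanced(W_new)


# Toy usage: 4 neurons with 16 inputs each and a random stand-in gradient.
rng = np.random.default_rng(0)
W = project_balanced(rng.normal(size=(4, 16)))
G = rng.normal(size=(4, 16))
W_next = rotate_step(W, G, lr=0.1)

# Angle (radians) through which each neuron's weight vector turned.
cos = np.sum(W * W_next, axis=1)
angles = np.arccos(np.clip(cos, -1.0, 1.0))
```

Because the mean subtraction is an orthogonal projection onto a subspace that already contains each row of `W`, it can only shorten the step, which is why the per-neuron angle stays below `arcsin(lr)` in this sketch.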

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-liu21c,
  title     = {Learning by Turning: Neural Architecture Aware Optimisation},
  author    = {Liu, Yang and Bernstein, Jeremy and Meister, Markus and Yue, Yisong},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {6748--6758},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/liu21c/liu21c.pdf},
  url       = {https://proceedings.mlr.press/v139/liu21c.html},
  abstract  = {Descent methods for deep networks are notoriously capricious: they require careful tuning of step size, momentum and weight decay, and which method will work best on a new benchmark is a priori unclear. To address this problem, this paper conducts a combined study of neural architecture and optimisation, leading to a new optimiser called Nero: the neuronal rotator. Nero trains reliably without momentum or weight decay, works in situations where Adam and SGD fail, and requires little to no learning rate tuning. Also, Nero’s memory footprint is square root that of Adam or LAMB. Nero combines two ideas: (1) projected gradient descent over the space of balanced networks; (2) neuron-specific updates, where the step size sets the angle through which each neuron’s hyperplane turns. The paper concludes by discussing how this geometric connection between architecture and optimisation may impact theories of generalisation in deep learning.}
}

Endnote

%0 Conference Paper
%T Learning by Turning: Neural Architecture Aware Optimisation
%A Yang Liu
%A Jeremy Bernstein
%A Markus Meister
%A Yisong Yue
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-liu21c
%I PMLR
%P 6748--6758
%U https://proceedings.mlr.press/v139/liu21c.html
%V 139
%X Descent methods for deep networks are notoriously capricious: they require careful tuning of step size, momentum and weight decay, and which method will work best on a new benchmark is a priori unclear. To address this problem, this paper conducts a combined study of neural architecture and optimisation, leading to a new optimiser called Nero: the neuronal rotator. Nero trains reliably without momentum or weight decay, works in situations where Adam and SGD fail, and requires little to no learning rate tuning. Also, Nero’s memory footprint is square root that of Adam or LAMB. Nero combines two ideas: (1) projected gradient descent over the space of balanced networks; (2) neuron-specific updates, where the step size sets the angle through which each neuron’s hyperplane turns. The paper concludes by discussing how this geometric connection between architecture and optimisation may impact theories of generalisation in deep learning.

APA

Liu, Y., Bernstein, J., Meister, M., & Yue, Y. (2021). Learning by Turning: Neural Architecture Aware Optimisation. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:6748-6758. Available from https://proceedings.mlr.press/v139/liu21c.html.
