Neural Taylor Approximations: Convergence and Exploration in Rectifier Networks

David Balduzzi, Brian McWilliams, Tony Butler-Yeoman
Proceedings of the 34th International Conference on Machine Learning, PMLR 70:351-360, 2017.

Abstract

Modern convolutional networks, incorporating rectifiers and max-pooling, are neither smooth nor convex; standard guarantees therefore do not apply. Nevertheless, methods from convex optimization such as gradient descent and Adam are widely used as building blocks for deep learning algorithms. This paper provides the first convergence guarantee applicable to modern convnets, which furthermore matches a lower bound for convex nonsmooth functions. The key technical tool is the neural Taylor approximation – a straightforward application of Taylor expansions to neural networks – and the associated Taylor loss. Experiments on a range of optimizers, layers, and tasks provide evidence that the analysis accurately captures the dynamics of neural optimization. The second half of the paper applies the Taylor approximation to isolate the main difficulty in training rectifier nets – that gradients are shattered – and investigates the hypothesis that, by exploring the space of activation configurations more thoroughly, adaptive optimizers such as RMSProp and Adam are able to converge to better solutions.
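
For readers skimming the abstract, a minimal sketch of the kind of first-order Taylor expansion being referred to may help. The notation below (current iterate $w_t$, loss $\ell$) is an assumption chosen for illustration and is not the paper's exact per-layer construction:

\ell^{\mathrm{Taylor}}_t(w) \;=\; \ell(w_t) \;+\; \big\langle \nabla \ell(w_t),\, w - w_t \big\rangle.

A surrogate of this form is affine, hence convex, in $w$, which is what makes it possible to apply convex-optimization tools to a sequence of such losses even though $\ell$ itself is neither smooth nor convex.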

Cite this Paper


BibTeX
@InProceedings{pmlr-v70-balduzzi17c,
  title     = {Neural Taylor Approximations: Convergence and Exploration in Rectifier Networks},
  author    = {David Balduzzi and Brian McWilliams and Tony Butler-Yeoman},
  booktitle = {Proceedings of the 34th International Conference on Machine Learning},
  pages     = {351--360},
  year      = {2017},
  editor    = {Precup, Doina and Teh, Yee Whye},
  volume    = {70},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--11 Aug},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v70/balduzzi17c/balduzzi17c.pdf},
  url       = {https://proceedings.mlr.press/v70/balduzzi17c.html}
}
Endnote
%0 Conference Paper
%T Neural Taylor Approximations: Convergence and Exploration in Rectifier Networks
%A David Balduzzi
%A Brian McWilliams
%A Tony Butler-Yeoman
%B Proceedings of the 34th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Doina Precup
%E Yee Whye Teh
%F pmlr-v70-balduzzi17c
%I PMLR
%P 351--360
%U https://proceedings.mlr.press/v70/balduzzi17c.html
%V 70
APA
Balduzzi, D., McWilliams, B. & Butler-Yeoman, T. (2017). Neural Taylor Approximations: Convergence and Exploration in Rectifier Networks. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:351-360. Available from https://proceedings.mlr.press/v70/balduzzi17c.html.